I have a .fasta file formatted like this:
>NC_045512.2 |Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT
CACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATC
TTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTT...
>MW326508.1 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/TX-DSHS-1443/2020 ORF1ab polyprotein (ORF1ab), ORF1a polyprotein (ORF1ab), surface glycoprotein (S), ORF3a protein (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6 protein (ORF6), ORF7a protein (ORF7a), ORF7b (ORF7b), ORF8 protein (ORF8), nucleocapsid phosphoprotein (N), and ORF10 protein (ORF10) genes, complete cds
CTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACT
CGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAG
GACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCG...
While searching the internet, I came across this link: Correct Way To Parse A Fasta File In Python
It looks like Biopython can be used to solve this. However, I doubt it will work, since the .fasta file has no sequence field, annotation, or similar marker to indicate which part of the input is the sequence and which part is the sequence ID. I guess I will write the code from scratch without using Biopython modules. Or are there any suggestions for this task?
A FASTA file is a plain-text file of sequence records. Each record starts with a header line beginning with the '>' character, which holds the sequence ID, and the sequence itself follows on the lines right after it, up to the next '>' line. Your example contains two sequences, NC_045512.2 and MW326508.1, whose header lines carry annotation after the '|' character. You haven't mentioned what exactly you'd like to do with those sequences, but that example should be perfectly parseable by any FASTA parser, Biopython included.
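To illustrate that the '>' lines alone are enough to separate IDs from sequences, here is a minimal sketch of a parser in plain Python. The names `parse_fasta` and `_make_record` are made up for this example, the sequences are shortened stand-ins rather than the real genomes, and stripping a leading '|' from the description is an assumption based on the headers shown in the question.

```python
from typing import Iterable, Iterator, List, Tuple


def _make_record(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    # The ID is the first whitespace-delimited token of the header line.
    parts = header.split(None, 1)
    seq_id = parts[0] if parts else ""
    # Assumption: the rest of the header is free-text annotation that may
    # start with '|', as in the question's example.
    description = parts[1].lstrip("|").strip() if len(parts) > 1 else ""
    return seq_id, description, "".join(chunks)


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (seq_id, description, sequence) for each FASTA record."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


# Shortened toy input modelled on the question's file (not the real data).
example = """>NC_045512.2 |Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ATTAAAGG
TTTATACC
>MW326508.1 |hypothetical short description
CTTTCGAT
"""

records = list(parse_fasta(example.splitlines()))
for seq_id, description, sequence in records:
    print(seq_id, len(sequence), description)
```

For a real file you could pass an open file handle instead of `example.splitlines()`. With Biopython, `SeqIO.parse(path, "fasta")` does the same job and yields records with `.id`, `.description` and `.seq` attributes, so writing your own parser is optional.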