Question

How to parse a .fasta file in Python?

4.3 years ago

2001linana ▴ 40

I have a .fasta file that is formatted like this:

>NC_045512.2 |Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT
CACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATC
TTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTT...
>MW326508.1 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/TX-DSHS-1443/2020 ORF1ab polyprotein (ORF1ab), ORF1a polyprotein (ORF1ab), surface glycoprotein (S), ORF3a protein (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6 protein (ORF6), ORF7a protein (ORF7a), ORF7b (ORF7b), ORF8 protein (ORF8), nucleocapsid phosphoprotein (N), and ORF10 protein (ORF10) genes, complete cds
CTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACT
CGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAG
GACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCG...

While searching the internet, I came across this link: Correct Way To Parse A Fasta File In Python

It looks like Biopython can be used to solve this. Yet I doubt it, since the .fasta file has no sequence field, annotation, etc. to indicate which part of the input file is the sequence and which part is the sequence ID. I guess I will write the code from scratch without using Biopython modules. Or are there any suggestions for this task?

sequence fasta python • 3.7k views

updated 4.3 years ago by trausch ★ 1.9k • written 4.3 years ago by 2001linana ▴ 40

A FASTA file is a text file containing sequence information: each sequence ID sits on a line starting with the '>' character, and the sequence itself follows on the lines right after that ID line. In your example, you have 2 sequences, NC_045512.2 and MW326508.1, which use the ID line to provide annotation right after the '|' character. Although you haven't mentioned what exactly you'd like to do with those sequences, that example should be perfectly parseable by any FASTA parser.
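To make that layout concrete, here is a minimal from-scratch sketch in plain Python. The function name `parse_fasta` and the toy records are made up for illustration (they are shortened stand-ins, not the real sequences from the question), and the sketch assumes that every line starting with '>' opens a new record.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy input in the same shape as the file in the question (hypothetical data).
example = """>seq1 |first toy record
ATTA
AAGG
>seq2 |second toy record
CTTT
"""

records = list(parse_fasta(example.splitlines()))
for header, sequence in records:
    print(header, len(sequence))
```

For a file on disk, the same function can be fed an open file handle, for example `with open(path) as handle: ...`, since a text file iterates line by line.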

4.3 years ago by Jorge Amigo 14k

score 3 · Answer 1 · 2021-01-04

Biopython is most assuredly the 'right' way to parse a file like this for simple applications.

Once you have parsed a file with SeqIO.parse(...), you can access each record's ID with the object.id attribute. This is the string following the > up to the first whitespace. Alternatively, you can use the object.description attribute to access the full header string after the >.

If you need to do any more complicated parsing of the headers, you need to do this yourself by applying string manipulation operations to the object.description.
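As a rough sketch of what that could look like for headers like the ones in the question: `split_description` and `iter_records` are hypothetical helper names, and splitting at the first '|' is an assumption that fits this particular file rather than a general FASTA rule. `SeqIO.parse(path, "fasta")`, `record.id`, `record.description` and `record.seq` are the Biopython pieces mentioned above.

```python
from typing import Iterator, Tuple


def split_description(description: str, sep: str = "|") -> Tuple[str, str]:
    """Split a full header string into (id, annotation) at the first separator."""
    seq_id, _, annotation = description.partition(sep)
    return seq_id.strip(), annotation.strip()


def iter_records(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (id, annotation, sequence) for each record in a FASTA file."""
    # Imported here so split_description stays usable without Biopython installed.
    from Bio import SeqIO

    for record in SeqIO.parse(path, "fasta"):
        # record.description holds the whole header line after the '>'.
        _, annotation = split_description(record.description)
        yield record.id, annotation, str(record.seq)


# The first header from the question, without the leading '>'.
header = (
    "NC_045512.2 |Severe acute respiratory syndrome coronavirus 2 "
    "isolate Wuhan-Hu-1, complete genome"
)
print(split_description(header))
```

Calling `iter_records("some_file.fasta")` (the path is a placeholder) would then give the ID, the annotation after the '|', and the sequence as a plain string for each record.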

score 1 · Answer 2 · 2021-01-04

Try Biopython. From memory, it uses delimiters such as spaces to separate the ID from the name or annotation fields. It might be helpful to use sed or a similar tool on Linux to modify the FASTA headers so that the ID and name "stick together", depending on what you want to do.
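If you would rather stay in Python than reach for sed, the same idea can be sketched as below. The function name `join_header_fields` and the underscore joiner are arbitrary choices for illustration; the sketch only rewrites lines that start with '>' and passes sequence lines through untouched.

```python
import re
from typing import Iterable, Iterator


def join_header_fields(lines: Iterable[str], joiner: str = "_") -> Iterator[str]:
    """Replace whitespace runs in header lines so each header is a single token."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Collapse every run of whitespace in the header into the joiner.
            yield ">" + re.sub(r"\s+", joiner, line[1:].strip())
        else:
            yield line


# Hypothetical toy input, not the real records from the question.
toy = [">MW326508.1 |toy annotation text\n", "CTTTCGAT\n"]
fixed = list(join_header_fields(toy))
print(fixed)
```

After this rewrite, a parser that cuts the ID at the first whitespace would see the whole header as the ID.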

You are right, though: FASTA headers are notoriously unstructured.

score 0 · Answer 3 · 2021-01-04

I'll link to my answer from almost 7 years ago. Use Biopython (pure-Python iterative parsing), pyfaidx (pure-Python file-offset-based parsing), or pyfastx (C/Python file-offset-based parsing). I can vouch for the first two methods; I haven't used pyfastx, though it looks like a good implementation, especially if you need to index FASTQ files as well.
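To illustrate what "file-offset-based" means, here is a toy sketch using only the standard library. It is not how pyfaidx or pyfastx are implemented, and `build_index` and `fetch` are hypothetical names: the idea is simply to record where each record starts once, then seek straight to a record instead of re-reading the whole file.

```python
import os
import tempfile
from typing import Dict


def build_index(path: str) -> Dict[str, int]:
    """Map each record ID to the byte offset of its header line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    # The ID is the header text up to the first whitespace.
                    index[fields[0].decode()] = offset
    return index


def fetch(path: str, index: Dict[str, int], seq_id: str) -> str:
    """Jump to one record and read only its sequence lines."""
    chunks = []
    with open(path, "rb") as handle:
        handle.seek(index[seq_id])
        handle.readline()  # skip the header line itself
        for line in handle:
            if line.startswith(b">"):
                break  # the next record begins here
            chunks.append(line.strip())
    return b"".join(chunks).decode()


# Write a small hypothetical FASTA file to demonstrate the lookup.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1 |first toy record\nATTA\nAAGG\n>seq2 |second toy record\nCTTT\n")
    demo_path = tmp.name

demo_index = build_index(demo_path)
first = fetch(demo_path, demo_index, "seq1")
second = fetch(demo_path, demo_index, "seq2")
os.remove(demo_path)
print(first, second)
```

The index is built in a single pass; every later lookup only touches the bytes of the requested record, which is what makes this style attractive for large files.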

score 0 · Answer 4 · 2021-01-04

4.3 years ago

trausch ★ 1.9k

readfq supports reading FASTA and FASTQ in various programming languages (including Python).
