Question

python: Parsing fasta file

0

8.2 years ago

Am.A ▴ 20

Hi all

How can I parse a FASTA file to get each gene's location (i.e. the start and end coordinates of the gene)?

>lcl|NC_000913.3_cds_NP_414542.1_1 [gene=thrL] [protein=thr operon leader peptide] [protein_id=NP_414542.1] [location=190..255]
ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA

>lcl|NC_000913.3_cds_NP_414547.1_6 [gene=yaaA] [protein=peroxide resistance protein, lowers intracellular iron] [protein_id=NP_414547.1] [location=complement(5683..6459)]
ATGCTGATTCTTATTTCACCTGCGAAAACGCTTGATTACCAAAGCCCGTTGACCACCACGCGCTATACGC
TGCCGGAGCTGTTAGACAATTCCCAGCAGTTGATCCATGAGGCGCGGAAACTGACGCCTCCGCAGATTAG

gene • 6.7k views

ADD COMMENT • link updated 8.2 years ago by second_exon ▴ 210 • written 8.2 years ago by Am.A ▴ 20

5

You can find exactly what you need in a previous question:
Correct Way To Parse A Fasta File In Python

bonus

read this

https://github.com/mdshw5/pyfaidx

ADD REPLY • link 8.2 years ago by Medhat 9.8k

3

Okay, you have my permission to do so.

But what is the question? Have you tried googling?

ADD REPLY • link 8.2 years ago by WouterDeCoster 47k

0

But you don't have my permission to give OP permission :-)

Unless OP edited the question after you wrote your comment, it does appear to have a reasonably clear description. On a serious note, can we have more of what @Medhat did and fewer of these comments?

ADD REPLY • link 8.2 years ago by GenoMax 147k

0

Indeed, the post was edited; it didn't contain a question at all when I posted my comment asking what the question was. I realize that my reply, read against the edited original post, makes me look like a douche.

ADD REPLY • link 8.2 years ago by WouterDeCoster 47k

0

@Am.a: It generally helps to be explicit about the output you want when you write the original post. For example, in this case, do you only need the following?

thrL       190..255
yaaA     5683..6459
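
A table like the one above could be produced with the standard library alone. This is a minimal sketch, assuming every header carries `[gene=...]` and `[location=...]` tags as in the question; the names `GENE_RE`, `COORDS_RE` and `gene_locations` are made up for illustration, and the example lines are copied from the question.

```python
import re

# Hypothetical patterns for the NCBI-style CDS headers shown in the question.
GENE_RE = re.compile(r"\[gene=([^\]]+)\]")
# Skips any wrapper such as "complement(" and captures the first "start..end" pair.
COORDS_RE = re.compile(r"\[location=[^\]]*?(\d+\.\.\d+)")


def gene_locations(lines):
    """Yield (gene, 'start..end') for each header line that has both tags."""
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        gene = GENE_RE.search(line)
        coords = COORDS_RE.search(line)
        if gene and coords:
            yield gene.group(1), coords.group(1)


# Example lines taken from the question; a real run would pass an open file instead.
lines = [
    ">lcl|NC_000913.3_cds_NP_414542.1_1 [gene=thrL] [protein=thr operon leader peptide] [protein_id=NP_414542.1] [location=190..255]",
    "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA",
    ">lcl|NC_000913.3_cds_NP_414547.1_6 [gene=yaaA] [protein=peroxide resistance protein, lowers intracellular iron] [protein_id=NP_414547.1] [location=complement(5683..6459)]",
]

for gene, coords in gene_locations(lines):
    print(f"{gene}\t{coords}")
```

Because `gene_locations` accepts any iterable of lines, it can be fed `open("seq.fa")` directly without loading the whole file into memory.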

ADD REPLY • link 8.2 years ago by GenoMax 147k

score 2 · Answer 1 · 2016-09-05

2

8.2 years ago

second_exon ▴ 210

If I understood your question correctly, this Python 3.x solution might help you:

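import re

# str.strip() removes a *set of characters*, not a prefix, so match the coordinates explicitly
location_re = re.compile(r"\[location=\D*?(\d+\.\.\d+)")

with open("seq.fa") as f:
    for line in f:
        if line.startswith('>'):
            match = location_re.search(line)
            if match:  # skip headers without a start..end location
                print(": ".join([line.split()[0], match.group(1)]))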

Output:

>lcl|NC_000913.3_cds_NP_414542.1_1: 190..255
>lcl|NC_000913.3_cds_NP_414547.1_6: 5683..6459
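
Note that the output above drops the `complement(...)` wrapper, so the strand information is lost. If the start and end are needed as numbers, something like the following might work. It is only a sketch: `parse_location` is a hypothetical helper, and it assumes a single `start..end` span (no `join(...)` locations).

```python
import re


def parse_location(location):
    """Return (start, end, strand) for a simple location string.

    Handles '190..255' and 'complement(5683..6459)'; assumes one start..end span.
    """
    match = re.search(r"(\d+)\.\.(\d+)", location)
    if match is None:
        raise ValueError(f"no start..end pair in {location!r}")
    # In these headers a complement(...) wrapper marks the reverse strand.
    strand = "-" if location.startswith("complement(") else "+"
    return int(match.group(1)), int(match.group(2)), strand


for text in ("190..255", "complement(5683..6459)"):
    start, end, strand = parse_location(text)
    # Coordinates are 1-based and inclusive, so the length is end - start + 1.
    print(start, end, strand, end - start + 1)
```

For thrL this gives a length of 66, which matches the 66 bases of sequence shown in the question.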

ADD COMMENT • link 8.2 years ago by second_exon ▴ 210

1

Don't write a parser if it already exists... in this case the answer is SeqIO from Biopython
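
For reference, a rough sketch of what that could look like. `SeqIO.parse(path, "fasta")` handles the file reading and exposes the full header as `record.description`, but the `[key=value]` tags still have to be pulled out by hand; `TAG_RE`, `header_tags` and `gene_locations` are illustrative names, not part of Biopython.

```python
import re

# Hypothetical pattern for the "[key=value]" tags in the headers from the question.
TAG_RE = re.compile(r"\[(\w+)=([^\]]*)\]")


def header_tags(description):
    """Turn the '[key=value]' tags of a FASTA description into a dict."""
    return dict(TAG_RE.findall(description))


def gene_locations(fasta_path):
    """Yield (record id, gene, location) for each record in the file."""
    from Bio import SeqIO  # third-party; requires Biopython to be installed

    for record in SeqIO.parse(fasta_path, "fasta"):
        tags = header_tags(record.description)
        yield record.id, tags.get("gene"), tags.get("location")
```

Usage would be along the lines of `for rec_id, gene, loc in gene_locations("seq.fa"): print(rec_id, gene, loc)`, where the location comes back as the raw string, e.g. `complement(5683..6459)`.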

ADD REPLY • link 8.2 years ago by WouterDeCoster 47k