Question

Help me with this python code

0

Entering edit mode

2.7 years ago

arr234 ▴ 40

I have a list of gene sequences with their accession ids in fasta format in one file ("ls_orchid.fasta"). I also have a list of needed accession ids ("ids.txt"). Now I want to collect the gene sequences only for the needed accession numbers i,e for those present in ids.txt.

from Bio import SeqIO
    for seq_record in SeqIO.parse("ls_orchid.fasta","fasta"):
        f=open('ids.txt','r')
        lines=f.readlines()
        for i in lines:
            if (str(i) in seq_record.id) == True:
                print(">"+seq_record.id+"\n"+seq_record.seq)

The above code gives only the fasta file of the last accession number. Please help me resolve this issue.

biopython python • 1.2k views

ADD COMMENT • link 2.7 years ago by arr234 ▴ 40

score 1 · Answer 1 · 2022-03-31

1

Entering edit mode

2.7 years ago

Michael 55k

The point is you have to turn your logic inside out, so iterate over SeqIO records and check if they are in the filter list. Also, as it is now, you are repeatedly parsing the ids.txt which is redundant. But in fact, you should not need a loop at all, you could use a filter, maybe. The following code cuts the task short (possibly in a way a Perl programmer would do it being fond of map and grep functions ;):

from Bio import SeqIO
seqs = SeqIO.parse('test.fasta', 'fasta')
needed = ['abc'] # just an example add your file parsing e.g. needed = open('ids.txt','r').readlines()
for s in list(filter(lambda x: x.id in needed, seqs)): print ('>'+s.id+"\n"+s.seq)

test.fasta:

>abc
abcdefghijklmn
>abd
abddefghijklmn
>this may break
the_quick_brown_fox_jumps_over_the_lazy_dog

ADD COMMENT • link 2.7 years ago by Michael 55k

0

Entering edit mode

I understood the code. I tried to run it by modifying with the file names I have got. But I am not getting any output. It is just blank. Why is it so? Can you please help me?

ADD REPLY • link 2.7 years ago by arr234 ▴ 40

0

Entering edit mode

AttributeError: 'FastaIterator' object has no attribute 'id'

ADD REPLY • link 2.7 years ago by arr234 ▴ 40

0

Entering edit mode

Could you please at first try the following. Copy my code exactly without any adaptation to your local files, save the little test.fasta file as given above, and then run the python code. I should give:

>abc
abcdefghijklmn

I suspect a minor difference in your code. If you still cannot run your code, please post the exact code you used, and your Python and Biopython version. Mine is Python 3.8.5 | packaged by conda-forge | (default, Sep 24 2020, 16:37:41)

ADD REPLY • link 2.7 years ago by Michael 55k

0

Entering edit mode

Okay. I copy pasted the exact code and I got the exact output as mentioned by you. But it works only if the ids.txt has only one id say abc. If i give abc and abd, I am not getting the fasta sequence for for the ids instead I am getting the fasta sequence for the last id in the ids.txt i.e., in this case only for abd. But I want the fasta sequence of abc and abd. How to get that?

ADD REPLY • link 2.7 years ago by arr234 ▴ 40

0

Entering edit mode

Ah, I see, sorry. I did not test the id parsing code because it was in a comment. But readlines retains the linebreak "\n" so that's why it doesn't give any output.

Here is the code with id parsing, try this:

from Bio import SeqIO
seqs = SeqIO.parse('test.fasta', 'fasta')
needed = open('ids.txt','r').read().splitlines()
for s in list(filter(lambda x: x.id in needed, seqs)): print ('>'+s.id+"\n"+s.seq)