FASTA File Parsing with many species
1
1
Entering edit mode
7.7 years ago

Hi,

I have hundreds of fasta files containing headers of different species.

I want only those files which have species (Balsam,Fraser, Canaan) sequence in the file.

File looks like this:

>Balsam2432
agcttacttta....
>Canaan3432
ttagttataaa....
>Balsam2435
attgttatta...
>Fraser6776
gttatatta..

So far I come up with a code but it gives me the result for one specie for one file.

from Bio import SeqIO
handle = open("test.fasta", "rU")
organism_name1 = "Canaan"
organism_name2 = "Balsam"
organism_name3 = "Fraser"
for record in SeqIO.parse(handle, "fasta"):
    if organism_name1 or organism_name2 or organism_name1 in record.id:
        org = record.id, record.seq
        #org_seq= record.seq
        print org
        #print org_seq
    with open ("output_file", "w") as output_handle:
        SeqIO.write(org, output_handle,"fasta")


handle.close()
output_handle.close()

What i have is probably the novice approach but i get the sequences and ids printed. However, I get error when i write the information in a file with SeqIO.write

I would like to create a directory of files which have the intended species sequence.

Please help me in this regard.

sequence fasta python biopython • 2.6k views
ADD COMMENT
1
Entering edit mode

it should be something like this:

organisms = ["Canaan", "Balsam", "Fraser"]


for record in SeqIO.parse(handle, "fasta"):
    if any([record.id.startswith(org) for org in organisms]):
        ...
ADD REPLY
1
Entering edit mode

adeel.maliks20 : When asking questions of this type please specify if you wish to follow your own solution for a specific reason (if you don't want to give a reason, which would be fine as well) or if you are open to other means of getting to the desired end so people can suggest appropriate solutions. If it can be a combination of both then indicate that.

ADD REPLY
0
Entering edit mode

Thanks for the suggestion.

I am open to other solutions as well :)

ADD REPLY
1
Entering edit mode

Let me explain your problem...
So you are looping over the input fasta, and every time you find a sequence that matches you open the "output_file", erase everything in there and overwrite it with your latest hit. As such only one record will be in the file.

Probably, this could already be solved by changing the mode from "w" (writing) to "a" (appending).

I would like to create a directory of files which have the intended species sequence.

So you want every species in a separate file? Then we'll have to add a bit more code to this.

Further comments:
- It's better to put with open ("output_file", "w") as output_handle: outside of your loop and just open the file once. Know you open the file everytime you get a match, which is not efficient. - When using the with open... synthax there is no need to do output_handle.close(). The with statement takes care of opening and closing. - SeqIO.parse() also takes a filename as input, so you could also use SeqIO.parse("test.fasta", "fasta") without bothering about opening and closing the file - See also C: FASTA File Parsing with many species for nicer synthax of the if statement for filtering the fasta input

ADD REPLY
0
Entering edit mode

However, I get error when i write the information in a file with SeqIO.write

Always state the exact error message. Also, why use Python when a simple awk or a series of greps would suffice? bioawk is preferable here, given awk can handle the pattern matching and bioawk can parse the file for you.

ADD REPLY
0
Entering edit mode

Thanks for suggestions.

I am trying to learn Python so that's why i thought to use Biopython for this task!

ADD REPLY
1
Entering edit mode
7.7 years ago

try seqkit

seqkit grep -i -r -p Balsam -p Fraser -p Canaan seqs.fa > results.fa
ADD COMMENT
0
Entering edit mode

Again, stop giving out exact commands.

EDIT: The reason I'm curt above - I posted this literally minutes before your answer: C: Question about searching for/manipulating FASTA sequences in Biopython

ADD REPLY
0
Entering edit mode

I gave python code advise too.

Beginner indeed should learn to solve problems and we can help them find the bugs.

Giving them an instant solution is also a way to help them, it at least saves time. It's OP's choice to decide which way he/she'd like to go.

ADD REPLY
0
Entering edit mode

Giving them an instant solution is not a way of helping them - they should be spending time learning tools, and we should be spending time explaining concepts, not exact syntax.

ADD REPLY
0
Entering edit mode

Thanks for you comments. I get it. I'll change my way.

ADD REPLY
0
Entering edit mode

The point is, the more we give folks commands, the more they expect we keep doing that. We deserve better, and so do they :)

Your tools are really great, and you fit solutions using them nicely to a lot of questions, which is really cool. I'd recommend doing something along the lines of "Try seqkit grep to pick sequences based on a pattern match on the identifier, followed by a seqkit seq to customize output formatting"

I picked that up from your other answer. Lesser effort for you, more to explore for them - which will translate to them reporting more bugs, and them learning more. Win-win all the way!

ADD REPLY
0
Entering edit mode

You'are right! Thanks for your advice.

ADD REPLY
0
Entering edit mode

I'm not the only one who does this way. This's a common way people here to help OP to solve problems.

ADD REPLY
0
Entering edit mode

With all due respect, just because many do it doesn't make it right. I am strongly against copy-paste ready commands, unless they are targeted to a layman user for a complicated problem, such as modifying OS X preferences through the command line for the amateur Mac user.

ADD REPLY
0
Entering edit mode

My apologize. Thanks for you comments. I'll change my way.

ADD REPLY

Login before adding your answer.

Traffic: 1226 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6