Hi,
I have hundreds of FASTA files containing headers from different species.
I want only those files which contain sequences from certain species (Balsam, Fraser, Canaan).
A file looks like this:
>Balsam2432
agcttacttta....
>Canaan3432
ttagttataaa....
>Balsam2435
attgttatta...
>Fraser6776
gttatatta..
So far I have come up with some code, but it only gives me the result for one species in one file.
from Bio import SeqIO
handle = open("test.fasta", "rU")
organism_name1 = "Canaan"
organism_name2 = "Balsam"
organism_name3 = "Fraser"
for record in SeqIO.parse(handle, "fasta"):
    if organism_name1 or organism_name2 or organism_name1 in record.id:
        org = record.id, record.seq
        #org_seq= record.seq
        print org
        #print org_seq
        with open ("output_file", "w") as output_handle:
            SeqIO.write(org, output_handle,"fasta")
handle.close()
output_handle.close()
What I have is probably a novice approach, but I do get the sequences and IDs printed. However, I get an error when I write the information to a file with SeqIO.write.
I would like to create a directory of files which have the intended species sequence.
Please help me in this regard.
it should be something like this:
adeel.maliks20 : When asking questions of this type please specify if you wish to follow your own solution for a specific reason (if you don't want to give a reason, which would be fine as well) or if you are open to other means of getting to the desired end so people can suggest appropriate solutions. If it can be a combination of both then indicate that.
Thanks for the suggestion.
I am open to other solutions as well :)
Let me explain your problem...
So you are looping over the input fasta, and every time you find a sequence that matches you open the "output_file", erase everything in there and overwrite it with your latest hit. As such only one record will be in the file.
This could probably already be solved by changing the mode from "w" (writing) to "a" (appending).
So you want every species in a separate file? Then we'll have to add a bit more code to this.
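An alternative to append mode is to open the output once and hand SeqIO.write all matching records in a single call. Below is a minimal sketch, assuming the species name appears as a substring of the record ID; the function name filter_records and the paths are made up for illustration. Two things it does differently from the posted code: `a or b or c in record.id` is always true in Python because a non-empty string is truthy, so the sketch uses any() instead, and SeqIO.write expects SeqRecord objects rather than an (id, seq) tuple, which is likely where the error comes from.

```python
from Bio import SeqIO

SPECIES = ("Balsam", "Fraser", "Canaan")


def filter_records(in_path, out_path, species=SPECIES):
    """Write records whose ID contains any species name; return the count.

    filter_records is a hypothetical helper name, not part of Biopython.
    """
    # Lazily yield only the records whose ID mentions one of the species.
    matches = (
        record
        for record in SeqIO.parse(in_path, "fasta")
        if any(name in record.id for name in species)
    )
    # The output file is opened once, so earlier hits are not overwritten.
    with open(out_path, "w") as out_handle:
        return SeqIO.write(matches, out_handle, "fasta")


# Example call (paths are placeholders):
# n_written = filter_records("test.fasta", "output_file.fasta")
```

SeqIO.write returns the number of records written, which is a quick way to check whether a file contained any of the species at all.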
Further comments:
- It's better to put with open ("output_file", "w") as output_handle: outside of your loop and just open the file once. Now you open the file every time you get a match, which is not efficient.
- When using the with open... syntax there is no need to call output_handle.close(). The with statement takes care of opening and closing.
- SeqIO.parse() also takes a filename as input, so you could also use SeqIO.parse("test.fasta", "fasta") without bothering about opening and closing the file.
- See also C: FASTA File Parsing with many species for nicer syntax of the if statement for filtering the fasta input.
Always state the exact error message. Also, why use Python when a simple awk or a series of greps would suffice? bioawk is preferable here, given awk can handle the pattern matching and bioawk can parse the file for you.
Thanks for the suggestions.
I am trying to learn Python, so that's why I thought I'd use Biopython for this task!
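Since the goal is to learn Python, here is a rough sketch of the "directory of files" step from the question. It only looks at header lines (those starting with ">"), so it uses plain Python; SeqIO.parse could be swapped in if the sequences themselves are needed. Assumptions that are not in the original post: the input files end in .fasta, a file qualifies if it contains at least one of the three species, and the names species_in_fasta, collect_files and the directory paths are invented for illustration.

```python
import shutil
from pathlib import Path

SPECIES = ("Balsam", "Fraser", "Canaan")


def species_in_fasta(path, species=SPECIES):
    """Return the set of species names found in the FASTA header lines."""
    found = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                found.update(name for name in species if name in line)
    return found


def collect_files(in_dir, out_dir, species=SPECIES, pattern="*.fasta"):
    """Copy every matching file in in_dir that contains at least one species.

    Returns the names of the copied files in sorted order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(in_dir).glob(pattern)):
        if species_in_fasta(path, species):
            shutil.copy(path, out_dir / path.name)
            copied.append(path.name)
    return copied


# Example call (directory names are placeholders):
# collect_files("all_fastas", "selected_fastas")
```

If a file should only qualify when all three species are present, the check could compare the returned set against set(species) instead of testing whether it is non-empty.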