I have a fasta file containning PapillomaViruses sequences (entire genomes, partial CDS, ....) and I'm using biopython to retrieve entire genomes (around 7kb) from this files, so here's my code:
rec_dict = SeqIO.index("hpv_id_name_all.fasta","fasta")
for k in rec_dict.keys():
c=c+1
if len(rec_dict[k].seq)>7000:
handle=open(rec_dict[k].description+"_"+str(len(rec_dict[k].seq))+".fasta","w")
handle.write(">"+rec_dict[k].description+"\n"+str(rec_dict[k].seq)+"\n")
handle.close()
I'm using a dictionary for avoiding loading everything in memory. The variable "c" is used to know how many iterations are made before THIS error pops up:
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
IOError: [Errno 2] No such file or directory: 'EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta'
When I print the value of "c", I get 9013 while the file contains 10447 sequences, meaning the for loop didn't go through all the sequences (the count is done before the "if" condition, so the I count all the iterations, not only those which match the condition). I don't understand the INPUT/OUTPUT error, it should create the 'EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta' file instead of verifying its existence, shouldn't it?
You shouldn't use the
SeqIO
index dictionary like this - every time you access a record withrec_dict[k]
then it gets parsed from disk again, which is comparatively slow. Also, by accessing the records from the file in essentially random order (the key order in the dictionary) you are again going to slow down the disk IO. The solution is much simpler - just iterate of the file in one parse.Also you are not using Biopython for the FASTA output, which is sometimes appropriate.
+1 for FASTA line wrapping.
Are you sure? because doing what you've proposed is exactly what i was trying to avoid by using the dictionary. i thought iterate through the file was slower than indexing it and accessing only the record i needed by their key.
In order to get the length of each sequence, you have to look at the entire record for that sequence. So yes, a single parse though the file is the simplest and fastest solution here.
May I add that I've checked for the sequence from which the file should be created (in my code each sequence is written in a fasta file named after the sequence's id in the original fasta file(
hpv_id_name_all.fasta
, see code)). And this sequence ('EU410347.1|Human papillomavirus FA75/KI88-03_7401'
) has no particularity that could explain the IOErrorThis is a duplicate of a question on another site http://stackoverflow.com/questions/34015094/ioerror-while-retrieving-sequences-from-fasta-file-using-biopython - please don't do this (unless you include the link to the other question) as you are wasting volunteers' time helping you if someone else has already answered the question elsewhere.