I just started playing with the pyfasta Python module, which is described as a fast, memory-efficient, pythonic (and command-line) access to fasta sequence files.
It looks very promising for my work, but I am currently unable to use it for a very simple application: I want to process the sequences from a fasta file one after another, in the order in which they appear in the file. I have tried the following:
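import pyfasta

f = pyfasta.Fasta("coreg.fa")
ncar = 60  # number of characters per sequence line
with open("output.file", "w") as out_file:
    for header in f.keys():
        name = str(header)
        seq = str(f[header])
        # fasta header lines start with ">"
        out_file.write(">" + name + "\n")
        # write the sequence in lines of at most ncar characters
        while len(seq) > 0:
            out_file.write(seq[:ncar] + "\n")
            seq = seq[ncar:]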
This simple example just reads my input fasta file and writes the sequences back to an output fasta file. However, the original order is lost, since the headers are stored in a dictionary-based class (if I understood correctly), which does not preserve the order but allows faster lookups. This is not the behavior I am trying to achieve. So, my question is:
Is there a way to maintain the order from the original file?
(I do not want to iterate over sorted(f.keys()), since this does not give back the original order either.)
Many thanks!
What exactly do you need to do with the sequences? I have a module that creates a class for FASTA files and should be able to maintain order with some small modifications.
I have 20 different uses for these sequences, depending on the project... However, they all have in common that I want to process the sequences one at a time. The nice feature of pyfasta for me is that I don't have to load all the sequences into an object that resides in memory. For fasta files of several GB, that will matter. Would you post your alternative, @nuin? I would be interested to see it. Cheers!
Mine just puts all the sequences in memory. Send me an email and we can talk (nuin AT genedrift DOT org).
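For the requirement discussed above (sequences handled one at a time, in file order, without loading the whole file), here is a minimal sketch of a plain-Python alternative. It does not use pyfasta at all; the function names iter_fasta and write_wrapped are made up for this illustration, and the code assumes a well-formed fasta file in which every record starts with a ">" line.

```python
def iter_fasta(path):
    """Yield (header, sequence) pairs in the order they appear in the file.

    Only one record is held in memory at a time. The header is the text
    after ">" on the header line (an assumption of this sketch).
    """
    header = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header = line[1:]
                chunks = []
            elif header is not None and line:
                # Sequence line belonging to the current record.
                chunks.append(line)
    # Emit the last record of the file.
    if header is not None:
        yield header, "".join(chunks)


def write_wrapped(records, out_path, width=60):
    """Write (header, sequence) pairs as fasta, wrapping sequences at width."""
    with open(out_path, "w") as out_file:
        for header, seq in records:
            out_file.write(">" + header + "\n")
            for start in range(0, len(seq), width):
                out_file.write(seq[start:start + width] + "\n")
```

With these, write_wrapped(iter_fasta("coreg.fa"), "output.file") would rewrite the file in its original order. If the random access of pyfasta is still wanted, the headers yielded by a pass like this could probably be used as keys into the pyfasta.Fasta object, assuming its keys are the header text without the leading ">".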