Question

Iterating Through Fasta Sequences With Pyfasta Python Module

3

15.3 years ago

Eric Normandeau 11k

I just started playing with the pyfasta Python module, which is described as providing fast, memory-efficient, pythonic (and command-line) access to fasta sequence files.

It looks very promising for my work, but I am currently unable to use it for a very simple application, namely: I want to process the sequences from a fasta file sequentially, respecting the order in which they appear in the file. I have tried the following:

import pyfasta

f = pyfasta.Fasta("coreg.fa")
ncar = 60

with open("output.file", "w") as out_file:
    for header in f.keys():
        name = str(header)
        seq = str(f[header])
        out_file.write(name + "\n")
        while len(seq) > 0:
            out_file.write(seq[:ncar] + "\n")
            seq = seq[ncar:]

This simple example only reads the sequences from my input fasta file and writes them back to an output fasta file. However, the initial order is lost, since the headers are stored in a dictionary-based class (if I understood correctly), which changes the order but permits faster lookups. This is not exactly the behavior I am trying to achieve. So, my question is:

Is there a way to maintain the order from the original file?

(I do not want to iterate on sorted(f.keys()) since this does not give back the original order either.)

Many thanks!

python fasta • 8.0k views

ADD COMMENT • link updated 21 months ago by Ram 45k • written 15.3 years ago by Eric Normandeau 11k

0

What exactly do you need to do with the sequences? I have a module that creates a class for FASTA files and should be able to maintain order with some small modifications.

ADD REPLY • link 15.3 years ago by Paulo Nuin ★ 3.7k

0

I have 20 different uses for these sequences, depending on the project... However, they all have in common that I want to process the sequences one at a time. The nice feature of pyfasta for me is that I don't have to put all the sequences in an object that has to reside in memory. For multi-gigabyte fasta files, that will matter. Would you post your alternative @nuin? I am interested to see it. Cheers!

ADD REPLY • link 15.3 years ago by Eric Normandeau 11k

0

Mine just puts all the sequences in memory. Send me an email and we can talk (nuin AT genedrift DOT org).

ADD REPLY • link 15.3 years ago by Paulo Nuin ★ 3.7k

Ram · Answer 1 · 2010-04-13

3

15.3 years ago

brentp 24k

That's a use case I never thought of, but, as it happens, you can do that with the index. Mostly, you don't need to know about the index, but since it saves the file position of each sequence, you can use that to order your chromosomes:

import pyfasta

f = pyfasta.Fasta('some.fasta')
# sort the headers by the start position stored in the index
sorted_keys = [x[0] for x in sorted(f.index.items(), key=lambda a: a[1][0])]

That doesn't re-parse the file. And if you're doing that frequently, you could use a subclass:

class Fasta(pyfasta.Fasta):
    def sorted_keys(self):
        return [x[0] for x in sorted(self.index.items(), key=lambda a: a[1][0])]
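To illustrate the idea without needing pyfasta installed, here is a minimal standalone sketch. It assumes an index that maps each header to a (start, stop) pair of byte offsets, as the snippet above implies; `build_index` and `keys_in_file_order` are hypothetical helpers written for this example and are not part of pyfasta, whose real index may be built differently.

```python
import os
import tempfile


def build_index(path):
    """Map each header to the (start, stop) byte offsets of its sequence."""
    index = {}
    name, start, pos = None, 0, 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                # Close the previous record before starting a new one.
                if name is not None:
                    index[name] = (start, pos)
                name = line[1:].strip().decode()
                start = pos + len(line)
            pos += len(line)
    if name is not None:
        index[name] = (start, pos)
    return index


def keys_in_file_order(index):
    """Return headers sorted by the start offset stored in the index."""
    return [name for name, _ in sorted(index.items(), key=lambda item: item[1][0])]


# Small demonstration on a throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">chrB\nACGT\n>chrA\nGG\nCC\n>chrC\nTTTT\n")

index = build_index(tmp.name)
os.remove(tmp.name)

# Simulate an index whose keys come back in a different order.
scrambled = dict(sorted(index.items()))
print(list(scrambled))                # alphabetical order
print(keys_in_file_order(scrambled))  # original file order
```

Sorting on the start offset recovers the file order no matter how the mapping itself happens to iterate, and the fasta file is not read again.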

ADD COMMENT • link updated 21 months ago by Ram 45k • written 15.3 years ago by brentp 24k

0

Thanks Brent! That was nested pretty far! Do you think it would be easy to add a feature making this possible out of the box, without too much time overhead compared to a fasta iterator on the file? Cheers

ADD REPLY • link 15.3 years ago by Eric Normandeau 11k

0

Well, it's really just in f.index, so that's not nested too far. Rather than add more features (which I have to document, test, maintain, and justify), I'd prefer to add it to the wiki or something.

ADD REPLY • link 15.3 years ago by brentp 24k

Ram · Answer 2 · 2011-03-10

1

14.4 years ago

krst ▴ 10

If you have Python 2.7+, you could modify pyfasta to store the keys in an OrderedDict so that they stay in order.
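As a rough sketch of what that buys you, the snippet below collects headers into an `OrderedDict` while scanning a fasta stream. It does not patch pyfasta itself; `ordered_headers` is a hypothetical helper for illustration, and the in-memory sample data is made up.

```python
from collections import OrderedDict
import io


def ordered_headers(handle):
    """Collect header -> line number, remembering first-seen order."""
    headers = OrderedDict()
    for line_number, line in enumerate(handle):
        if line.startswith(">"):
            headers[line[1:].strip()] = line_number
    return headers


# In-memory stand-in for a fasta file with headers out of alphabetical order.
fasta = io.StringIO(">seq3\nACGT\n>seq1\nGGCC\n>seq2\nTTAA\n")
headers = ordered_headers(fasta)
print(list(headers.keys()))  # keys come back in insertion (file) order
```

Note that from Python 3.7 onward, a plain `dict` also preserves insertion order, so on a modern interpreter the same effect is available without `OrderedDict`.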

ADD COMMENT • link updated 21 months ago by Ram 45k • written 14.4 years ago by krst ▴ 10