Question

Correct Way To Parse A Fasta File In Python

27

Entering edit mode

15.1 years ago

Eric Normandeau 11k

Hi,

I have been wondering at the correct approach in Python, maybe using Biopython, of parsing a fasta file without having to place it in memory (eg: NOT having to read it to a list, dictionary or fasta class) before using it.

The desired result would behave like a generator, as in the pseudo-code example below:

fasta_sequences = fasta_generator(input_file) # The function I miss
with open(output_file) as out_file:
    for fasta in fasta_sequences:
        name, sequence = fasta
        new_sequence = some_function(sequence)
        write_fasta(out_file) # Function defined elsewhere

Important aspects are:

Read sequences one at a time
Does not put all the sequences into memory
The approach is safe and well tested

Thanks for your suggestions!

python fasta • 217k views

ADD COMMENT • link updated 18 months ago by Ram 45k • written 15.1 years ago by Eric Normandeau 11k

Ram · Answer 1 · 2010-04-15

50

Entering edit mode

15.1 years ago

Zhaorong ★ 1.4k

I think you can just use Biopython

from Bio import SeqIO

fasta_sequences = SeqIO.parse(open(input_file),'fasta')
with open(output_file) as out_file:
    for fasta in fasta_sequences:
        name, sequence = fasta.id, str(fasta.seq)
        new_sequence = some_function(sequence)
        write_fasta(out_file)

You can take a look at the Biopython tutorial: http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc11

ADD COMMENT • link updated 18 months ago by Ram 45k • written 15.1 years ago by Zhaorong ★ 1.4k

3

Entering edit mode

I use Biopython all the time, but parsing fasta files is all I ever use it for. :) Two other functions I use for fasta parsing is: SeqIO.to_dict() which builds all sequences into a dictionary and save it in memory SeqIO.index() which builds a dictionary without putting the sequences in memory

I was thinking of looking into Biopython a little deeper, since it offers much more than fasta parsing, but did not get a chance. :(

ADD REPLY • link 15.1 years ago by Zhaorong ★ 1.4k

2

Entering edit mode

Very useful answer from 7 years ago! FYI, in current version of biopython(1.69), fasta.seq.tostring() is obsolete, use str(fasta.seq) instead.

ADD REPLY • link 8.0 years ago by Chen Sun ★ 1.1k

0

Entering edit mode

Very nice example. I knew Biopython offered something of the sort, but had never tried it. @Zhaorong do you have a lot of experience with Biopython? What have you used it for? Cheers

ADD REPLY • link 15.1 years ago by Eric Normandeau 11k

0

Entering edit mode

fasta.seq.tostring() should have been str(fasta.seq). This example is wrong.

ADD REPLY • link 6.2 years ago by scchess ▴ 640

3

Entering edit mode

The example is now wrong, but almost 9 years ago when this was written it was perfectly valid. The Bio.Seq.Seq.tostring() method was removed in the latest Biopython release, and was deprecated a bit before: https://github.com/biopython/biopython/blob/654309121f2cc0c01dfff73cd3dec3a435d76fc2/DEPRECATED.rst#bioseqseqtostring-and-bioseqmutableseqtostring

ADD REPLY • link 6.2 years ago by Matt Shirley 10k

2

Entering edit mode

It is indeed wrong today. I edited the answer since it has been possible to use str(sequence) for a long time now.

ADD REPLY • link 6.2 years ago by Eric Normandeau 11k

1

Entering edit mode

Thanks. I constantly forget that I'm a moderator.

ADD REPLY • link 6.2 years ago by Matt Shirley 10k

Ram · Answer 2 · 2010-06-10

19

Entering edit mode

14.9 years ago

brentp 24k

from the implementation in the link above, i came up with this. it's pretty concise and only consumes 1 record at a time.

from itertools import groupby

def fasta_iter(fasta_name):
    """
    given a fasta file. yield tuples of header, sequence
    """
    fh = open(fasta_name)
    # ditch the boolean (x[0]) and just keep the header or sequence since
    # we know they alternate.
    faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))
    for header in faiter:
        # drop the ">"
        header = header.next()[1:].strip()
        # join all sequence lines to one.
        seq = "".join(s.strip() for s in faiter.next())
        yield header, seq

ADD COMMENT • link updated 18 months ago by Ram 45k • written 14.9 years ago by brentp 24k

1

Entering edit mode

I like your use of groupby and lambda

ADD REPLY • link 14.9 years ago by Science_Robot ★ 1.1k

1

Entering edit mode

great solution. How can I make this work in Python3?

ADD REPLY • link 13.5 years ago by User 6377 ▴ 10

1

Entering edit mode

" for python3"
from itertools import groupby

def fasta_iter(fasta_name):
"""
modified from Brent Pedersen
https://www.biostars.org/p/710/
given a fasta file. yield tuples of header, sequence
"""
"first open the file outside "
fh = open(fasta_name)

# ditch the boolean (x[0]) and just keep the header or sequence since
# we know they alternate.
faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))

for header in faiter:
    # drop the ">"
    headerStr = header.__next__()[1:].strip()

    # join all sequence lines to one.
    seq = "".join(s.strip() for s in faiter.__next__())

    yield (headerStr, seq)

fiter = fasta_iter('testset.fas')

for ff in fiter:
    headerStr, seq = ff
    print(headerStr)

ADD REPLY • link updated 18 months ago by Ram 45k • written 10.5 years ago by d.lituiev ▴ 10

0

Entering edit mode

Thanks for the code. I'd like to supplement it for opeing file in binary mode.

fin = open(filename, 'rb')

faiter = (x[1] for x in groupby(fin, lambda line: str(line, 'utf-8')[0] == ">"))
for header in faiter:
    headerStr = str(header.__next__(), 'utf-8')
    long_name = headerStr.strip().replace('>', '')
    name = long_name.split()[0]
    seq = "".join(str(s, 'utf-8').strip() for s in faiter.__next__())
    yield (name, seq, long_name)

ADD REPLY • link 5.9 years ago by niuyw ▴ 40

0

Entering edit mode

Nicely done, thanks for a great example on how to use groupby! I'll try to include it in my code, although I'll probably keep the Biopython fasta parser for now :)

ADD REPLY • link 14.9 years ago by Eric Normandeau 11k

0

Entering edit mode

Awesome approach, helped me a lot.

ADD REPLY • link 8.5 years ago by zebudavid ▴ 10

0

Entering edit mode

When I print "header" I get only the first header whereas I have many header. How do I get a list of all the headers or seqs?

ADD REPLY • link 8.2 years ago by alowi33 ▴ 50

0

Entering edit mode

That's because this is a generator function. Do get all headers you'll have to do something like:

fasta = fasta_iter("hg19.fa")
for header, seq in fasta:   
  print(header)

Or, you can use a package such as https://github.com/mdshw5/pyfaidx and do:

from pyfaidx import Fasta
fasta = Fasta("hg19.fa")
headers =fasta.keys()

ADD REPLY • link updated 18 months ago by Ram 45k • written 8.2 years ago by Matt Shirley 10k

Ram · Answer 3 · 2010-04-15

4

Entering edit mode

15.1 years ago

brentp 24k

In another thread, someone pointed you to this: http://drj11.wordpress.com/2010/02/22/python-getting-fasta-with-itertools-groupby/ which I think is pretty beautiful. but this is trivial in any library.

ADD COMMENT • link updated 18 months ago by Ram 45k • written 15.1 years ago by brentp 24k

Ram · Answer 4 · 2010-06-10

Here's one I used for a script. I don't know if it's necessarily the best. I've made dozens of different variations.

You do fasta = Fasta(handle) and then for record in fasta to yield Dna objects

class Dna:
    ''' Object representing a FASTA record. '''
    def __init__(self, header, sequence):
        self.head = header
        self.seq = sequence
    def __repr__(self):
        return '[HTML]' % (self.head)
    def __str__(self, separator=''):
        return '>%s\n%s' % (self.head, separator.join(self.seq))
    def __len__(self):
        return len(''.join(self.seq))
    @property
    def sequence(self, separator=''):
        return separator.join(self.seq)

class Fasta:
    ''' A FASTA iterator/generates DNA objects. '''
    def __init__(self, handle):
        self.handle = handle
    def __repr__(self):
        return '[HTML]' % handle
    def __iter__(self):
        header, sequence = '', []
        for line in self.handle:
            if line[0] == '>':
                if sequence: yield Dna(header, sequence)
                header = line[1:-1]
                sequence = []
            else:
                sequence.append(line.strip())
        yield Dna(header, sequence)

Ram · Answer 5 · 2014-07-04

You might check out Pyfaidx: Efficient, "Pythonic" Random Access To Fasta Files Using Samtools-Compatible Indexing. This is a package that I wrote that builds or reads a samtools compatible fasta index and can read arbitrary amounts of the fasta file at once. I've kept it up to date and am always trying to make it more lightweight and "pythonic". Even though it's designed for random access there is an efficient line-based iterator method:

from pyfaidx import Fasta
genes = Fasta('tests/data/genes.fasta')
for line in genes['NM_001282543.1']:
...   print(line)
CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGC
AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA
CGATGCCGGATAATCGGCAGCCGAGGAACCGGCAGCCGAGGATCCGCTCCGGGAACGAGCCTCGTTCCGC
...

Ram · Answer 6 · 2010-04-15

2

Entering edit mode

15.1 years ago

Paulo Nuin ★ 3.7k

A very simple and not really elegant approach I have on Beginning Python for Bioinformatics website:

class FastaSeq:
def __init__(self, name, sequence):
    self.name = name
    self.sequence = sequence

def get_seqs(file):
    items = []
    index = 0
    for line in file:
        if line.startswith(">"):
            if index >= 1:
                items.append(aninstance)
            index+=1
            name = line[:-1]
            seq = ''
            aninstance = FastaSeq(name, seq)
        else:
            seq += line[:-1]
            aninstance = FastaSeq(name, seq)
    items.append(aninstance)

    return items

It works as a learning experience. But you don't want to reinvent the wheel on this one.

ADD COMMENT • link updated 18 months ago by Ram 45k • written 15.1 years ago by Paulo Nuin ★ 3.7k

3

Entering edit mode

Any approach that builds a full list is unfeasible in the world of high throughput sequencing. In general any record based reader should be a generator that can simply be unwound into a list if one so desires.

ADD REPLY • link 15.1 years ago by Istvan Albert 102k

1

Entering edit mode

Strings are immutable, so python has to create a new string object every time you append. It's better to use seq = [], seq.append(line[:-1]), and then ''.join(seq)

ADD REPLY • link updated 6.7 years ago by Ram 45k • written 14.9 years ago by Science_Robot ★ 1.1k

0

Entering edit mode

Thanks @nuin. However, this is exactly what I specified I wanted to avoid, for the reasons mentioned by @Istvan Albert.

ADD REPLY • link 15.1 years ago by Eric Normandeau 11k

0

Entering edit mode

Then just change the return statement to a yield. The code needs to be modified to handle the first sequence.

ADD REPLY • link 15.1 years ago by Paulo Nuin ★ 3.7k

0

Entering edit mode

I guess you meant: change the items.append(aninstance) to yield aninstance

ADD REPLY • link updated 6.7 years ago by Ram 45k • written 15.1 years ago by Michael Kuhn 5.0k

0

Entering edit mode

yes, that's right. The return statement shouldn't exist then.

ADD REPLY • link 15.1 years ago by Paulo Nuin ★ 3.7k

Ram · Answer 7 · 2015-07-09

I regularly use this one... Its much useful when parsing large FASTA file with multiple entries.

def fasta_reader(filename):
  from Bio.SeqIO.FastaIO import FastaIterator
  with open(filename) as handle:
    for record in FastaIterator(handle):
      yield record

for entry in fasta_reader("file.fasta"):
  print str(entry.id) #This is header of fasta entry
  print str(entry.seq) #This is sequence of specific fasta entry

Ram · Answer 8 · 2021-10-01

0

Entering edit mode

3.6 years ago

yampolsk ▴ 10

Because I teach biology students with no previous coding experience here is a code to read fasta that requires no understanding of somewhat more advanced things. It is by no means efficient for large files, but works for teaching purposes.

ADD COMMENT • link 3.6 years ago by yampolsk ▴ 10

0

Entering edit mode

deflines=[]
sequences=[]
f = open("test.fasta", "r")
morelines = True
while morelines:
    seq=""
    moreseq = True
    while moreseq:
        nxtline = f.readline()
        if not nxtline:
            morelines = False
            break
        elif nxtline[0] != ">":
            seq+=nxtline.strip()
        else: 
            deflines.append(nxtline.strip())
                    moreseq = False     
    sequences.append(seq)
for seq in sequences:
    print(seq)
for d in deflines:
    print(d)
f.close()

ADD REPLY • link updated 18 months ago by Ram 45k • written 3.6 years ago by yampolsk ▴ 10

0

Entering edit mode

oops I don't understand why all end of lines and indents disappeared. If someone knows how to edit this to look like a code please do...

ADD REPLY • link 3.6 years ago by yampolsk ▴ 10