Correct Way To Parse A Fasta File In Python
8
27
Entering edit mode
14.6 years ago

Hi,

I have been wondering at the correct approach in Python, maybe using Biopython, of parsing a fasta file without having to place it in memory (eg: NOT having to read it to a list, dictionary or fasta class) before using it.

The desired result would behave like a generator, as in the pseudo-code example below:

fasta_sequences = fasta_generator(input_file) # The function I miss
with open(output_file) as out_file:
    for fasta in fasta_sequences:
        name, sequence = fasta
        new_sequence = some_function(sequence)
        write_fasta(out_file) # Function defined elsewhere

Important aspects are:

  • Read sequences one at a time
  • Does not put all the sequences into memory
  • The approach is safe and well tested

Thanks for your suggestions!

python fasta • 209k views
ADD COMMENT
50
Entering edit mode
14.6 years ago
Zhaorong ★ 1.4k

I think you can just use Biopython

from Bio import SeqIO

fasta_sequences = SeqIO.parse(open(input_file),'fasta')
with open(output_file) as out_file:
    for fasta in fasta_sequences:
        name, sequence = fasta.id, str(fasta.seq)
        new_sequence = some_function(sequence)
        write_fasta(out_file)

You can take a look at the Biopython tutorial: http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc11

ADD COMMENT
3
Entering edit mode

I use Biopython all the time, but parsing fasta files is all I ever use it for. :) Two other functions I use for fasta parsing is: SeqIO.to_dict() which builds all sequences into a dictionary and save it in memory SeqIO.index() which builds a dictionary without putting the sequences in memory

I was thinking of looking into Biopython a little deeper, since it offers much more than fasta parsing, but did not get a chance. :(

ADD REPLY
2
Entering edit mode

Very useful answer from 7 years ago! FYI, in current version of biopython(1.69), fasta.seq.tostring() is obsolete, use str(fasta.seq) instead.

ADD REPLY
0
Entering edit mode

Very nice example. I knew Biopython offered something of the sort, but had never tried it. @Zhaorong do you have a lot of experience with Biopython? What have you used it for? Cheers

ADD REPLY
0
Entering edit mode

fasta.seq.tostring() should have been str(fasta.seq). This example is wrong.

ADD REPLY
3
Entering edit mode

The example is now wrong, but almost 9 years ago when this was written it was perfectly valid. The Bio.Seq.Seq.tostring() method was removed in the latest Biopython release, and was deprecated a bit before: https://github.com/biopython/biopython/blob/654309121f2cc0c01dfff73cd3dec3a435d76fc2/DEPRECATED.rst#bioseqseqtostring-and-bioseqmutableseqtostring

ADD REPLY
2
Entering edit mode

It is indeed wrong today. I edited the answer since it has been possible to use str(sequence) for a long time now.

ADD REPLY
1
Entering edit mode

Thanks. I constantly forget that I'm a moderator.

ADD REPLY
19
Entering edit mode
14.5 years ago
brentp 24k

from the implementation in the link above, i came up with this. it's pretty concise and only consumes 1 record at a time.

from itertools import groupby

def fasta_iter(fasta_name):
    """
    given a fasta file. yield tuples of header, sequence
    """
    fh = open(fasta_name)
    # ditch the boolean (x[0]) and just keep the header or sequence since
    # we know they alternate.
    faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))
    for header in faiter:
        # drop the ">"
        header = header.next()[1:].strip()
        # join all sequence lines to one.
        seq = "".join(s.strip() for s in faiter.next())
        yield header, seq
ADD COMMENT
1
Entering edit mode

I like your use of groupby and lambda

ADD REPLY
1
Entering edit mode

great solution. How can I make this work in Python3?

ADD REPLY
1
Entering edit mode
" for python3"
from itertools import groupby

def fasta_iter(fasta_name):
"""
modified from Brent Pedersen
https://www.biostars.org/p/710/
given a fasta file. yield tuples of header, sequence
"""
"first open the file outside "
fh = open(fasta_name)

# ditch the boolean (x[0]) and just keep the header or sequence since
# we know they alternate.
faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))

for header in faiter:
    # drop the ">"
    headerStr = header.__next__()[1:].strip()

    # join all sequence lines to one.
    seq = "".join(s.strip() for s in faiter.__next__())

    yield (headerStr, seq)

fiter = fasta_iter('testset.fas')

for ff in fiter:
    headerStr, seq = ff
    print(headerStr)
ADD REPLY
0
Entering edit mode

Thanks for the code. I'd like to supplement it for opeing file in binary mode.

fin = open(filename, 'rb')

faiter = (x[1] for x in groupby(fin, lambda line: str(line, 'utf-8')[0] == ">"))
for header in faiter:
    headerStr = str(header.__next__(), 'utf-8')
    long_name = headerStr.strip().replace('>', '')
    name = long_name.split()[0]
    seq = "".join(str(s, 'utf-8').strip() for s in faiter.__next__())
    yield (name, seq, long_name)
ADD REPLY
0
Entering edit mode

Nicely done, thanks for a great example on how to use groupby! I'll try to include it in my code, although I'll probably keep the Biopython fasta parser for now :)

ADD REPLY
0
Entering edit mode

Awesome approach, helped me a lot.

ADD REPLY
0
Entering edit mode

When I print "header" I get only the first header whereas I have many header. How do I get a list of all the headers or seqs?

ADD REPLY
0
Entering edit mode

That's because this is a generator function. Do get all headers you'll have to do something like:

fasta = fasta_iter("hg19.fa")
for header, seq in fasta:   
  print(header)

Or, you can use a package such as https://github.com/mdshw5/pyfaidx and do:

from pyfaidx import Fasta
fasta = Fasta("hg19.fa")
headers =fasta.keys()
ADD REPLY
4
Entering edit mode
14.6 years ago
brentp 24k

In another thread, someone pointed you to this: http://drj11.wordpress.com/2010/02/22/python-getting-fasta-with-itertools-groupby/ which I think is pretty beautiful. but this is trivial in any library.

ADD COMMENT
4
Entering edit mode
14.5 years ago
Science_Robot ★ 1.1k

Here's one I used for a script. I don't know if it's necessarily the best. I've made dozens of different variations.

You do fasta = Fasta(handle) and then for record in fasta to yield Dna objects

class Dna:
    ''' Object representing a FASTA record. '''
    def __init__(self, header, sequence):
        self.head = header
        self.seq = sequence
    def __repr__(self):
        return '[HTML]' % (self.head)
    def __str__(self, separator=''):
        return '>%s\n%s' % (self.head, separator.join(self.seq))
    def __len__(self):
        return len(''.join(self.seq))
    @property
    def sequence(self, separator=''):
        return separator.join(self.seq)

class Fasta:
    ''' A FASTA iterator/generates DNA objects. '''
    def __init__(self, handle):
        self.handle = handle
    def __repr__(self):
        return '[HTML]' % handle
    def __iter__(self):
        header, sequence = '', []
        for line in self.handle:
            if line[0] == '>':
                if sequence: yield Dna(header, sequence)
                header = line[1:-1]
                sequence = []
            else:
                sequence.append(line.strip())
        yield Dna(header, sequence)
ADD COMMENT
4
Entering edit mode
10.4 years ago

You might check out Pyfaidx: Efficient, "Pythonic" Random Access To Fasta Files Using Samtools-Compatible Indexing. This is a package that I wrote that builds or reads a samtools compatible fasta index and can read arbitrary amounts of the fasta file at once. I've kept it up to date and am always trying to make it more lightweight and "pythonic". Even though it's designed for random access there is an efficient line-based iterator method:

from pyfaidx import Fasta
genes = Fasta('tests/data/genes.fasta')
for line in genes['NM_001282543.1']:
...   print(line)
CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGC
AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA
CGATGCCGGATAATCGGCAGCCGAGGAACCGGCAGCCGAGGATCCGCTCCGGGAACGAGCCTCGTTCCGC
...
ADD COMMENT
2
Entering edit mode
14.6 years ago
Paulo Nuin ★ 3.7k

A very simple and not really elegant approach I have on Beginning Python for Bioinformatics website:

class FastaSeq:
def __init__(self, name, sequence):
    self.name = name
    self.sequence = sequence

def get_seqs(file):
    items = []
    index = 0
    for line in file:
        if line.startswith(">"):
            if index >= 1:
                items.append(aninstance)
            index+=1
            name = line[:-1]
            seq = ''
            aninstance = FastaSeq(name, seq)
        else:
            seq += line[:-1]
            aninstance = FastaSeq(name, seq)
    items.append(aninstance)

    return items

It works as a learning experience. But you don't want to reinvent the wheel on this one.

ADD COMMENT
3
Entering edit mode

Any approach that builds a full list is unfeasible in the world of high throughput sequencing. In general any record based reader should be a generator that can simply be unwound into a list if one so desires.

ADD REPLY
1
Entering edit mode

Strings are immutable, so python has to create a new string object every time you append. It's better to use seq = [], seq.append(line[:-1]), and then ''.join(seq)

ADD REPLY
0
Entering edit mode

Thanks @nuin. However, this is exactly what I specified I wanted to avoid, for the reasons mentioned by @Istvan Albert.

ADD REPLY
0
Entering edit mode

Then just change the return statement to a yield. The code needs to be modified to handle the first sequence.

ADD REPLY
0
Entering edit mode

I guess you meant: change the items.append(aninstance) to yield aninstance

ADD REPLY
0
Entering edit mode

yes, that's right. The return statement shouldn't exist then.

ADD REPLY
0
Entering edit mode
9.4 years ago

I regularly use this one... Its much useful when parsing large FASTA file with multiple entries.

def fasta_reader(filename):
  from Bio.SeqIO.FastaIO import FastaIterator
  with open(filename) as handle:
    for record in FastaIterator(handle):
      yield record

for entry in fasta_reader("file.fasta"):
  print str(entry.id) #This is header of fasta entry
  print str(entry.seq) #This is sequence of specific fasta entry
ADD COMMENT
0
Entering edit mode
3.2 years ago
yampolsk ▴ 10

Because I teach biology students with no previous coding experience here is a code to read fasta that requires no understanding of somewhat more advanced things. It is by no means efficient for large files, but works for teaching purposes.

ADD COMMENT
0
Entering edit mode
deflines=[]
sequences=[]
f = open("test.fasta", "r")
morelines = True
while morelines:
    seq=""
    moreseq = True
    while moreseq:
        nxtline = f.readline()
        if not nxtline:
            morelines = False
            break
        elif nxtline[0] != ">":
            seq+=nxtline.strip()
        else: 
            deflines.append(nxtline.strip())
                    moreseq = False     
    sequences.append(seq)
for seq in sequences:
    print(seq)
for d in deflines:
    print(d)
f.close()

ADD REPLY
0
Entering edit mode

oops I don't understand why all end of lines and indents disappeared. If someone knows how to edit this to look like a code please do...

ADD REPLY

Login before adding your answer.

Traffic: 1728 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6