Is There A Precise Specification For Fasta Files?
13.3 years ago
Johnny Brown ▴ 140

I'm wondering if anyone who has written a parser, or had to check the validity of these files, has worked out something more specific than the Wikipedia entry.

fasta parsing • 6.9k views

Tim Yates' spec:

13.3 years ago
Neilfws 49k

So far as I know, there is no single authoritative source for FASTA format specification. I normally use the guidelines in section 1 of this BLAST help document from the NCBI.

FASTA is not an especially complicated format:

  • The first line begins with ">"
  • After ">", with no spaces, comes the sequence ID (also containing no spaces)
  • Anything after the ID + whitespace is the sequence Description
  • The sequence itself begins on the next line; it must use a valid alphabet, and lines should not exceed 80 characters (although most parsers will accept a sequence written on a single line)

That's more or less it.
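
As a rough illustration of the four points above, here is a minimal parser sketch in Python. It is not a reference implementation: the names `FastaRecord` and `parse_fasta` are made up for this example, and skipping blank lines and joining wrapped sequence lines are my own assumptions rather than anything the NCBI guidelines mandate.

```python
from typing import Iterable, Iterator, NamedTuple


class FastaRecord(NamedTuple):
    seq_id: str        # text between ">" and the first whitespace
    description: str   # rest of the header line (may be empty)
    sequence: str      # all sequence lines joined together


def parse_fasta(lines: Iterable[str]) -> Iterator[FastaRecord]:
    """Yield one FastaRecord per ">" header found in `lines`."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header: str, chunks: list) -> FastaRecord:
    # Split once on whitespace: the ID comes first, the description follows.
    parts = header.split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return FastaRecord(seq_id, description, "".join(chunks))
```

Because it accepts any iterable of lines, you can pass an open file handle directly, e.g. `for rec in parse_fasta(open("myfastafile.fna")): ...`. It does not check the sequence alphabet or the 80-character line length.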


Thanks, that's clear and concise.

You could definitely argue that this is more discussion than necessary for something so simple. My motivation was that I had to write a validator/parser and was frustrated by the ambiguity in the Wikipedia entry.

13.3 years ago

I would argue that the FASTA format is originally defined as the input format of the program FASTA. So anything that can be parsed by FASTA is valid FASTA format, whereas anything that cannot be parsed by FASTA is not. In other words, the original parser of the format should be viewed as the reference implementation.

13.3 years ago
Johnny Brown ▴ 140

This is the tentative grammar I've worked out. It's community wiki so anyone who knows better than me can fix it.

<file>     ::= <token> | <token> <file>
<token>    ::= <ignore> | <seq>
<ignore>   ::= <whitespace> | <comment> <newline>
<seq>      ::= <header> <molecule> <newline>
<header>   ::= ">" <arbitrary text> <newline>
<molecule> ::= <mol-line> | <mol-line> <molecule>
<mol-line> ::= <nucl-line> | <prot-line>
<nucl-line> ::= "^[ACGTURYKMSWBDHVNX-]+$"
<prot-line> ::= "^[ABCDEFGHIJKLMNOPQRSTUVWYZX*-]+$"

The sequence alphabet and associated punctuation are the one letter codes defined by IUPAC and IUBMB. For nucleotide sequences see:

For amino-acid sequences see:

In addition, 'J' is used for mass-spec ambiguity between 'I' and 'L', and '*' for a translation stop in translations from nucleotide sequences.

Note: the use of lowercase is recommended for nucleotide sequences and uppercase for amino-acid sequences. However, mixed case is used for a number of purposes: as a result of filtering for low-complexity regions or sequence repeats, to indicate variations (insertions/deletions), or as an indicator of lower sequencing quality.
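
To show how the line-level rules in the grammar above might be checked in practice, here is a small Python sketch. The function name `classify_sequence` is hypothetical. Matching case-insensitively is my own assumption, prompted by the note about lowercase and mixed-case sequences, and the alphabets are the tentative ones from the grammar (with 'J' included for the I/L ambiguity mentioned in the text), not an official standard.

```python
import re

# Alphabets follow the tentative grammar, including "J" for the I/L
# ambiguity described in the text. These are assumptions, not a standard.
# IGNORECASE is also an assumption, so lowercase and mixed case pass.
NUCLEOTIDE_LINE = re.compile(r"[ACGTURYKMSWBDHVNX-]+", re.IGNORECASE)
PROTEIN_LINE = re.compile(r"[ABCDEFGHIJKLMNOPQRSTUVWXYZ*-]+", re.IGNORECASE)


def classify_sequence(lines):
    """Return "nucleotide", "protein", or None for a list of sequence lines.

    Nucleotide is tried first, so a protein made up only of letters that
    are also nucleotide codes (e.g. "ACGT") is reported as "nucleotide".
    """
    stripped = [line.strip() for line in lines if line.strip()]
    if not stripped:
        return None  # no sequence data at all
    if all(NUCLEOTIDE_LINE.fullmatch(line) for line in stripped):
        return "nucleotide"
    if all(PROTEIN_LINE.fullmatch(line) for line in stripped):
        return "protein"
    return None  # at least one line uses a character outside both alphabets
```

The two alphabets overlap heavily, so the content alone cannot always tell you which kind of molecule a record holds; a real validator would probably need the caller to say which alphabet to expect.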


Taking neilfws's suggestion, we might define the header like this: <header> ::= ">" <seqid> " " <arbitrary text> <newline>; <seqid> ::= "^[^[:space:]]+$". Also, naturally, the "arbitrary" text cannot include a newline.


What's a <comment>?

13.1 years ago
Johnny Brown ▴ 140

This may also be useful if you are using C++: NCBI's C++ Toolkit includes a class called CFastaReader.

12.8 years ago
Johnny Brown ▴ 140

Also, PyCogent has a parser for FASTA files in Python. After installing cogent, use it like so:

from cogent.parse.fasta import MinimalFastaParser
fna = open("myfastafile.fna")
parsed = MinimalFastaParser(fna)

parsed will now be an iterable of (head, body) tuples, one per sequence record in the file.

12.8 years ago
Mawe ▴ 90

When working with Python, you could use Biopython: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc11. It is simple to use. The tutorial notes that FASTA does not specify the sequence alphabet at all.

12.8 years ago
Woa ★ 2.9k

Somewhere I read, most probably in the bioinformatics book by David Mount, that the sequence string in a FASTA file can end with an optional '*' (asterisk).
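
If you want a parser to tolerate that optional terminator, one way is to drop it before any alphabet check. This is only a sketch: `strip_terminator` is a made-up helper name, and removing just a single trailing '*' is my assumption about what "optional" means here.

```python
def strip_terminator(sequence: str) -> str:
    """Remove a single trailing "*" if present; leave everything else alone."""
    return sequence[:-1] if sequence.endswith("*") else sequence
```

An asterisk in the middle of a sequence (for example an internal stop codon in a translation) is left untouched.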
