I'm wondering if anyone who's written a parser or had to check the validity of these files has worked out something more specific than the Wikipedia entry.
As far as I know, there is no single authoritative specification of the FASTA format. I normally use the guidelines in section 1 of this BLAST help document from the NCBI.
FASTA is not an especially complicated format:
That's more or less it.
I would argue that the FASTA format is originally defined as the input format of the program FASTA. So anything that can be parsed by FASTA is valid FASTA format, whereas anything that cannot be parsed by FASTA is not. In other words, the original parser of the format should be viewed as the reference implementation.
This is the tentative grammar I've worked out. It's community wiki so anyone who knows better than me can fix it.
<file> ::= <token> | <token> <file>
<token> ::= <ignore> | <seq>
<ignore> ::= <whitespace> | <comment> <newline>
<seq> ::= <header> <molecule> <newline>
<header> ::= ">" <arbitrary text> <newline>
<molecule> ::= <mol-line> | <mol-line> <molecule>
<mol-line> ::= <nucl-line> | <prot-line>
<nucl-line> ::= "^[ACGTURYKMSWBDHVNX-]+$"
<prot-line> ::= "^[ABCDEFGHIKLMNOPQRSTUVWYZX*-]+$"
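As a rough illustration, the grammar above could be turned into a line-by-line checker along these lines. This is only a sketch under some assumptions, not a reference implementation: the function name validate_fasta is made up, blank lines are tolerated anywhere, and because the grammar leaves <comment> undefined, the sketch assumes comment lines start with ';' (configurable through the hypothetical comment_prefix parameter).

```python
import re

# Character classes copied from the <nucl-line> and <prot-line> rules above.
NUCL_LINE = re.compile(r"^[ACGTURYKMSWBDHVNX-]+$")
PROT_LINE = re.compile(r"^[ABCDEFGHIKLMNOPQRSTUVWYZX*-]+$")


def validate_fasta(text, comment_prefix=";"):
    """Return a list of (line_number, message) problems; empty means valid.

    comment_prefix is an assumption: the grammar does not define <comment>.
    """
    problems = []
    in_record = False   # True once the first header has been seen
    seq_lines = 0       # sequence lines seen under the current header
    header_line = 0     # line number of the current header
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.rstrip()
        if not line.strip() or line.startswith(comment_prefix):
            continue  # <ignore>: whitespace or comment
        if line.startswith(">"):
            # <header>: ">" followed by arbitrary text
            if in_record and seq_lines == 0:
                problems.append((header_line, "header without sequence"))
            in_record, seq_lines, header_line = True, 0, number
        elif not in_record:
            problems.append((number, "sequence data before first header"))
        else:
            # <mol-line>: each line must match one of the two alphabets
            seq_lines += 1
            if not (NUCL_LINE.match(line) or PROT_LINE.match(line)):
                problems.append((number, "invalid characters in sequence line"))
    if in_record and seq_lines == 0:
        problems.append((header_line, "header without sequence"))
    return problems
```

Like the grammar, this checks each sequence line independently, so it would not notice a record that mixes nucleotide and protein lines, and it rejects lowercase residues.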
The sequence alphabet and associated punctuation are the one letter codes defined by IUPAC and IUBMB. For nucleotide sequences see:
For amino-acid sequences see:
In addition, 'J' is used for the mass-spec ambiguity between 'I' and 'L', and '*' marks a translation stop in translations from nucleotide sequences.
Note: the use of lowercase is recommended for nucleotide sequences and uppercase for amino-acid sequences. However, mixed case is used for a number of purposes, including as a result of filtering for low-complexity regions or sequence repeats, to indicate variations (insertions/deletions), or as an indicator of lower sequencing quality.
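Since case can carry meaning, a checker probably wants to compare residues case-insensitively while still being able to report where the lowercase stretches are. A small sketch of that idea follows; the function names are made up for illustration, and what a lowercase run actually means (masking, variation, low quality) depends on whatever tool produced the file.

```python
import re


def lowercase_runs(sequence):
    """Return 0-based, half-open (start, end) spans of lowercase residues."""
    return [match.span() for match in re.finditer(r"[a-z]+", sequence)]


def normalize_case(sequence):
    """Uppercase a sequence so it can be checked against an uppercase alphabet."""
    return sequence.upper()


if __name__ == "__main__":
    seq = "ACGTacgtNNacGT"
    print(lowercase_runs(seq))   # spans of the lowercase stretches
    print(normalize_case(seq))   # same residues, case information dropped
```

Recording the spans before normalizing keeps the case information available even if the rest of the pipeline only handles uppercase.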
This may also be useful: NCBI's C++ Toolkit includes a class, CFastaReader, if you are using C++.
Also, PyCogent has a parser for FASTA files in Python. After installing cogent, use it like so:
from cogent.parse.fasta import MinimalFastaParser
fna = open("myfastafile.fna")
parsed = MinimalFastaParser(fna)
parsed is now an iterable of (head, body) tuples, one for each individual read.
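If installing a library is not an option, the same (head, body) idea can be sketched with the standard library only. This is not PyCogent's implementation, just a minimal illustration; the name iter_fasta is made up, and the sketch assumes that blank lines can be skipped and that anything before the first '>' header can be ignored.

```python
def iter_fasta(lines):
    """Yield (head, body) tuples from an iterable of text lines.

    head is the header text without the leading '>'; body is the
    sequence with its line breaks removed.
    """
    head, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if head is not None:
                yield head, "".join(chunks)  # emit the previous record
            head, chunks = line[1:].strip(), []
        elif head is not None:
            chunks.append(line)
        # lines before the first header are silently ignored (assumption)
    if head is not None:
        yield head, "".join(chunks)


if __name__ == "__main__":
    example = [">seq1 first\n", "ACGT\n", "TTAA\n", "\n", ">seq2\n", "MKV*\n"]
    for head, body in iter_fasta(example):
        print(head, body)
```

Because it accepts any iterable of lines, an open file object can be passed in directly, for example inside a with open(...) block.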
If you are working in Python, you could use Biopython: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc11. It is simple to use. The tutorial notes that FASTA does not specify the sequence alphabet at all.
Somewhere I read, most probably in the bioinformatics book by David Mount, that the sequence string in a FASTA file can end with an optional '*' (asterisk).
Tim Yates' spec: