pyfaidx is a Python package for random access and indexing of fasta files. It uses the same indexing scheme as samtools faidx
and therefore can generate/use the same "fasta.fai" indices. BioPython has something similar, although to my knowledge it creates an in-memory index of similar structure but does not allow re-using samtools index or writing the new index to a file. FastaHack is a C implementation with similar scope.
The main advantages of pyfaidx are:
- produces and re-uses the samtools fasta index you already have
- import it into your Python 2.6, 2.7, 3.2, 3.3, or pypy script without all of BioPython
- it's entirely Python so you can easily hack and extend it
- the
faidx
cli script is actually faster thansamtools faidx
(see below)
The indexing time for hg19 (~3.0GB) is comparable to samtools:
samtools:
real 0m38.077s user 0m35.326s sys 0m2.340s
real 0m40.551s user 0m37.137s sys 0m2.490s
real 0m39.788s user 0m36.858s sys 0m2.454s
pyfaidx:
real 0m46.735s user 0m43.537s sys 0m2.462s
real 0m54.161s user 0m49.715s sys 0m2.867s
real 0m52.432s user 0m49.056s sys 0m2.703s
Once the index is generated, samtools and pyfaidx access speeds somewhat surprising:
faidx (pyfaidx cli script):
time faidx hs_ref_GRCh37_multi.fa 'gi|224589801|ref|NC_000010.10|:1-100000000' > /dev/null
real 0m1.586s user 0m1.095s sys 0m0.460s
real 0m1.548s user 0m1.079s sys 0m0.446s
real 0m1.505s user 0m1.061s sys 0m0.434s
samtools:
time samtools faidx hs_ref_GRCh37_multi.fa 'gi|224589801|ref|NC_000010.10|:1-100000000' > /dev/null
real 0m8.356s user 0m8.190s sys 0m0.128s
real 0m8.389s user 0m8.205s sys 0m0.149s
real 0m8.236s user 0m8.094s sys 0m0.126s
This is a 5.5X increase in subsequence retrieval speed! Thanks go out to brentp for his contributions.
Pretty neat!
I will mention here that for a FASTA file to work with this method (and this applies to samtools faidx) the fasta files need to be wrapped with the same column length. This condition is usually satisfied but when it is not all kinds of subtle errors could occur.
Yes, and I've been meaning to add this to the documentation as well as add a check during indexing. Thanks for the reminder!
Great stuff. Does pyfaidx also support gzipped fasta files?
Not currently, and probably not ever, but I do think BioPython SeqIO might do something with bgzipped files.
maybe it could be made to work with an razip'ed file. samtools can index and razip'ed fasta, right?
Yes, that is true, and I think this would be feasible but will require me to write a python RAZip implementation, which I would like to do but might not have the time.
Very useful tool with an impressive performance!
I'm extracting from tens to hundreds of millions of regions from a fasta file, and as samtools faidx got too slow, I decided to give pyfaidx a chance. However, it choke completely or alternatively was even slower than samtools faidx (and I got tired of waiting and killed it). Neatly, fastaFromBed from bedtools2 turned out to be about 4x faster than samtools faidx..
Can I ask how you were using pyfaidx? It's possible your performance was due to something other than fetching the sequence...
Like this:
This is not testing the performance of pyfaidx, but rather the performance of loading the Python interpreter and
faidx
program into memory. A better solution would be to pass the regions as a BED file:Of course
bedtools getfasta
is always going to be the fastest for large numbers of regions, since it reads the entire FASTA file into memory and then subsets from in-memory arrays rather than seeking to file offsets - access times here are orders of magnitude different.