That depends on what you want to do with the data -
If you want to iterate over all sequences in a FASTA file, I'd use Biopython's SeqIO.parse("your_file.fasta", "fasta")
iterator. This won't load the entire file into memory.
from Bio import SeqIO
total = 0
for record in SeqIO.parse("your_file.fasta", "fasta"):
    total += len(record)  # each record is a SeqRecord; len() gives its sequence length
print(total)
is a roundabout way to count the total length of all sequences in a FASTA file, for example.
If you need random access to entire sequences, I'd use SeqIO.index_db("index_file", "your_file.fasta", "fasta"). This creates an on-disk SQLite database of file offsets (here "index_file"), so the script is only slow the first time you run it; after that it uses the existing index, which behaves just like a read-only Python dictionary. Using an index means you never have to load the whole file into memory.
from Bio import SeqIO
index = SeqIO.index_db("index_file", "your_file.fasta", "fasta")
my_seq = index["contig_1"]  # looks up the SeqRecord by ID, like a dict
If you need random access to regions within arbitrary sequences, I'd use bedtools getfasta, as in the one-liner below.
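A minimal invocation, assuming your regions of interest are listed in a BED file (regions.bed and the output name are placeholders):

bedtools getfasta -fi your_file.fasta -bed regions.bed -fo regions_out.fasta

This writes only the requested subsequences as FASTA, so you never read the rest of the file.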
If you just want to read a single sequence into memory, use SeqIO.read(), which complains if the FASTA file contains more than one sequence. Note that this loads the whole record into memory.
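For example (the file name is a placeholder):

from Bio import SeqIO
record = SeqIO.read("single_sequence.fasta", "fasta")  # raises ValueError unless the file holds exactly one record
print(record.id, len(record))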
I would use pyfaidx for all of these use cases, but I'm biased... It reuses the samtools faidx (.fai) index, and provides random access to sequences and subsequences through a dictionary interface, so it's one less API to learn, and it's very efficient.
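A quick sketch of that interface (file and sequence names are placeholders):

from pyfaidx import Fasta
fa = Fasta("your_file.fasta")      # builds your_file.fasta.fai on first use, then reuses it
region = fa["contig_1"][100:200]   # lazily reads only this slice from disk
print(region.seq)

Slicing returns a Sequence object backed by the index, so only the requested region is pulled from the file.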
How did I miss this!?
I'll fire my marketing team.