That depends on what you want to do with the data -
If you want to iterate over all sequences in a FASTA file, I'd use Biopython's SeqIO.parse("your_file.fasta", "fasta")
iterator. This won't load the entire file into memory.
from Bio import SeqIO
total = 0
for record in SeqIO.parse("your_file.fasta", "fasta"):
    total += len(record)  # each record is a SeqRecord; len() gives its sequence length
print(total)
is a roundabout way to count the total length of all sequences in a FASTA file, for example.
If you need random access to entire sequences, I'd use SeqIO.index_db("index_file", "your_file.fasta", "fasta"). This creates an on-disk SQLite database of file offsets (here "index_file"), so the script is only slow the first time you run it; after that it uses the existing index, which behaves just like a read-only Python dictionary. Using an index means you never have to load the whole file into memory.
from Bio import SeqIO
index = SeqIO.index_db("index_file", "your_file.fasta", "fasta")
my_seq = index["contig_1"]  # looks up the SeqRecord by ID, like a dict
If you need random access to regions within arbitrary sequences, I'd use bedtools getfasta, as in the one-liner below.
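A minimal invocation, assuming your regions of interest are listed in a BED file (regions.bed and the output name are placeholders):

bedtools getfasta -fi your_file.fasta -bed regions.bed -fo regions_out.fasta

This writes only the requested subsequences as FASTA, so you never read the rest of the file.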
If you just want to read a single sequence into memory, use SeqIO.read(), which complains if the FASTA file contains more than one sequence. Note that this loads the whole record into memory.
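For example (the file name is a placeholder):

from Bio import SeqIO
record = SeqIO.read("single_sequence.fasta", "fasta")  # raises ValueError unless the file holds exactly one record
print(record.id, len(record))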
I would use pyfaidx for all of these use cases, but I'm biased... It reuses the samtools faidx (.fai) index, and provides random access to sequences and subsequences through a dictionary interface, so it's one less API to learn, and it's very efficient.
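A quick sketch of that interface (file and sequence names are placeholders):

from pyfaidx import Fasta
fa = Fasta("your_file.fasta")      # builds your_file.fasta.fai on first use, then reuses it
region = fa["contig_1"][100:200]   # lazily reads only this slice from disk
print(region.seq)

Slicing returns a Sequence object backed by the index, so only the requested region is pulled from the file.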
How did I miss this!?
I'll fire my marketing team.