Hello,
I have several genomes, each spread across multiple FASTA files. Some of these files contain multiple short sequences, and some of them contain large, chromosome-sized sequences. I need efficient random access to sequence data by FASTA record ID and sequence location (e.g. chr1:12384-12884) from Python.
I am currently using BioPython to read all of the FASTA files for a genome and store the SeqRecords in a dict by FASTA id (with BioPyhton's to_dict). This gives me efficient random access (get the SeqRecord by FASTA id, and then subset it to the part I need), but it requires the whole genome to be loaded into memory. I would like to replace this with on-disk indexing.
BioPython provides on-disk indexing of FASTA files, but it appears to be at a per-record level. I can't ask BioPython to read a small subset of a FASTA record from an indexed FASTA file; I have to get the whole record (which may be many megabytes) from BioPython and then take the part I want and throw out the rest.
Is there a Pyhton library that can index within FASTA records and pull out pieces without having to read the whole record each time? Does BioPython support this use case in a way that I am not aware of (with some sort of lazy-loading SeqRecord or something)?
The fastahack module seems to be working great for me. It has a dependency on cython that isn't listed in its setup.py, but once I figured that out I was able to get it working very quickly. Thanks!