How can I fetch genomic sequence efficiently using Python? For example, from a .fa file or some other easily obtained format? I basically want an interface fetch_seq(chrom, strand, start, end)
which will return the sequence [start, end]
on the given chromosome on the specified strand.
Analogously, is there a programmatic python interface for getting phastCons scores?
thanks.
Samtools is definitely useful, but it is disappointing to see non-qualified statements such as "Biopython is dead-slow and uses lots of memory." This depends entirely on your usage. You haven't defined what records is, but if it's a SeqIO.index object then you are parsing the record from disk on every call. Try getting the record with 'records[chrom]' then slicing it from memory. If you want to access random sequences on disk with an index from Python, try the 2bit support in bx-python.
Yes, an index, and I figured it is seeking to the slice on every call (there's no disk I/O though - after first call data is cached and is only using CPU cycles). Thanks for suggesting bx-python, I'll have a look at that. I've also edited my answer a bit to not be so peremptory.