Hi all, For a project I'm working on, I need to be able to quickly retrieve the sequence at a given 2kb window of the reference genome hg38 within python. The windows I need might not be consecutive (i.e. one thread might need to get the sequence for chr1:2500-4500 and another might need chrX:0-2000). Currently, the best way I've found is by making system tools from within python to samtools faidx, but this is slowing down my program quite a bit. I'd like to avoid a ton of file i/o as this slows things down, but I don't know an efficient way to have everything I need in memory, either. Any suggestions for an efficent way to do this from within python would be really appreciated. Thanks!
Pyfaidx?
I would recommend pysam instead. This project is more active.
You just have to create the fasta index file (
.fai
) by your own using e.g.samtools faidx
to have random access.Agree, use
pysam.FastaFile
andfetch
.When I've tried using pyfaidx in the past, it's used nearly all my memory to load in the fasta. is this unexpected?
You're likely thinking of biopython. pyfaidx is much more memory efficient.