Quickly retrieve reference genome sequence within python
0
1
Entering edit mode
6.2 years ago
gewa ▴ 20

Hi all, For a project I'm working on, I need to be able to quickly retrieve the sequence at a given 2kb window of the reference genome hg38 within python. The windows I need might not be consecutive (i.e. one thread might need to get the sequence for chr1:2500-4500 and another might need chrX:0-2000). Currently, the best way I've found is by making system tools from within python to samtools faidx, but this is slowing down my program quite a bit. I'd like to avoid a ton of file i/o as this slows things down, but I don't know an efficient way to have everything I need in memory, either. Any suggestions for an efficent way to do this from within python would be really appreciated. Thanks!

sequence python • 2.7k views
ADD COMMENT
1
Entering edit mode

Pyfaidx?

ADD REPLY
2
Entering edit mode

I would recommend pysam instead. This project is more active.

You just have to create the fasta index file (.fai) by your own using e.g. samtools faidx to have random access.

ADD REPLY
0
Entering edit mode

Agree, use pysam.FastaFile and fetch.

ADD REPLY
0
Entering edit mode

When I've tried using pyfaidx in the past, it's used nearly all my memory to load in the fasta. is this unexpected?

ADD REPLY
1
Entering edit mode

You're likely thinking of biopython. pyfaidx is much more memory efficient.

ADD REPLY

Login before adding your answer.

Traffic: 1661 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6