How To Get Millions of Sequences from Genomic Regions Efficiently
4 months ago
Yacob • 0

So far, I have tried the following Python packages, and all of them have caused me to run out of RAM and are not as efficient as I would like: pysam, pyfaidx, SeqIO.

I have a DataFrame with millions of coordinates that I need the sequences for, but I am unsure how to get them without hitting the RAM limit. Would bedtools help if I ran it from the terminal?

Any suggestions would be greatly appreciated.

python pandas genome • 287 views
4 months ago
rfran010 ★ 1.3k

I'm not very familiar with those packages, but I have to imagine the command line can do this easily. It may depend on the size of your regions, but bedtools or samtools should work for you.

I did a test with 4 GB of RAM (allocated on an HPC) and 5.6 million regions (repetitions of the mm10 gene GTF). It took a few minutes, but bedtools and samtools both ran without issue. The resulting FASTA was 17 GB.
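
For anyone starting from Python, here is a minimal sketch of that route: stream the coordinates to a BED file one line at a time, then let bedtools getfasta do the extraction outside of Python. This is not the exact command from the test above. The function names, file paths, and the (chrom, start, end) tuple layout are all hypothetical, and the coordinates are assumed to already be 0-based, end-exclusive as BED expects.

```python
import shutil
import subprocess
from typing import Iterable, List, Tuple

# Assumed layout: (chrom, 0-based start, end-exclusive), i.e. BED convention.
Region = Tuple[str, int, int]


def write_bed(regions: Iterable[Region], bed_path: str) -> int:
    """Stream regions to a tab-separated BED file without buffering them all."""
    count = 0
    with open(bed_path, "w") as handle:
        for chrom, start, end in regions:
            handle.write(f"{chrom}\t{start}\t{end}\n")
            count += 1
    return count


def getfasta_command(fasta_path: str, bed_path: str, out_path: str) -> List[str]:
    """Build the bedtools getfasta call: -fi genome, -bed regions, -fo output."""
    return ["bedtools", "getfasta", "-fi", fasta_path, "-bed", bed_path, "-fo", out_path]


def extract_sequences(fasta_path: str, bed_path: str, out_path: str) -> None:
    """Run bedtools; sequences go straight to out_path, never into Python memory."""
    if shutil.which("bedtools") is None:
        raise RuntimeError("bedtools was not found on PATH")
    subprocess.run(getfasta_command(fasta_path, bed_path, out_path), check=True)
```

Usage would look something like write_bed(rows, "regions.bed") followed by extract_sequences("genome.fa", "regions.bed", "regions.fa"), where all three file names are placeholders. If the coordinates are already in a pandas DataFrame with columns named chrom, start and end (an assumption), df[["chrom", "start", "end"]].to_csv("regions.bed", sep="\t", header=False, index=False) should produce the same file. Reading the resulting FASTA back in line by line, rather than all at once, keeps memory use low on the Python side too.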
