Question

How To Get Millions of Sequences from Genomic Regions Efficiently

11 months ago

Yacob • 0

So far, I have tried the following Python packages: pysam, pyfaidx, and SeqIO. All of them caused me to run out of RAM, and none was as efficient as I would like.

I have a DataFrame with millions of coordinates that I need the sequences for, but I am unsure how to do this without hitting the RAM limit. Would bedtools help if I run it from the terminal?

Any suggestions would be greatly appreciated.

python pandas genome • 534 views

score 0 · Answer 1 · 2024-08-04

I'm not very familiar with this, but I'd expect the command line to handle it easily. It may depend on the size of your regions, but bedtools or samtools should work for you.

I ran a test with 4 GB of RAM (allocated on an HPC) and 5.6 million regions (repetitions of the mm10 gene GTF). It took a few minutes, but both bedtools and samtools ran without issue. The resulting FASTA was 17 GB.
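
For anyone who wants to drive this from Python, here is a minimal sketch of one way to do it. It is not the exact commands used in the test above. It streams the coordinates to a BED file one row at a time, then hands the work to `bedtools getfasta`, so neither the regions nor the sequences need to sit in Python's memory. The helper names (`write_bed`, `getfasta_command`, `run_getfasta`) and all file paths are made up for illustration, and the coordinates are assumed to already be 0-based, half-open, as BED expects.

```python
import csv
import subprocess
from typing import Iterable, List, Tuple

# (chrom, start, end); assumed 0-based, half-open, as in the BED format.
Region = Tuple[str, int, int]


def write_bed(regions: Iterable[Region], bed_path: str) -> int:
    """Stream regions to a tab-separated BED file, one row at a time.

    Accepts any iterable (including a generator), so the full set of
    coordinates never has to be materialised as a list.
    Returns the number of rows written.
    """
    count = 0
    with open(bed_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for chrom, start, end in regions:
            writer.writerow([chrom, start, end])
            count += 1
    return count


def getfasta_command(fasta_path: str, bed_path: str, out_path: str) -> List[str]:
    """Build the argument list for `bedtools getfasta`.

    -fi is the input FASTA, -bed the regions file, -fo the output FASTA.
    """
    return ["bedtools", "getfasta", "-fi", fasta_path, "-bed", bed_path, "-fo", out_path]


def run_getfasta(fasta_path: str, bed_path: str, out_path: str) -> None:
    """Run bedtools; requires bedtools on PATH and a readable FASTA file."""
    subprocess.run(getfasta_command(fasta_path, bed_path, out_path), check=True)
```

If the coordinates already live in a pandas DataFrame, `df.to_csv("regions.bed", sep="\t", header=False, index=False)` on the three coordinate columns writes the same kind of file. After that, `run_getfasta("genome.fa", "regions.bed", "regions.fa")` (hypothetical paths) produces the FASTA on disk. If you prefer samtools, note that `samtools faidx` takes region strings of the form `chr:start-end` with 1-based, inclusive coordinates, so BED starts would need to be shifted by one.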