Hi, I have a very simple question - how do I select only a few thousand reads from a large CRAM file in the most efficient way?
I'm talking about thousands of large CRAM files (with millions of reads each), and I want to select 20,000 random reads from each file. I know there are a few commands available for that, but I'm guessing most of them iterate through the entire file, which would be too slow when working with thousands of CRAM files.
So what's the most efficient way to perform this task from your experience?
Are your CRAM files sorted?
"If you wanted truly random reads then I am not sure how you can avoid having to convert the majority, if not the entire, CRAM file and then picking reads using seqtk or reformat.sh from the BBMap suite. If you have access to the right hardware you could start many of these jobs at the same time since they will be brute-force parallelizable."

I found the above quote on a CRAM workflow page here. Not sure if this has been implemented already. If your files are unsorted, then taking 20K reads from the first 100K may be random enough.
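For what it's worth, when seqtk sample is given a fixed read count it keeps memory bounded with reservoir sampling, so it can consume a decoded stream directly. Here is a minimal sketch of that idea in plain awk (the 100-line input and sample size of 5 are arbitrary placeholders), with a hedged version of the actual CRAM pipeline in the comments (file names hypothetical):

```shell
# Reservoir sampling (Algorithm R): keep the first k items, then for each
# later item replace a random slot with probability k/NR.
COUNT=$(seq 1 100 | awk -v k=5 -v seed=42 '
    BEGIN { srand(seed) }
    NR <= k { pool[NR] = $0; next }
    { j = int(rand() * NR) + 1; if (j <= k) pool[j] = $0 }
    END { for (i = 1; i <= k; i++) print pool[i] }
' | wc -l | tr -d " ")
echo "$COUNT"   # 5 lines survive, chosen uniformly at random

# The real pipeline would look something like this (ref.fa and aln.cram are
# hypothetical; seqtk holds only 20000 records in memory via the same trick):
# samtools fastq --reference ref.fa aln.cram | seqtk sample -s100 - 20000 > sub.fq
```

This still decodes every read once, but it avoids writing an intermediate FASTQ to disk.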
Thanks for the suggestion, GenoMax. I'm pretty sure the CRAM files are sorted, so it's not viable to just take the first X reads.
I came across a tool called "bamdownsamplerandom" (https://manpages.ubuntu.com/manpages/impish/man1/bamdownsamplerandom.1.html). I only found this documentation page with no further explanation, and I couldn't find any other results for it on the web. Has anyone used this tool?
This tool still scans all of the reads.
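Since any truly random sample seems to require at least one pass over the file, a single streaming pass with samtools itself is probably the practical floor, and it avoids a BAM/FASTQ detour entirely. A minimal sketch, assuming hypothetical file names and a recent samtools (the --subsample option was added around 1.13; older versions use -s SEED.FRACTION instead):

```shell
# Hypothetical inputs: aln.cram with ~5M reads, ref.fa as the CRAM reference.
WANT=20000
TOTAL=5000000        # in practice: TOTAL=$(samtools view -c -T ref.fa aln.cram)
FRAC=$(awk -v n="$WANT" -v t="$TOTAL" 'BEGIN { printf "%.6f", n / t }')
echo "$FRAC"         # fraction of reads to keep, passed to --subsample

# One streaming pass over the CRAM; every record is still decoded, but the
# filtering happens in C and only the kept ~20K reads are written out:
# samtools view -T ref.fa --subsample "$FRAC" --subsample-seed 42 \
#     -o sub.bam aln.cram
```

Note this gives approximately 20K reads (each read is kept independently with probability FRAC), not exactly 20K; if the exact count matters, sample slightly more and trim.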