Hi, I have a very simple question - how do I select only a few thousand reads from a large CRAM file in the most efficient way?
I'm talking about thousands of large CRAM files (with millions of reads each), and I want to select 20,000 random reads from each file. I know there are a few commands available for that, but I'm guessing most of them iterate through the entire file, which would be too slow when working with thousands of CRAM files.
So what's the most efficient way to perform this task from your experience?
Are your CRAM files sorted?
"If you wanted truly random reads then I am not sure how you can avoid having to convert the majority, if not the entire, CRAM file and then picking reads using seqtk or reformat.sh from the BBMap suite. If you have access to the right hardware you could start many of these jobs at the same time since they will be brute-force parallelizable."

I found the above quote on a CRAM workflow page here. Not sure if this has been implemented already. If your files are unsorted, then taking 20K reads from the first 100K may be random enough.
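For what it's worth, when seqtk sample is given a fixed read count it keeps memory bounded with reservoir sampling, so it can consume a decoded stream directly. Here is a minimal sketch of that idea in plain awk (the 100-line input and sample size of 5 are arbitrary placeholders), with a hedged version of the actual CRAM pipeline in the comments (file names hypothetical):

```shell
# Reservoir sampling (Algorithm R): keep the first k items, then for each
# later item replace a random slot with probability k/NR.
COUNT=$(seq 1 100 | awk -v k=5 -v seed=42 '
    BEGIN { srand(seed) }
    NR <= k { pool[NR] = $0; next }
    { j = int(rand() * NR) + 1; if (j <= k) pool[j] = $0 }
    END { for (i = 1; i <= k; i++) print pool[i] }
' | wc -l | tr -d " ")
echo "$COUNT"   # 5 lines survive, chosen uniformly at random

# The real pipeline would look something like this (ref.fa and aln.cram are
# hypothetical; seqtk holds only 20000 records in memory via the same trick):
# samtools fastq --reference ref.fa aln.cram | seqtk sample -s100 - 20000 > sub.fq
```

This still decodes every read once, but it avoids writing an intermediate FASTQ to disk.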
Thanks for the suggestion, GenoMax. I'm pretty sure the CRAM files are sorted, so it's not viable to just take the first X reads.
I came across a tool called "bamdownsamplerandom" (https://manpages.ubuntu.com/manpages/impish/man1/bamdownsamplerandom.1.html). I only found this documentation page with no further explanation, and I couldn't find any other results for it on the web. Has anyone used this tool?
This tool still scans all of the reads.
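Since any truly random sample seems to require at least one pass over the file, a single streaming pass with samtools itself is probably the practical floor, and it avoids a BAM/FASTQ detour entirely. A minimal sketch, assuming hypothetical file names and a recent samtools (the --subsample option was added around 1.13; older versions use -s SEED.FRACTION instead):

```shell
# Hypothetical inputs: aln.cram with ~5M reads, ref.fa as the CRAM reference.
WANT=20000
TOTAL=5000000        # in practice: TOTAL=$(samtools view -c -T ref.fa aln.cram)
FRAC=$(awk -v n="$WANT" -v t="$TOTAL" 'BEGIN { printf "%.6f", n / t }')
echo "$FRAC"         # fraction of reads to keep, passed to --subsample

# One streaming pass over the CRAM; every record is still decoded, but the
# filtering happens in C and only the kept ~20K reads are written out:
# samtools view -T ref.fa --subsample "$FRAC" --subsample-seed 42 \
#     -o sub.bam aln.cram
```

Note this gives approximately 20K reads (each read is kept independently with probability FRAC), not exactly 20K; if the exact count matters, sample slightly more and trim.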