I am attempting to run a comparison a batch of alignments (bam format) to assess spread. However, some of my datasets are larger than others and I wish to subsample down to an equal number of reads per sample.
If I wanted to randomly extract 1 Million reads from a bam file is there a method to do this?
Note: I am fully aware of the samtools and picard methods which allow you to reduce by a proportion (i.e. the flag -s 0.33) but that would not result in an fixed number of reads per sample, which is what I need, but a reduced proportion per sample which doesn't help.
ADD COMMENT
• link
updated 2.0 years ago by
Ram
44k
•
written 9.6 years ago by
Daniel
★
4.0k
1
Entering edit mode
Is there a reason you cannot use bash like several have commented on in the posts you link to? For example, see Sample Sam File as it shows how to get 1 million randomly selected reads from your bam file.
ADD REPLY
• link
updated 2.0 years ago by
Ram
44k
•
written 9.6 years ago by
alolex
▴
960
0
Entering edit mode
I was hoping there would be a tool already existing without needing me to create 500gb of temporary files only to compress them again.
BTW, if reformat.sh won't do this then I'm tempted to just write a small program that will subsample a fixed number of alignments (or pairs for PE datasets) from SAM/BAM/CRAM files using reservoir sampling, since it's simple enough to do.
I've tried out BBmap as it purported to do as you said, but got some java errors and given @matted's answer below I've just decided to stick with that as I'm not Java savvy. I tried all combinations of sam and bam for ins and outs too. If of interest:
Wow, yes, I'm an idiot. Didn't read it well enough as I was trying to rush it. It looks like it's running now. Good test of this versus the shuf method below at least, as they're both going simultaneously.
Rule 1: RTFM
ADD REPLY
• link
updated 2.0 years ago by
Ram
44k
•
written 9.6 years ago by
Daniel
★
4.0k
1
Entering edit mode
I'd just like to add one comment...
This command will work fine, with 2 caveats. Firstly, reformat.sh will keep paired reads together for fastq and fasta input, but NOT for sam/bam input. Second, if you have secondary alignments with asterisks instead of bases, that is exactly how they will come out (unless you set primaryonly=t).
Think I'm going to have to walk away from it today and I didn't put a time count on it unfortunately. Will do a test tomorrow.
ADD REPLY
• link
updated 2.0 years ago by
Ram
44k
•
written 9.6 years ago by
Daniel
★
4.0k
0
Entering edit mode
Hi Daniel,
Do you want to have 1M reads or 1M alignments? While sampling the latter you might need to control for the NH tag and the secondary alignment flag. Otherwise, you could extract the unique read ids, shuffle and select 1M reads and filter your BAM file with these ids.
This is based on some of the earlier threads mentioned in the comments, but you can use the GNU shuf utility to get a fixed number of random lines in a file. Recent versions of the tool use reservoir sampling, which will improve efficiency. You could run something like:
shuf -n 1000000 input.sam > output.sam
To get the header correctly, you could do something like:
Thanks, this seems a lot less onerous than when I first looked at the other options with the intermediate files. Pipes save the day. I'm still surprised that there isn't a canonical tool for this though.
ADD REPLY
• link
updated 2.0 years ago by
Ram
44k
•
written 9.6 years ago by
Daniel
★
4.0k
0
Entering edit mode
My guess as to why there's no existing tools to do this (as far as we know) is that random subsampling gives answers that are almost correct, in terms of the number of output reads, and is easier to implement.
If you take the sampled number of reads as a binomial random variable, the standard deviation is pretty tiny compared to the number of reads (at most 1000 reads when sampling 1M, which is 0.1% of the total). In fact, the standard deviation as a fraction of the expected number of sampled reads N is at most 1/sqrt(N). The argument might be that that 0.1% variation shouldn't cause problems, or if it does, you may already be in trouble.
ADD REPLY
• link
updated 2.0 years ago by
Ram
44k
•
written 9.6 years ago by
matted
7.8k
0
Entering edit mode
reformat.sh will give an exact number of reads (or bases), but at the expense of reading the file twice. For an approximate answer, it only has to read the file once, which is faster.
Use sample to sample 1M lines without replacement from the parent, headerless SAM, which uses reservoir sampling of line offsets to get around memory limitations in GNU shuf
Reheader the SAM-formatted sample
Recompress the reheadered, SAM-formatted sample to a BAM file
sample tool looks really cool. but i have a question on samtools reheader nohead_reads_1M_sample.sam reads.bam step, nohead_reads_1M_sample.sam doesn't have header and samtool reheader is supposed to use input sam header to modify input bam file. So this line doesn't make sense, right?
I have tried the suggestion above with a little modification. My case is to sample 20 M reads each file on hundreds of bam files with replacement . The "sample" is the best tool.
Finally I lost some reads, because the sampling also happened at the readgroup. But with the bad sampling header, the reads_20M.sam could be coverted to the bam format.
Given Brian's mention of the two-pass nature of reformat.sh for extracting exact numbers of reads, I suspect that this java solution will be the fastest (excluding IO bottlenecks).
I don't think that's right. Glancing through the code, Pierre's solution shuffles (sorts by random indices) the entire BAM (say size N), then takes the first M. So the runtime is O(N*log(N)). The memory usage is trickier to analyze because sorting will go to disk, but it's at least O(M).
Alex's solution (and mine with shuf, assuming a new enough version of shuf) is O(N), with O(M) memory usage by reservoir sampling. And the random sampling methods (with approximate total counts) are O(N) with O(1) memory usage.
ADD REPLY
• link
updated 2.0 years ago by
Ram
44k
•
written 9.6 years ago by
matted
7.8k
0
Entering edit mode
Indeed, I guess I hadn't looked at the code closely enough.
I didn't read the whole thread, but, am i wrong: when using a reservoir sampling with a small number of reads, the output file tends towards containing the last reads in the BAM ? So, if a large bam contains unmapped reads (at the end) and N=10, for example, the output will only contains unmapped reads. ?
With a small N and large number of reads, the probability that rand(1,i)<=N, for the ith entry in the file, will become smaller with increasing i. Actually, this is true for any N.
Is there a reason you cannot use bash like several have commented on in the posts you link to? For example, see Sample Sam File as it shows how to get 1 million randomly selected reads from your bam file.
I was hoping there would be a tool already existing without needing me to create 500gb of temporary files only to compress them again.
Well it depends on the downstream application, but you can often get around writing intermediates to disk by redirection ...
Yeah, I see. I just expected there to be something slick as I imagine it's a common process. Thanks for the info.
reformat.sh
from BBMap can accept asample=
parameter and can likely handle BAM files. Odds are good that that'll work well, but I've never tried.BTW, if
reformat.sh
won't do this then I'm tempted to just write a small program that will subsample a fixed number of alignments (or pairs for PE datasets) from SAM/BAM/CRAM files using reservoir sampling, since it's simple enough to do.I've tried out BBmap as it purported to do as you said, but got some java errors and given @matted's answer below I've just decided to stick with that as I'm not Java savvy. I tried all combinations of sam and bam for ins and outs too. If of interest:
Perhaps this wouldn't yield the error:
Wow, yes, I'm an idiot. Didn't read it well enough as I was trying to rush it. It looks like it's running now. Good test of this versus the shuf method below at least, as they're both going simultaneously.
Rule 1: RTFM
I'd just like to add one comment...
This command will work fine, with 2 caveats. Firstly,
reformat.sh
will keep paired reads together for fastq and fasta input, but NOT for sam/bam input. Second, if you have secondary alignments with asterisks instead of bases, that is exactly how they will come out (unless you setprimaryonly=t
).Cool, I'd be interested in hearing the difference in time required.
Edit: Though really this should be IO bound...
Think I'm going to have to walk away from it today and I didn't put a time count on it unfortunately. Will do a test tomorrow.