Is there a tool like bedtools shuffle which I can use to randomly shuffle a bam file? or will I have to convert my bam into a bed and then shuffle it? thanks
Is there a tool like bedtools shuffle which I can use to randomly shuffle a bam file? or will I have to convert my bam into a bed and then shuffle it? thanks
use https://github.com/lindenb/jvarkit/wiki/Biostar145820 to shuffle the reads with option `-n -1`
One option is to use bam2bed in conjunction with sample:
$ bam2bed < reads.bam > reads.bed
$ sample -k 1234 reads.bed > random_sample.bed
One advantage of sample
over GNU shuf
is that shuf
loads everything into memory before shuffling, while sample
uses reservoir sampling on line offsets so that the memory overhead is much, much lower. Memory usage can be an issue for very large input files.
samtools bamshuf file.bam
does the job. It'll also break up your bam in -n
(default=64) chunks. Unfortunately though, it wants to shuffle the entire bam and then writes the chunks to disk (does not stream even with the -O
option). If you just want to sample a few reads from a large bam quickly, you can "shuffle" some coordinates and then use samtools view file.bam chr:start-start | head -1
to get a read. If you want read pairs, you can either use samtools bamshuf
or sort the bam by name(samtools sort -n
).
Just to point out samtools bamshuf
is found only up to SAMtools 1.2, from SAMtools 1.3 on, the command is samtools collate
. In addition, samtools collate
does output to stdout (I didn't test with bamshuf but maybe works too), but it needs to create the intermediary files nonetheless.
samtools collate -u -O file.bam tmp | whatever
This works, but creates tmp.001.bam
, tmp.002.bam`, and so on.
If your BAM file contains paired-end data, you may want to be careful that reads for each arm are in proximity to each other, and I believe the above methods will result in the arms ending up scattered.
For example, a common use case for "shuffling" a BAM is to take an already aligned BAM and shuffle it in order to eliminate biases you get when mapping a bam that is sorted in chromosome order. If this is your motivation for doing the shuffling, you would be best to use samtools bamshuf
(which effectively sorts by a hash of the read name, so paired end reads are kept together, but avoiding a merge step that a proper sort-by-name requires).
I tend to specify a fraction of whatever lines I want (which I have to know in advance), and then use rejection sampling (i.e., if (rand() > fraction) { print it };
).This can use stdin and stdout, is exceedlingly fast, but maintains the original order. If that is undesirable, you can as yet subject the result to sample (which sounds like a useful thing)
Oh, and I just see that samtools view
has an option -s SEED.percentage
-s FLOAT integer part sets seed of random number generator [0];
rest sets fraction of templates to subsample [no subsampling]
This is prolly the fastest way to do it.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I would use `shuf` for shuffling alignment lines taking care about header: detaching it before and reattaching it after
`shuf` loads everything in memory...
good point, will `sort --random-sort` do the job? (but it'll be slow...)
In case you ever have your bam as GRanges (GenomicRanges package) in R, you may use the
sample(GRanges, size, replace=FALSE)
command.