randomly shuffle BAM file
7
2
Entering edit mode
9.4 years ago
mparker2 ▴ 20

Is there a tool like bedtools shuffle which I can use to randomly shuffle a bam file? or will I have to convert my bam into a bed and then shuffle it? thanks

bam bedtools shuffle • 9.3k views
ADD COMMENT
0
Entering edit mode

I would use `shuf` for shuffling alignment lines taking care about header: detaching it before and reattaching it after

ADD REPLY
3
Entering edit mode

`shuf` loads everything in memory...

ADD REPLY
0
Entering edit mode

good point, will `sort --random-sort` do the job? (but it'll be slow...)

ADD REPLY
0
Entering edit mode

In case you ever have your bam as GRanges (GenomicRanges package) in R, you may use the sample(GRanges, size, replace=FALSE) command.

ADD REPLY
2
Entering edit mode
9.4 years ago

use https://github.com/lindenb/jvarkit/wiki/Biostar145820 to shuffle the reads with option `-n -1`

ADD COMMENT
0
Entering edit mode

does this program actually move the positions of reads on the chromosome randomly, like I think bedtools shuffle does, or does it just just shuffle the line numbers of reads in the file? its not clear to me from the page

ADD REPLY
0
Entering edit mode

it shuffles the reads in the file.

ADD REPLY
1
Entering edit mode
9.3 years ago

One option is to use bam2bed in conjunction with sample:

$ bam2bed < reads.bam > reads.bed
$ sample -k 1234 reads.bed > random_sample.bed

One advantage of sample over GNU shuf is that shuf loads everything into memory before shuffling, while sample uses reservoir sampling on line offsets so that the memory overhead is much, much lower. Memory usage can be an issue for very large input files.

ADD COMMENT
1
Entering edit mode
7.5 years ago
skanterakis ▴ 130

samtools bamshuf file.bam does the job. It'll also break up your bam in -n (default=64) chunks. Unfortunately though, it wants to shuffle the entire bam and then writes the chunks to disk (does not stream even with the -O option). If you just want to sample a few reads from a large bam quickly, you can "shuffle" some coordinates and then use samtools view file.bam chr:start-start | head -1 to get a read. If you want read pairs, you can either use samtools bamshuf or sort the bam by name(samtools sort -n).

ADD COMMENT
0
Entering edit mode

Just to point out samtools bamshuf is found only up to SAMtools 1.2, from SAMtools 1.3 on, the command is samtools collate. In addition, samtools collate does output to stdout (I didn't test with bamshuf but maybe works too), but it needs to create the intermediary files nonetheless.

samtools collate -u -O file.bam tmp | whatever

This works, but creates tmp.001.bam, tmp.002.bam`, and so on.

ADD REPLY
0
Entering edit mode

It appears that if you run samtools bamshuf on more recent versions of samtools, it translates that to samtools collate automatically.

ADD REPLY
1
Entering edit mode
9.3 years ago
Len Trigg ★ 1.6k

If your BAM file contains paired-end data, you may want to be careful that reads for each arm are in proximity to each other, and I believe the above methods will result in the arms ending up scattered.

For example, a common use case for "shuffling" a BAM is to take an already aligned BAM and shuffle it in order to eliminate biases you get when mapping a bam that is sorted in chromosome order. If this is your motivation for doing the shuffling, you would be best to use samtools bamshuf (which effectively sorts by a hash of the read name, so paired end reads are kept together, but avoiding a merge step that a proper sort-by-name requires).

ADD COMMENT
0
Entering edit mode
9.3 years ago
Ian 6.1k

You could use the bedtools suite of programs.

bamtobed > shuffle > bedtobam

ADD COMMENT
0
Entering edit mode
8.6 years ago
plijnzaad ▴ 40

I tend to specify a fraction of whatever lines I want (which I have to know in advance), and then use rejection sampling (i.e., if (rand() > fraction) { print it }; ).This can use stdin and stdout, is exceedlingly fast, but maintains the original order. If that is undesirable, you can as yet subject the result to sample (which sounds like a useful thing)

ADD COMMENT
0
Entering edit mode
8.6 years ago
plijnzaad ▴ 40

Oh, and I just see that samtools view has an option -s SEED.percentage

   -s FLOAT integer part sets seed of random number generator [0];
               rest sets fraction of templates to subsample [no subsampling]

This is prolly the fastest way to do it.

ADD COMMENT
0
Entering edit mode

I don't think this randomizes the file, but rather takes a random subsample.

ADD REPLY

Login before adding your answer.

Traffic: 1915 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6