Question

randomly shuffle BAM file

2

9.8 years ago

mparker2 ▴ 20

Is there a tool like bedtools shuffle that I can use to randomly shuffle a BAM file, or will I have to convert my BAM to BED and then shuffle it? Thanks.

bam bedtools shuffle • 10k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 9.8 years ago by mparker2 ▴ 20

0

I would use `shuf` to shuffle the alignment lines, taking care of the header: detach it before shuffling and reattach it afterwards.
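A minimal sketch of the header-detaching approach, assuming the BAM has already been converted to SAM text (for example with `samtools view -h in.bam > in.sam`). The function name `shuffle_sam` and the file names are placeholders for illustration, not part of any tool.

```shell
# Shuffle the alignment records of a SAM text file, keeping the header on top.
# Header lines start with "@"; read names (QNAME) may not start with "@".
shuffle_sam() {
    sam="$1"
    grep '^@' "$sam"            # header lines, in their original order
    grep -v '^@' "$sam" | shuf  # alignment records, in random order
}
```

The result can be converted back with something like `shuffle_sam in.sam | samtools view -b -o shuffled.bam -`. Note that `shuf` holds all records in memory, so this suits small to medium files only.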

ADD REPLY • link 9.8 years ago by Pavel Senin ★ 1.9k

3

`shuf` loads everything in memory...

ADD REPLY • link 9.8 years ago by Pierre Lindenbaum 166k

0

good point, will `sort --random-sort` do the job? (but it'll be slow...)

ADD REPLY • link 9.8 years ago by Pavel Senin ★ 1.9k

0

In case you ever have your bam as GRanges (GenomicRanges package) in R, you may use the sample(GRanges, size, replace=FALSE) command.

ADD REPLY • link 8.0 years ago by ATpoint 88k

1

9.8 years ago

Len Trigg ★ 1.7k

If your BAM file contains paired-end data, you may want to be careful that reads for each arm are in proximity to each other, and I believe the above methods will result in the arms ending up scattered.

For example, a common use case for "shuffling" a BAM is to take an already aligned BAM and shuffle it in order to eliminate biases you get when mapping a bam that is sorted in chromosome order. If this is your motivation for doing the shuffling, you would be best to use samtools bamshuf (which effectively sorts by a hash of the read name, so paired end reads are kept together, but avoiding a merge step that a proper sort-by-name requires).
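To illustrate why ordering by a hash of the read name keeps mates together, here is a rough sketch that works on tab-separated SAM records (header removed). It is only an illustration of the idea, not the samtools implementation: the function name `hash_order_by_name` and the djb2-style hash are assumptions chosen for the example.

```shell
# Order tab-separated records by a hash of column 1 (the read name), so that
# records sharing a name stay adjacent while the overall order is scrambled.
# Illustration only: this is not the samtools implementation or hash function.
hash_order_by_name() {
    tab=$(printf '\t')
    awk -F'\t' '
        BEGIN { for (i = 1; i < 128; i++) ord[sprintf("%c", i)] = i }
        # djb2-style string hash, kept below 2^31 so awk arithmetic stays exact
        function name_hash(s,    i, v) {
            v = 5381
            for (i = 1; i <= length(s); i++)
                v = (v * 33 + ord[substr(s, i, 1)]) % 2147483647
            return v
        }
        { print name_hash($1) "\t" $0 }
    ' | sort -t "$tab" -k1,1n -k2,2 | cut -f2-
}
```

Because the sort key depends only on the read name, both mates of a pair receive the same key and end up next to each other. The order is scrambled relative to chromosome order but repeatable from run to run.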

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 9.8 years ago by Len Trigg ★ 1.7k

0

9.8 years ago

Ian 6.1k

You could use the bedtools suite of programs.

bamtobed -> shuffle -> bedtobam

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 9.8 years ago by Ian 6.1k

0

9.1 years ago

plijnzaad ▴ 40

I tend to specify the fraction of lines I want (which I have to know in advance), and then use rejection sampling (i.e., if (rand() < fraction) { print it }). This can use stdin and stdout and is exceedingly fast, but it maintains the original order. If that is undesirable, you can still pass the result to sample (which sounds like a useful thing).
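A minimal sketch of this kind of per-line sampling with awk. The function name `subsample_lines` and the default seed are hypothetical, and it assumes header lines have already been set aside, since every input line is treated alike.

```shell
# Keep each input line independently with probability "fraction" (0..1).
# Output preserves input order; the number of kept lines varies with the seed.
subsample_lines() {
    fraction="$1"
    seed="${2:-42}"   # hypothetical default seed, for repeatable runs
    awk -v f="$fraction" -v s="$seed" 'BEGIN { srand(s) } rand() < f'
}
```

For example, `subsample_lines 0.1 < records.sam` would keep roughly 10% of the lines, reading stdin and writing stdout.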

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 9.1 years ago by plijnzaad ▴ 40

0

9.1 years ago

plijnzaad ▴ 40

Oh, and I just saw that samtools view has an option -s SEED.FRACTION:

   -s FLOAT integer part sets seed of random number generator [0];
               rest sets fraction of templates to subsample [no subsampling]

This is probably the fastest way to do it.

ADD COMMENT • link 9.1 years ago by plijnzaad ▴ 40

0

I don't think this randomizes the file, but rather takes a random subsample.

ADD REPLY • link 9.1 years ago by Matt Shirley 10k

score 2 · Accepted Answer · 2015-07-17

2

9.8 years ago

Pierre Lindenbaum 166k

Use https://github.com/lindenb/jvarkit/wiki/Biostar145820 to shuffle the reads, with the option `-n -1`.

ADD COMMENT • link 9.8 years ago by Pierre Lindenbaum 166k

0

Does this program actually move the positions of reads on the chromosome randomly, as I think bedtools shuffle does, or does it just shuffle the order of the read lines in the file? It's not clear to me from the page.

ADD REPLY • link 9.8 years ago by mparker2 ▴ 20

0

it shuffles the reads in the file.

ADD REPLY • link 9.8 years ago by Pierre Lindenbaum 166k

Ram · Accepted Answer · 2015-07-27

One option is to use bam2bed in conjunction with sample:

$ bam2bed < reads.bam > reads.bed
$ sample -k 1234 reads.bed > random_sample.bed

One advantage of sample over GNU shuf is that shuf loads everything into memory before shuffling, while sample uses reservoir sampling on line offsets so that the memory overhead is much, much lower. Memory usage can be an issue for very large input files.
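For readers unfamiliar with reservoir sampling, here is a rough awk sketch of the basic technique. It is a simplified variant that keeps k whole lines in memory rather than line offsets, so it is not how `sample` itself is implemented; the function name `reservoir_sample` and the default seed are hypothetical.

```shell
# Reservoir sampling (Algorithm R): pick k lines uniformly at random from a
# stream of unknown length, holding only k lines in memory at any time.
reservoir_sample() {
    k="$1"
    seed="${2:-42}"   # hypothetical default seed, for repeatable runs
    awk -v k="$k" -v s="$seed" '
        BEGIN { srand(s) }
        NR <= k { keep[NR] = $0; next }   # fill the reservoir first
        {
            j = int(rand() * NR) + 1      # uniform integer in 1..NR
            if (j <= k) keep[j] = $0      # replace a slot with probability k/NR
        }
        END {
            n = (NR < k) ? NR : k
            for (i = 1; i <= n; i++) print keep[i]
        }
    '
}
```

For example, `reservoir_sample 1234 < reads.bed` would return 1234 lines chosen uniformly from the file. The kept lines are not themselves in a fully random order, so pipe the (small) result through `shuf` if the order matters.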

score 1 · Accepted Answer · 2017-05-12

1

8.0 years ago

skanterakis ▴ 130

samtools bamshuf file.bam does the job. It'll also break up your BAM into -n (default=64) chunks. Unfortunately, though, it wants to shuffle the entire BAM and then writes the chunks to disk (it does not stream, even with the -O option). If you just want to sample a few reads from a large BAM quickly, you can "shuffle" some coordinates and then use samtools view file.bam chr:start-start | head -1 to get a read. If you want read pairs, you can either use samtools bamshuf or sort the BAM by name (samtools sort -n).

ADD COMMENT • link 8.0 years ago by skanterakis ▴ 130

0

Just to point out: samtools bamshuf is found only up to SAMtools 1.2; from SAMtools 1.3 on, the command is samtools collate. In addition, samtools collate does output to stdout (I didn't test this with bamshuf, but it may work too), although it still needs to create the intermediate files.

samtools collate -u -O file.bam tmp | whatever

This works, but creates tmp.001.bam, tmp.002.bam, and so on.

ADD REPLY • link 7.6 years ago by h.mon 35k

0

It appears that if you run samtools bamshuf on more recent versions of samtools, it translates that to samtools collate automatically.

ADD REPLY • link 4.8 years ago by alanh ▴ 170