I'm trying to explore the effect of downsampling on ChIP-seq peak calling. In each trial, I downsample the original BAM file to 1%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100% of its original size, then graph the number of peaks called on each of these 12 downsampled files.
Example of the downsampling code:

    # Downsample to the given fraction and convert to BED in one pass;
    # RANDOM_SEED=null tells Picard to pick a new random seed on each run.
    java -Xmx1g -jar picardpath/DownsampleSam.jar \
        I=${runpath} O=/dev/stdout \
        PROBABILITY=${prob} \
        VALIDATION_STRINGENCY=SILENT \
        RANDOM_SEED=null \
    | bamToBed -i stdin > ${runname}-d${prob}.bed
Documentation for the downsampling parameters: Overview of Picard command-line tools
I then repeat this process 10 times, each trial generating its own curve with 12 points. In particular, I set RANDOM_SEED=null so that (in theory) each trial draws an independent random sample.
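For reference, the full experiment amounts to a loop like the sketch below. The -t${trial}- part of the output name is my addition so trials don't overwrite each other; picardpath, runpath, and runname are the same placeholders as above, and the peak-calling step is left as a comment since it depends on your caller:

    # A minimal sketch of the trial loop: 10 trials x 12 fractions.
    probs="0.01 0.05 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0"
    for trial in $(seq 1 10); do
        for prob in $probs; do
            java -Xmx1g -jar picardpath/DownsampleSam.jar \
                I=${runpath} O=/dev/stdout \
                PROBABILITY=${prob} \
                VALIDATION_STRINGENCY=SILENT \
                RANDOM_SEED=null \
            | bamToBed -i stdin > ${runname}-t${trial}-d${prob}.bed
            # ...call peaks on the BED file and record the peak count...
        done
    done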
However, when examining these 10 trials, the downsampling has behaved strangely in two ways. First, the curves take on vastly different shapes, even though the process is in theory identically distributed across trials. Second, within each observed shape, several trials behave almost identically, in fact suspiciously so. How can the extremes range from almost no change to complete change?
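If several trials really drew the exact same sample at a given fraction, their BED files should match byte for byte. A quick diagnostic (assuming the hypothetical -t${trial}- naming from the sketch above; 0.5 is just an example fraction):

    # Identical checksums across trials at the same fraction would mean
    # the random seed did not actually vary between runs.
    md5sum ${runname}-t*-d0.5.bed
    wc -l ${runname}-t*-d0.5.bed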
In principle I'd agree with you, Istvan, but what's puzzling me is why, when the downsampling percentage is 100, we still observe different numbers of peaks. At a 100% downsample, all trials should show the same number of peaks.
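One way to confirm this would be to checksum the 100% outputs across trials: with PROBABILITY=1.0 every read should pass through, so the BED files ought to be byte-identical (again assuming the hypothetical -t${trial}- naming):

    # All checksums should agree if PROBABILITY=1.0 keeps every read.
    md5sum ${runname}-t*-d1.0.bed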