I'm trying to explore the effect of downsampling on ChIP-seq peak calling. In each trial, I downsample the original BAM file to 1%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100% of its original size, then graph the number of peaks called on each of these 12 downsampled files.
Example of the downsampling code:

    # Downsample to the given fraction and convert to BED in one pass;
    # RANDOM_SEED=null tells Picard to pick a new random seed on each run.
    java -Xmx1g -jar picardpath/DownsampleSam.jar \
        I=${runpath} O=/dev/stdout \
        PROBABILITY=${prob} \
        VALIDATION_STRINGENCY=SILENT \
        RANDOM_SEED=null \
    | bamToBed -i stdin > ${runname}-d${prob}.bed
Documentation for the downsampling parameters: Overview of Picard command-line tools
I then repeat this process 10 times, each trial generating its own curve with 12 points. In particular, I set RANDOM_SEED=null so that (in theory) each trial draws an independent random sample.
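For reference, the full experiment amounts to a loop like the sketch below. The -t${trial}- part of the output name is my addition so trials don't overwrite each other; picardpath, runpath, and runname are the same placeholders as above, and the peak-calling step is left as a comment since it depends on your caller:

    # A minimal sketch of the trial loop: 10 trials x 12 fractions.
    probs="0.01 0.05 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0"
    for trial in $(seq 1 10); do
        for prob in $probs; do
            java -Xmx1g -jar picardpath/DownsampleSam.jar \
                I=${runpath} O=/dev/stdout \
                PROBABILITY=${prob} \
                VALIDATION_STRINGENCY=SILENT \
                RANDOM_SEED=null \
            | bamToBed -i stdin > ${runname}-t${trial}-d${prob}.bed
            # ...call peaks on the BED file and record the peak count...
        done
    done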
However, when examining these 10 trials, the downsampling has behaved strangely in two ways. First, the curves take on vastly different shapes, even though the process is in theory identically distributed across trials. Second, within each observed shape, several trials behave almost identically, in fact suspiciously so. How can the extremes range from almost no change to complete change?
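If several trials really drew the exact same sample at a given fraction, their BED files should match byte for byte. A quick diagnostic (assuming the hypothetical -t${trial}- naming from the sketch above; 0.5 is just an example fraction):

    # Identical checksums across trials at the same fraction would mean
    # the random seed did not actually vary between runs.
    md5sum ${runname}-t*-d0.5.bed
    wc -l ${runname}-t*-d0.5.bed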
In principle I'd agree with you, Istvan, but what's puzzling me is why, when the downsampling percentage is 100, we still observe different numbers of peaks. At a 100% downsample, all trials should show the same number of peaks.
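One way to confirm this would be to checksum the 100% outputs across trials: with PROBABILITY=1.0 every read should pass through, so the BED files ought to be byte-identical (again assuming the hypothetical -t${trial}- naming):

    # All checksums should agree if PROBABILITY=1.0 keeps every read.
    md5sum ${runname}-t*-d1.0.bed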