Question

Methods to subset sequenced experiments to account for differences in read counts?

0

Entering edit mode

11.1 years ago

daniel.soronellas ▴ 330

Hi All,

I'm looking for advice and also to know what people usually do to account for differences between different sequenced experiments. For example if you have 2 ChIP-seq/RNA-seq/... experiments, one with 60M reads and the other with 40M reads, how do you obtain:

BAM/SAM/BED file with the same number of reads?
A normalized WIGGLE/BEDGRAPH/BIGWIG signal track (i.e. RPM units)?

For the first question I searched a bit and found that "samtools view" have the option "-s" which you could indicate the fraction of templates to subsample from the raw mapped file, like this:

samtools view -bh -s 0.66 bam.file > subsampled.bam.file

For the second question I was thinking to take a wiggle file and divide values at each bp by total number of reads and then multiply by 1e6, like this:

awk '{if($1 !~ /^fixed/){OFS="\t"; printf "%0.2f\n",($1/60000000)*1e6}else{print $0}}' raw.fixedStep.wig.file > normalized.wiggle.file

Do you think this methods are good to account for read differences and to normalize signal tracks?

Thanks!

genome sequencing down-sampling • 2.9k views

ADD COMMENT • link updated 3.8 years ago by Ram 45k • written 11.1 years ago by daniel.soronellas ▴ 330

0

Entering edit mode

For #1, why do you need that? It would be easier to use a library normalization and then weight samples accordingly in the followup analysis (this is how most RNAseq analysis packages work).

ADD REPLY • link 11.1 years ago by Devon Ryan 105k

Ram · Answer 1 · 2014-07-02

Have you tried DownsampleSAM from Picard? And it would be not very good idea to "normalize" a bed signal track, you could easily shift some of your peaks below stochastic threshold (1/coverage). I would rather recommend to downsample SAM and then re-run your analysis. In case there is a big difference between initial size and down-sampled size, run down-sampling several times to create replicates. In this case don't forget to change seed (RANDOM_SEED parameter)