Should we normalize different numbers of reads?
7.8 years ago
PoGibas 5.1k

I want to normalize data for an enrichment-based method (e.g., MeDIP, which captures methylated DNA sites).
Let's say I have two samples: a real sample (which targets modified DNA) and a dummy control (in which reads are distributed randomly along the genome). The number of reads in the real sample is N times greater than in the dummy control sample.

My question is: should I normalize the number of reads between the two samples?

Case A: On one hand, it is logical to normalize the number of reads, since I will probably want to compare mean coverage between my samples. In this case, I can divide the coverage per CpG by the total number of reads (see the sketch after these two cases).
Case B: On the other hand, maybe the lower number of reads in the dummy control is the result of a biological process (e.g., the control sample contains no methylated DNA sites, hence no targets to be enriched, and that is why we get far fewer reads for this sample).
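To make case A concrete, here is a minimal sketch of that scaling, assuming raw per-CpG coverage and a total read count are already known for each sample; all names and numbers below are made up for illustration.

```python
# Minimal sketch of "case A": scale raw per-CpG coverage by library size
# (reads-per-million style). Coverage values and totals are hypothetical.

def normalize_per_cpg(coverage, total_reads):
    """Convert raw per-CpG read counts to coverage per million sequenced reads."""
    scale = 1e6 / total_reads
    return {site: count * scale for site, count in coverage.items()}

# The real sample has N times more reads than the dummy control.
real_cov    = {"chr1:10468": 42, "chr1:10471": 40}   # raw per-CpG coverage
control_cov = {"chr1:10468": 3,  "chr1:10471": 2}

real_norm    = normalize_per_cpg(real_cov, total_reads=30_000_000)
control_norm = normalize_per_cpg(control_cov, total_reads=10_000_000)

for site in real_norm:
    print(site, round(real_norm[site], 2), round(control_norm[site], 2))
```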

I know that a common strategy is to normalize the number of reads. But what if the different number of reads is a biological result? Can we know this? I am interested in how the community deals with this kind of problem.

ChIP-Seq RNA-Seq sequencing
7.8 years ago
Michele Busby ★ 2.2k

No, don't throw out data. You will increase your counting (a.k.a. shot or Poisson) noise and diminish the signal from your methylation.

What you want is more like dividing the methylated sample by the control.
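As an illustration of that idea, here is a rough sketch that compares depth-scaled coverage of the MeDIP sample against the control per region instead of discarding reads; the counts, library sizes, and pseudocount below are assumptions for the example, not part of the answer.

```python
import math

# Rough sketch: log2 ratio of depth-scaled MeDIP coverage over control
# coverage for one region. A small pseudocount avoids division by zero.

def log2_enrichment(medip_count, control_count, medip_total, control_total, pseudo=0.5):
    medip_cpm   = (medip_count + pseudo) * 1e6 / medip_total
    control_cpm = (control_count + pseudo) * 1e6 / control_total
    return math.log2(medip_cpm / control_cpm)

# Hypothetical region: 120 MeDIP reads vs 8 control reads,
# with 30M and 10M total reads respectively.
print(round(log2_enrichment(120, 8, 30_000_000, 10_000_000), 2))
```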


I will add that the control read distribution isn't random, because this is an important point in understanding the experiment.

There will be spikes in the controls caused by artifacts, e.g. PCR amplification artifacts, dodgy alignments, and chromatin accessibility (especially if there is some sort of size selection).

If you load the BAMs into IGV you will see this. The control is used to look for enrichment over this background noise. Removing reads will shrink your true and bogus peaks, making them harder to see.

7.8 years ago

But what if the different number of reads is a biological result? Can we know this?

At what stage of the analysis do you count the number of reads in each sample? If the count is at the level of the raw fastq files, then the difference is likely due to cluster density, i.e. nothing biological, just a technical difference in the amount of library loaded on the flow cell.

If the fastq files have roughly the same number of reads and quality, but the control sample has much more adapter contamination and unmappable sequence, then yes, that could be an indication that the pull-down in the control didn't capture much because there was no target. It's good to look at the alignment duplication rate: pull-down libraries where very little DNA was captured have a very high duplication rate. In this case, looking at the BAM files in a browser should show stacks of reads at the same position next to regions with no or very few reads.
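For example, a quick way to eyeball that duplication rate (assuming duplicates have already been flagged, e.g. with Picard MarkDuplicates or samtools markdup) might look like the following pysam sketch; the file name is a placeholder.

```python
import pysam  # requires a BAM in which duplicates have already been marked

# Fraction of mapped primary reads carrying the duplicate flag.
def duplication_rate(bam_path):
    total = dups = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            total += 1
            if read.is_duplicate:
                dups += 1
    return dups / total if total else float("nan")

print(f"duplication rate: {duplication_rate('control.bam'):.1%}")  # placeholder path
```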

Having said this, without replicates it's difficult to say anything conclusive, since pull-down experiments are, in my experience, quite variable, so the difference you see may just be due to technical variability.

It may be useful to post some actual read counts, duplication rates and some screenshots from a genome browser.
