Good day,
Could anybody please help me understand the following?
I am using bamCompare to normalise reads before running computeMatrix and plotHeatmap (in Galaxy). My aim is to select genes with the protein of interest bound around the TSS.
I have three ChIPed replicates and two controls (just in case; I did not know which one would work better). When I normalise my replicates (settings are given below), one of the replicates fails with:
ERROR: The median coverage computed is zero for sample(s) #[1] Try selecting a larger sample size or a region with coverage
I found a solution here: https://github.com/deeptools/deepTools/issues/599
Following that issue, with "Method to use for scaling the largest sample to the smallest" set to "Signal extraction scaling (SES)", I increased "Length in bases used to sample the genome and compute the size or scaling factors" from the default 100 to 1000, and the error went away.
Now, there is a warning: "The default is fine. Only change it if you know what you are doing." (--sampleLength)
Unfortunately I do not know what I am doing. Could anybody please explain how it affects the results?
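In case it helps to see where I am confused: my (possibly wrong) understanding is that SES samples many windows of --sampleLength bases from the genome, ranks them by ChIP coverage, and finds the point where the cumulative ChIP and input coverage differ most; everything below that point is treated as background and used to compute the scaling factor. A toy sketch of that idea (based on the Diaz et al. 2012 method that deepTools cites, not the actual deepTools code):

```python
import numpy as np

def ses_scale_factor(chip_cov, input_cov):
    """Toy version of Signal Extraction Scaling (Diaz et al. 2012).

    chip_cov / input_cov: per-window read counts from the same set of
    sampled genomic windows (window length = --sampleLength).
    """
    order = np.argsort(chip_cov)                # rank windows by ChIP signal
    chip_cum = np.cumsum(chip_cov[order]) / chip_cov.sum()
    input_cum = np.cumsum(input_cov[order]) / input_cov.sum()
    k = np.argmax(input_cum - chip_cum)         # background/signal boundary
    # scale so that the background portions of both samples match
    return input_cov[order][: k + 1].sum() / chip_cov[order][: k + 1].sum()

# Toy data: flat input; ChIP = flat background plus 50 enriched windows
toy_input = np.full(1000, 10.0)
toy_chip = np.full(1000, 5.0)
toy_chip[-50:] += 100.0
print(ses_scale_factor(toy_chip, toy_input))  # → 2.0
```

If the sampled windows are so short that most contain zero reads, the per-window coverages (and hence their median) collapse to zero, which I assume is why lengthening --sampleLength made the error go away in my case. Please correct me if I have this wrong.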
Also, there is another warning: "Check with plotFingerprint before using it." My plotFingerprint shows that my replicates are very close to the controls (weak signal), and the replicate that fails is even closer (please see the plotFingerprint attached: 1 = input; 2, 3, 4 = replicates; 5 = IgG).
Should I use SES at all in my case? What would be the best normalisation method for data like this?
Thank you :)
bamCompare settings:
Bin size in bases: 50
Method to use for scaling the largest sample to the smallest: SES
Length in bases used to sample the genome and compute the size or scaling factors: 100
Number of samplings taken from the genome to compute the scaling factors: 100000
How to compare the two files: log2
Pseudocount: 1.0
Compute an exact scaling factor: False
Coverage file format: bigwig
Region of the genome to limit the operation to: (empty)
Show advanced options: no
Job Resource Parameters: no
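For reference, I believe these Galaxy settings correspond to roughly this bamCompare command line (the BAM and output file names are placeholders; please correct me if I have mapped any flag wrongly):

```shell
# chip_rep1.bam and input.bam are placeholder file names
bamCompare \
  --bamfile1 chip_rep1.bam \
  --bamfile2 input.bam \
  --binSize 50 \
  --scaleFactorsMethod SES \
  --sampleLength 100 \
  --numberOfSamples 100000 \
  --operation log2 \
  --pseudocount 1.0 \
  --outFileFormat bigwig \
  --outFileName chip_rep1_vs_input_log2.bw
```

(Raising --sampleLength from 100 to 1000 is the change that silenced the error for the failing replicate.)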