Question

Downsampling single-end human ChIP-seq data

1

Entering edit mode

5.9 years ago

msimmer92 ▴ 310

The sequencing of this ChIP-seq H3K4me3 experiment (human, single-end 50bp) has a problem with the number of reads per sample. The person that did the sequencing did not arrange correctly the samples in the lanes, causing the bias you see in the picture (first, upper one). I am trying to "rescue" it doing downsampling, but wanted to make sure I am doing it correctly without loosing too much coverage that I loose robustness in whatever finding I have later on.

I did the downsampling as described in Ming Tang's blog : https://crazyhottommy.blogspot.com/2016/05/downsampling-for-bam-files-to-certain.html . Here there is an example line of code:

samtools view -s 3.6 -b my.bam -o subsample.bam

I did three downsamplings to see what happens: one to 20 M reads, one to 25M and one to 10M. In the figure below you only see raw data, 25M and 10M in that order (from top to bottom). My questions are:

1) How much you think I can downsample without having an unreasonable coverage? Is 10M ok or given that is single-end more should be used? (for example, in Girdhar et al 2018's schizophrenia chip-seq paper in humans they used 13M but it was paired-end sequencing).

2) If I end up deciding 10M is too little and I go with the 25 or 20M one (please, look at the second plot), you can still see a difference between the samples. Is that difference small enough to continue analysing or you think that's already too much?

3) In case in 2) you consider this difference is too big and that that's not good, you have another suggestion of what I could do?

Thank!

enter image description here

ChIP-Seq downsampling subsampling • 3.4k views

ADD COMMENT • link updated 5.9 years ago by colin.kern ★ 1.1k • written 5.9 years ago by msimmer92 ▴ 310

1

Entering edit mode

5.9 years ago

ATpoint 88k

Can you give some background on how the samples are related? If all these samples go into the same differential analysis, I would not do anything towards downsampling, you only throw away information. Call peaks over all reads of IP combined against all input reads combined and then proceed with differential analysis. The normalization of e.g. csaw should take care of the rest. Check by PCA/MDS and Pearson corr. are confounded by the low depth. If so, maybe removing a few of them is better than downsampling as this would penalise all samples. Depends on the question and analysis strategy.

ADD COMMENT • link 5.9 years ago by ATpoint 88k

0

Entering edit mode

Yes, sorry. These are samples from humans with different disease stages and then controls (groups A-H, one of them is control and the others incremental stages). H is control, then A is the first mild stage and E is the worst. The idea is to call peaks with MACS2 and do differential binding analysis with DiffBind, further annotation and pathway analysis with R's ChIPseeker and ClusterProfiler. After consulting with you and others in the Biostar chat, I am doing the CSAW analysis in parallel as a control (at least to see how the MA plots look like). The normalization method in DiffBind that I selected is the default (TMM, full library size) but there's also the option of choosing "TMM effective size". The suggestions you made are good, thank you!

ADD REPLY • link 5.9 years ago by msimmer92 ▴ 310

1

Entering edit mode

If possible, maybe just re-sequence the under-sequenced samples to get some more reads given the quality control of the data you have looks promising in terms of IP efficiency etc.

ADD REPLY • link 5.9 years ago by ATpoint 88k

0

Entering edit mode

This would be my preference but unfortunately the guys that produce this data ran out of tissue and getting more is not easy. But thank you! Additionally, I want to say I tried the downsampling and the results look the same (in the sense that the results I'm observing seem to be so strong and the DiffBind normalization so good that they are still well-seen even with that). I tried other normalization methods and I see the same. I still have to try more suggestions that people here did to me, but so far it's looking good. Thank you very much :)

ADD REPLY • link 5.9 years ago by msimmer92 ▴ 310

0

Entering edit mode

I mean sequence the exact same pool of DNA again. For a NGS run you typically do only need a fraction of the entire library. There should be something left in the tube, no? I typically submit like 25-50% of a sample to sequencing to have a backup in exactly that case or in case of technical errors with the machine.

ADD REPLY • link 5.9 years ago by ATpoint 88k

score 4 · Accepted Answer · 2019-06-28

1) How much you think I can downsample without having an unreasonable coverage? Is 10M ok or given that is single-end more should be used? (for example, in Girdhar et al 2018's schizophrenia chip-seq paper in humans they used 13M but it was paired-end sequencing).

The current ENCODE standard for H3K4me3 is a minimum of 20M mapped and filtered reads, however the previous standard was 10M. Because the important aspect of ChIP-seq analysis is the location of the alignments, not the actual sequence content of the reads, as long as the alignment rate is good with single-end reads then it's not much different from a similar paired-end depth.

2) If I end up deciding 10M is too little and I go with the 25 or 20M one (please, look at the second plot), you can still see a difference between the samples. Is that difference small enough to continue analysing or you think that's already too much?

It depends on the analysis you will be doing with the data. If your analysis is going to be based on looking at peak calls, for example, I would go ahead and do your peak calling on the full data and see how affected that is by the differences in depths.

Btw, are the numbers you're showing the raw reads or the number of reads after mapping and removing duplicates?