I recently processed some ChIP-seq data that had quite a few issues. The run consisted of two ChIP replicates and two input controls. The samples were supposed to have equal numbers of reads, but the two input controls accounted for about 90% of the total. There were other problems as well: the GC content of the ChIP samples was much higher than the reference, and the FastQC histogram of per-read GC content wasn't even vaguely Gaussian shaped; it looked like several overlapping Gaussians. FastQC also flagged two over-represented sequences as TruSeq adapters.
So, recall that only 10% of the run, about 10 million reads, came from the ChIP samples. After aligning with bwa, removing duplicates with Picard, and filtering for properly paired reads with samtools, I was left with 2 million reads. That is very low sequencing depth for my genome: a total depth of about 2X, or roughly 1X for each ChIP sample.
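For reference, the filtering steps described above can be sketched as a shell pipeline. This is only an outline of one common way to run these tools; the filenames, thread count, and exact flags are my assumptions, not from the original post:

```shell
# Hypothetical filenames: ref.fa, chip_R1.fastq.gz, chip_R2.fastq.gz.
# Align paired-end reads with bwa mem and coordinate-sort the output.
bwa mem -t 8 ref.fa chip_R1.fastq.gz chip_R2.fastq.gz \
    | samtools sort -o chip.sorted.bam -

# Remove duplicates with Picard; the metrics file reports PERCENT_DUPLICATION.
picard MarkDuplicates I=chip.sorted.bam O=chip.dedup.bam \
    M=chip.dup_metrics.txt REMOVE_DUPLICATES=true

# Keep only properly paired reads (SAM flag 0x2) and index the result.
samtools view -b -f 2 -o chip.filtered.bam chip.dedup.bam
samtools index chip.filtered.bam
```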
The peaks themselves actually look somewhat reasonable. There are a few nice peaks in the kinds of places we expect, such as 5' regions.
Would you trust a run like this?
What general principles do you use to decide whether a run is acceptable or should be called bad? My gut feeling is that this run is bad and the data are unreliable, but I would rather not tell my wet-lab colleagues that my only evidence is a gut feeling.
I also think the coverage is much too low. Would the consensus be that two samples at a sequencing depth of 1X each is too low?
EDIT: Some more details about my run.
As a percentage of mapped reads, one ChIP replicate had 85% duplicates and the other 57%; I removed these with Picard. For the input controls, the duplication rates were 10% and 25%.
This is a good paper, and a key reference for you to read and understand. Average genome coverage is not the right metric to think about for ChIP-seq - the enrichment around peaks means you need many fewer reads to obtain useful signal. As a point of comparison, the ENCODE folks recommend 20M mapped reads for human, which is less than 1X coverage. Median enrichment around confident peaks is a more appropriate metric here.
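The arithmetic behind that comparison is easy to check. A minimal sketch, assuming 100 bp reads and a ~3.1 Gb human genome (both round numbers I am supplying, not figures from this thread):

```python
# Back-of-envelope average coverage for ENCODE's 20M mapped-read recommendation.
# Assumed round numbers: 100 bp reads, ~3.1 Gb human genome.
mapped_reads = 20_000_000
read_length_bp = 100
genome_size_bp = 3_100_000_000

coverage = mapped_reads * read_length_bp / genome_size_bp
print(f"~{coverage:.2f}X average coverage")  # prints "~0.65X average coverage"
```

So even the recommended read count gives well under 1X average coverage, which is why enrichment around peaks, not genome-wide depth, is the useful quantity.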
I would also suggest downloading a "known good" ChIP-seq experiment (from ENCODE or elsewhere) and calibrating your intuition based on that. In my experience you will almost always see somewhat strange things like biased GC content in the sample channel, as well as a bit of adapter contamination, due to the selection process and generally smaller library sizes. The key question is whether enough useful signal remains after these effects, though.
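One way to do that calibration, sketched below. The accession is a placeholder I made up; you would pick a real ChIP-seq file from the ENCODE portal yourself:

```shell
# ENCFF000XXX is a placeholder accession: browse https://www.encodeproject.org
# and substitute a real ChIP-seq FASTQ accession before running this.
wget "https://www.encodeproject.org/files/ENCFF000XXX/@@download/ENCFF000XXX.fastq.gz"

# Run the same QC you ran on your own data, so the comparison is like for like.
fastqc ENCFF000XXX.fastq.gz
```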
Thanks, that was very helpful.