I recently processed some ChIP-seq data that had quite a few issues. The run consisted of two ChIP replicates and two input controls. The samples were supposed to have equal numbers of reads, but the two input controls accounted for about 90% of the total. There were other problems as well: the GC content of the ChIP samples was much higher than the reference, and the FastQC histogram of per-read GC content wasn't even vaguely Gaussian shaped; it looked like several overlapping Gaussians. FastQC also flagged two over-represented sequences as TruSeq adapters.
So, recall that only 10% of the run, about 10 million reads, came from the ChIP samples. After aligning with bwa, removing duplicates with Picard, and filtering for properly paired reads with samtools, I was left with 2 million reads. That is very low sequencing depth for my genome: a total depth of about 2X, or roughly 1X for each ChIP sample.
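For reference, the filtering steps described above can be sketched as a shell pipeline. This is only an outline of one common way to run these tools; the filenames, thread count, and exact flags are my assumptions, not from the original post:

```shell
# Hypothetical filenames: ref.fa, chip_R1.fastq.gz, chip_R2.fastq.gz.
# Align paired-end reads with bwa mem and coordinate-sort the output.
bwa mem -t 8 ref.fa chip_R1.fastq.gz chip_R2.fastq.gz \
    | samtools sort -o chip.sorted.bam -

# Remove duplicates with Picard; the metrics file reports PERCENT_DUPLICATION.
picard MarkDuplicates I=chip.sorted.bam O=chip.dedup.bam \
    M=chip.dup_metrics.txt REMOVE_DUPLICATES=true

# Keep only properly paired reads (SAM flag 0x2) and index the result.
samtools view -b -f 2 -o chip.filtered.bam chip.dedup.bam
samtools index chip.filtered.bam
```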
The peaks themselves actually look somewhat reasonable. There are a few nice peaks in the kinds of places we expect, such as 5' regions.
Would you trust a run like this?
What general principles do you use to decide whether a run is acceptable or should be called bad? My gut feeling is that this run is bad and the data are unreliable, but I would rather not tell my wet-lab colleagues that my only evidence is a gut feeling.
I also think the coverage is much too low. Would the consensus be that two samples at a sequencing depth of 1X each is too low?
EDIT: Some more details about my run.
As a percentage of mapped reads, one ChIP replicate had 85% duplicates and the other 57%; I removed these with Picard. For the input controls, the duplication rates were 10% and 25%.
This is a good paper, and a key reference for you to read and understand. Average genome coverage is not the right metric to think about for ChIP-seq - the enrichment around peaks means you need many fewer reads to obtain useful signal. As a point of comparison, the ENCODE folks recommend 20M mapped reads for human, which is less than 1X coverage. Median enrichment around confident peaks is a more appropriate metric here.
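The arithmetic behind that comparison is easy to check. A minimal sketch, assuming 100 bp reads and a ~3.1 Gb human genome (both round numbers I am supplying, not figures from this thread):

```python
# Back-of-envelope average coverage for ENCODE's 20M mapped-read recommendation.
# Assumed round numbers: 100 bp reads, ~3.1 Gb human genome.
mapped_reads = 20_000_000
read_length_bp = 100
genome_size_bp = 3_100_000_000

coverage = mapped_reads * read_length_bp / genome_size_bp
print(f"~{coverage:.2f}X average coverage")  # prints "~0.65X average coverage"
```

So even the recommended read count gives well under 1X average coverage, which is why enrichment around peaks, not genome-wide depth, is the useful quantity.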
I would also suggest downloading a "known good" ChIP-seq experiment (from ENCODE or elsewhere) and calibrating your intuition based on that. In my experience you will almost always see somewhat strange things like biased GC content in the sample channel, as well as a bit of adapter contamination, due to the selection process and generally smaller library sizes. The key question is whether enough useful signal remains after these effects, though.
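One way to do that calibration, sketched below. The accession is a placeholder I made up; you would pick a real ChIP-seq file from the ENCODE portal yourself:

```shell
# ENCFF000XXX is a placeholder accession: browse https://www.encodeproject.org
# and substitute a real ChIP-seq FASTQ accession before running this.
wget "https://www.encodeproject.org/files/ENCFF000XXX/@@download/ENCFF000XXX.fastq.gz"

# Run the same QC you ran on your own data, so the comparison is like for like.
fastqc ENCFF000XXX.fastq.gz
```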
Thanks, that was very helpful.