I have received data where one (or more) of the ChIP-Seq steps has failed:
There was a large difference between the number of reads generated for the IP samples and for the Input samples (3-5 million reads for IP, 20-30 million reads for Input).
The sequencing was paired-end. For the IP samples, R1 had a large amount of adapters (around 30% of the data did not pass the filtering stage, mainly due to adapter content). R2 did not have such a large percentage of adapters, but around 30% of the reads did not pass the filtering stage due to bad quality. Input samples did not have this problem.
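A rough way to look closer at the adapter content in R1 is to check where the adapter starts in each read: when a fragment is shorter than the read length, the read runs into the adapter, so the adapter start position approximates the insert length. The sketch below is only an illustration under assumptions: it looks for an exact match (no mismatches) to the 13-base prefix shared by the TruSeq adapters, takes read sequences as plain strings, and the function names and the `dimer_max` cutoff are hypothetical choices, not part of any existing tool.

```python
from collections import Counter

# First 13 bases shared by the TruSeq read 1 and read 2 adapters.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def adapter_start_counts(reads, adapter=ADAPTER_PREFIX):
    """Count reads by the position at which the adapter starts.

    The start position approximates the insert length for reads that
    ran through a short insert into the adapter. Reads without an
    exact adapter match are counted under the key None.
    """
    counts = Counter()
    for seq in reads:
        pos = seq.find(adapter)
        counts[pos if pos >= 0 else None] += 1
    return counts


def summarize(counts, dimer_max=5):
    """Summarize adapter content; dimer_max is an arbitrary cutoff (assumption)."""
    total = sum(counts.values())
    if total == 0:
        return {"total": 0, "frac_with_adapter": 0.0, "frac_dimer_like": 0.0}
    with_adapter = total - counts.get(None, 0)
    dimer_like = sum(
        n for pos, n in counts.items() if pos is not None and pos <= dimer_max
    )
    return {
        "total": total,
        "frac_with_adapter": with_adapter / total,
        "frac_dimer_like": dimer_like / total,
    }


if __name__ == "__main__":
    # Made-up toy reads, for illustration only.
    toy_reads = [
        "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",  # adapter at position 0
        "ACGTACGTAC" + "AGATCGGAAGAGCACACG",   # 10-base insert, then adapter
        "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG",  # no adapter
    ]
    print(summarize(adapter_start_counts(toy_reads)))
```

Many reads with the adapter at or near position 0 would be more consistent with adapter dimers, whereas a spread of start positions would be more consistent with genuinely short fragments. Neither observation is conclusive on its own.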
When aligning, no more than 20% of the R2 reads of the IP samples aligned uniquely. (For R1 of the IP samples, around 50-70% of reads aligned uniquely; for the Input samples, the rate was about 75%.)
In the end, even for R1, only about 500K to 1 million reads are uniquely aligned per IP sample.
So, basically there is no data to work with.
What I would like to do, however, is to understand whether the failure occurred (1) during the ChIP stage, (2) during transit of the samples (the DNA was in transit for around 10 days), or (3) during library preparation.
Can you suggest how this could be checked, if it can be checked at all?
Preparing the libraries anew / resequencing is currently not possible, unfortunately.
What are the read lengths of R1 and R2 (it sounds like they were unequal), and what was the average size of the fragments that went into the library prep? Did you align R1 and R2 separately? What about the reads that did align: can you call peaks with them, and do the results look at least somewhat normal in a genome browser (i.e., can you see peaks)? Ten days is a long time; was the material kept at 4°C or below, or at room temperature? Did the libraries look good on a Bioanalyzer after library prep?
It seems that he not only aligned R1 and R2 separately, but also did the quality and adapter trimming separately for R1 and R2.
Yes. TruSeq adapters were used, and these are different for R1 and R2.
The read length of both R1 and R2 is 43 bp. I aligned R1 and R2 separately (when I tried paired-end alignment, the percentage of unaligned reads was slightly higher than the sum of the percentages of unaligned reads for R1 and R2 aligned separately).
I have called peaks for some samples; from what I see so far, there are about 100 peaks, and the maximum pileup for a peak is about 9 reads. So the FRiP (fraction of reads in peaks) is practically nonexistent :)
It is unclear, and by now impossible to find out, whether the samples were actually kept in a cool environment.
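To put a number on the FRiP, the calculation itself is simple: count the reads whose position falls inside any called peak and divide by the total number of aligned reads. The following is a minimal sketch, assuming reads are reduced to a single `(chrom, pos)` coordinate and peaks are given as BED-like 0-based half-open `(chrom, start, end)` intervals; the function names are made up for illustration, and real pipelines typically compute this from BAM and peak files with dedicated tools.

```python
from bisect import bisect_right
from collections import defaultdict


def build_peak_index(peaks):
    """Merge overlapping peaks per chromosome into sorted start/end lists."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                # Overlapping or adjacent: extend the previous interval.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def frip(read_positions, peaks):
    """Fraction of reads whose position lies inside any peak."""
    index = build_peak_index(peaks)
    total = 0
    in_peaks = 0
    for chrom, pos in read_positions:
        total += 1
        if chrom not in index:
            continue
        starts, ends = index[chrom]
        # Last merged peak that starts at or before pos.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < ends[i]:
            in_peaks += 1
    return in_peaks / total if total else 0.0


if __name__ == "__main__":
    # Made-up toy data, for illustration only.
    toy_peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
    toy_reads = [("chr1", 100), ("chr1", 299), ("chr1", 300),
                 ("chr2", 10), ("chr3", 5)]
    print(frip(toy_reads, toy_peaks))
```

Merging overlapping peaks first ensures that a read is counted at most once, even if the peak caller reported overlapping intervals.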
I do not yet understand the TapeStation graphs myself, but I have now been told by the sequencing center that the libraries contained fragments that were "too long" (I need to find out how long exactly). I will update when this is clearer.
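If a paired-end alignment is available (one was attempted above), the fragment lengths of the pairs that did align can be read directly from the TLEN column of the SAM output, which gives an independent look at the "too long" remark. Below is a small sketch under assumptions: it reads uncompressed SAM text lines, keeps only properly paired primary alignments, and counts each pair once via the first-in-pair flag; the function name is hypothetical. Note that it only describes fragments that aligned as proper pairs, so it can underrepresent fragments longer than the aligner's maximum insert size.

```python
import statistics

# SAM flag bits (from the SAM specification).
PROPER_PAIR = 0x2
FIRST_IN_PAIR = 0x40
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def fragment_lengths(sam_lines):
    """Collect |TLEN| once per properly paired, primary alignment pair."""
    lengths = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        tlen = int(fields[8])  # column 9: signed template length
        if not flag & PROPER_PAIR or not flag & FIRST_IN_PAIR:
            continue
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        if tlen != 0:
            lengths.append(abs(tlen))
    return lengths


if __name__ == "__main__":
    # Made-up toy records, for illustration only.
    seq, qual = "A" * 43, "I" * 43
    toy = [
        "@HD\tVN:1.6",
        f"r1\t99\tchr1\t100\t60\t43M\t=\t300\t243\t{seq}\t{qual}",
        f"r1\t147\tchr1\t300\t60\t43M\t=\t100\t-243\t{seq}\t{qual}",
        f"r2\t83\tchr1\t500\t60\t43M\t=\t400\t-143\t{seq}\t{qual}",
    ]
    sizes = fragment_lengths(toy)
    print(len(sizes), statistics.median(sizes))
```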
I mostly work with single-end ChIP-seq, so I can't say much about some of your observations. For the Input getting a lot more reads than the IP, did you pool them on a single lane or sequence them separately? If pooled, did you just combine equal volumes or did you measure the concentrations and pool based on that?
High rates of duplication can indicate too many PCR cycles. If you needed to do a lot of cycles because of a low DNA concentration after the IP, then you may need to optimize your shearing or ChIP stage, or just start with a larger number of cells or more tissue.
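To check the duplication point on the reads that did align, one can count how many reads share the same position and strand. The sketch below is a minimal illustration, assuming each uniquely aligned read has been reduced to a `(chrom, five_prime_pos, strand)` tuple; the function name is hypothetical. Besides the duplication rate, it reports two complexity measures in the style of the ENCODE metrics: NRF (distinct positions divided by total reads) and PBC1 (positions seen exactly once divided by distinct positions).

```python
from collections import Counter


def complexity_metrics(alignments):
    """Duplication rate, NRF and PBC1 from (chrom, pos, strand) tuples."""
    counts = Counter(alignments)
    total = sum(counts.values())
    distinct = len(counts)
    if total == 0:
        return {"total": 0, "duplication_rate": 0.0, "nrf": 0.0, "pbc1": 0.0}
    singletons = sum(1 for n in counts.values() if n == 1)
    return {
        "total": total,
        # Fraction of reads that repeat an already seen position/strand.
        "duplication_rate": 1 - distinct / total,
        # Non-redundant fraction: distinct positions over all reads.
        "nrf": distinct / total,
        # Positions with exactly one read over all distinct positions.
        "pbc1": singletons / distinct,
    }


if __name__ == "__main__":
    # Made-up toy alignments, for illustration only.
    toy = [("chr1", 100, "+")] * 3 + [
        ("chr1", 200, "+"),
        ("chr1", 200, "-"),
        ("chr2", 5, "+"),
    ]
    print(complexity_metrics(toy))
```

With only 500K to 1 million aligned reads per sample, these values say more about library complexity than about sequencing depth, so a high duplication rate at that depth would suggest that very little distinct material went into the PCR.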
The samples were multiplexed and sequenced on the same lane. I do know that the concentrations were measured, and combined based on them. This is why the difference between IP and Input is surprising. Thanks.
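One thing that may be worth checking, purely as a hypothesis: if the pooling was based on mass concentration (ng/µl) rather than molarity, which is not stated here, libraries with longer fragments would contribute fewer molecules for the same mass. The sketch below uses the common approximation of 660 g/mol per base pair of double-stranded DNA; the function names and all numbers in the example are made up for illustration and are not measurements from these samples.

```python
def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp):
    """Convert a dsDNA mass concentration (ng/ul) to nanomolar.

    Assumes an average mass of 660 g/mol per base pair.
    """
    if mean_fragment_bp <= 0:
        raise ValueError("mean_fragment_bp must be positive")
    return conc_ng_per_ul / (660.0 * mean_fragment_bp) * 1e6


def expected_molecule_share(libraries):
    """Share of molecules per library in a pool.

    libraries: dict of name -> (conc_ng_per_ul, mean_fragment_bp, volume_ul).
    """
    amounts = {
        name: library_molarity_nm(conc, bp) * vol
        for name, (conc, bp, vol) in libraries.items()
    }
    total = sum(amounts.values())
    return {name: amount / total for name, amount in amounts.items()}


if __name__ == "__main__":
    # Hypothetical pool: equal mass and volume, different fragment lengths.
    pool = {"ip": (2.0, 600, 5.0), "input": (2.0, 300, 5.0)}
    print(expected_molecule_share(pool))
```

The share of molecules in the pool is only a first approximation of the share of reads, since the two can diverge for other reasons as well.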