Question

Why Does Chip-Seq Data Have Lots Of Reverse-Complement Pairs Of Reads?

7

Entering edit mode

13.2 years ago

Ryan Thompson ★ 3.7k

I've been creating autocorrelation plots for my ChIP-Seq data, and I'm seeing a peak at the read length, just as described here:

... as the quality of samples decrease, the signal from tags mapping as palindromes increases. For example, the input (mock IP) control sample below shows a strong spike at 23 bp, which corresponds to the length of sequence used to map that sample.

Having an "opposite strand" peak at the read length indicates that in many cases pairs of reads are mapping as approximate reverse complements of each other. That is, if a read maps at position X on the plus strand, then it is likely that another read will map at position X + read length on the minus strand. Here is an ASCII art diagram of what is happening:

       --------------------->
==========================================================
       <---------------------

Can anyone explain why many pairs of tags would map as reverse complements of each other? The data is single-end Illumina sequencing. The linked article calls this "the signal from tags mapping as palindromes". I suppose this means that a read could map as its own reverse complement, and that if many reads do so, they will produce this peak. However, I wrote a script to see if any of my reads are identical to their reverse complement, and it didn't find any.

##EDIT

This is not a plot of coverage depth vs genomic coordinate. This is a so-called "autocorrelation plot", which can be described as follows: For every distinct pair of reads in the data set, subtract their starting posisions. Split these differences into three groups: both reads on plus strand, both reads on minus strand, and one of each. Make a histogram of the resulting groups. The plot above shows the first two groups in blue, and the third group in red.

In a good sample, I expect a red peak centered at the footprint size of my protein of interest. For example, I am doing MNase-seq for studying histone modifications, so I expect a peak at around 150 bp, the approximate length of DNA in one nucleosome. I expect this peak to always occur in the same location regardless of the read length that was used during sequencing.

In contrast, I also see a peak in my data around 40 bp, which is exactly the read length used in the sequencing run. This peak occurs even in the control, in which no IP was performed at all and no significant peaks corresponding to nucleosome sizes are visible. Based purely on the fragmentation process and IP pulldown, I cannot imagine how this peak would occur. It seems like it must be an artifact of the sequencing process, or the wet-lab prep before sequencing.

So, to be clear, I understand why there would be a peak at the x-coordinate corresponding to the footprint size of my protein (i.e. fragment length of DNA). That's why I made the plots in the first place. What I don't understand is the other peak that corresponds to the read length.

chip-seq • 7.1k views

ADD COMMENT • link updated 12.6 years ago by Ying W ★ 4.3k • written 13.2 years ago by Ryan Thompson ★ 3.7k

Istvan Albert · Answer 1 · 2012-03-14

4

Entering edit mode

13.2 years ago

Chris Whelan ▴ 590

A recent post on the MACS mailing list talked about this:

http://groups.google.com/group/macs-announcement/msg/d6595465a1f9b212

The reason for the peak at read length is due to varying mappability along the genome. If a 23mer one on strand is uniquely mappable, the reverse complement 23mer on the other strand is also uniquely mappable, and therefore you're more likely to get reads mapping exactly there than at some other point where the mappability is unknown.

ADD COMMENT • link updated 12.6 years ago by Istvan Albert 102k • written 13.2 years ago by Chris Whelan ▴ 590

0

Entering edit mode

So, to put things in probabilistic terms, if N+ is the event that a read maps starting at position N on the plus strand and N- is the event that a read maps as the reverse-complement of the sequence starting at position N, then what this is saying is that P(N-|N+) > P(N-)?

ADD REPLY • link 13.2 years ago by Ryan Thompson ★ 3.7k

0

Entering edit mode

Yes, exactly, that's another way of saying it. I've found that seeing this peak at the read length as the only peak on this plot is a good test for ChIP-Seq experiments that fail to enrich for some reason (bad antibody, etc).

ADD REPLY • link 13.2 years ago by Chris Whelan ▴ 590

0

Entering edit mode

Or, yet another way to explain it is that the red (opposite-strand) peak in the image in the question is there for the same reason that there is a blue peak at zero.

ADD REPLY • link 13.2 years ago by Ryan Thompson ★ 3.7k

Istvan Albert · Answer 2 · 2012-03-14

2

Entering edit mode

13.2 years ago

Ying W ★ 4.3k

please read the original macs paper:

http://www.biomedcentral.com/content/pdf/gb-2008-9-9-r137.pdf

I think they made it pretty clear but the short answer is the protein you are interested in is sitting in that gapped part covered by less reads and based on the way sequencing and extension works you will get these reverse-complement pairs

ADD COMMENT • link updated 12.6 years ago by Istvan Albert 102k • written 13.2 years ago by Ying W ★ 4.3k

1

Entering edit mode

This is a peak occurring at exactly the read length. This peak is not correlated in any way with the size of the protein's footprint. In my samples, the read length is 40 bp and the fragment size is known a priori to be 150. I see peaks at both distances, except in the control where I only see a peak at 40 bp. I've edited my question to make this clearer.