Question

ChIP-Seq: When reducing paired end reads to one coordinate, should I use the midpoint?

3

Entering edit mode

8.3 years ago

endrebak ▴ 970

I am updating the SICER algorithm ( https://github.com/endrebak/epic ) and have added paired end support. The original SICER does not support paired end reads, but for the single end case it reduces each read to a single coordinate, which is the start of the read plus half the fragment size.

I have no paired end ChIP-Seq data myself to try it on, so I might have implemented a too naive paired end mode. It reduces each paired end read to a point by taking the leftmost and rightmost coordinate of (both the starts and ends of) a pair and finding its midpoint.

So for the (fake) read pair

chr7    20246668    20246669    chr7    20246693    20246694    U0  0   +   +

the coordinate is

20246694 + (20246694-20246668)/2 = 20 246 707

It seems like this might lead to a problem though:

If the two mates are very far apart, the midpoint might be in a bad (ie heterochromatic) or uninteresting (ie blacklisted) region.

What is the best way to solve this?

I can think of two solutions, but do not understand all their up- and downsides:

1) Discard read pairs more than say 100 bp apart 2) Treat each mate in a pair as an individual read

I lean towards the first solution since it seems like the paired end libraries my users use contains much much more data than a typical single end library, and doubling that amount seems like it would be a lot of pain (waiting) for little gain (better results).

ChIP-Seq epic SICER • 2.8k views

ADD COMMENT • link updated 8.3 years ago by Mads Lerdrup ▴ 460 • written 8.3 years ago by endrebak ▴ 970

2

Entering edit mode

8.3 years ago

endrebak ▴ 970

Here is how someone added pe support to SICER previously:

For paired-end reads, the HiChIP pipeline keeps the first end and extends by the fragment length estimated from mapping positions of the two ends, rather than by the average fragment length of the library. Given the variability of fragment lengths across a complex genome like human genome, the use of actual coordinates of mapped pairs is expected to achieve better resolution in signal visualization. The bed file is then used to generate a bedGraph file by the genomeCoverageBed command from BEDTools.

Since they probably did not change the code, but rather tried to work around it, this might be just the locally optimal solution.

Paper: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-280

ADD COMMENT • link 8.3 years ago by endrebak ▴ 970

score 5 · Accepted Answer · 2016-08-11

What a well-timed and IMHO interesting post :-)

I am working on adding paired end (PE) support to EaSeq, and I am going through exactly the same considerations and do not have any prior experience with PE as we have not used it for ChIP at our institute (yet). EaSeq also works with a single integer coordinate internally to allow many data sets to be loaded in memory simultaneously.

Back to your question: Midpoint sounds best to me, but I am not entirely sure why you would like to filter the reads as you propose. As I understand it then some people actually use PE ChIP-seq to map some reads to areas that are normally hard to map (e.g. heterochromatic regions). So at first thought I would leave those within the data set, but I would be very happy to learn from other opinions on this.

Moreover, for many ChIP-seq experiments you would probably discard a large fraction of the reads by setting the cutoff at 100 bp. I would like to know when/why you would prefer keeping/discarding these reads - in many cases additional (but larger) pairs would give stronger statistics, but reduce resolution. However, the question is whether this is a real-life issue for most applications? From my own experience the app. 20 bp resolution that I typically get from my peakfinding algorithm (see e.g. figure 5C here) is sufficient to get really nice de novo motif discovery with MEME and good central motif distribution with Centrimo - even when working on bulk single read ChIP-seq with an experimentally determined main fragment lenght of >100 bp.

Regarding the blacklisted regions, why not just remove read pairs that overlap with these in the upstream process? Although, I can also see that this might create a strange 'event horizon' in the vicinity of blacklisted regions as an increasing fraction of the pairs are discarded. It might be preferable to keep peak-finding simple and filter peaks from blacklisted regions later on?

For EaSeq, I think that the best solution would be to rewrite the central part of the code to accomodate two (or four) coordinates when that is prefered over an efficient memory footprint. That would 1) eliminate the annoying rounding error for odd sized fragments (and potentially a 1 bp shift when exporting data again) as well as allow for downstream visualization and filtering of reads based on distance between the pairs.

score 4 · Accepted Answer · 2016-08-11

Option #2 looks bad to me, exactly for the reasons you pointed out.

Concerning option #1, the cutoff (100 bp in your example, which is very stringent IMO) could be set independantly for each library or each experiment. Indeed, the expected fragment size can vary depending of experimental procedure (sonication, size selection, ...) For instance :

cutoff = median(fragments_size) + 2*sd(fragments_size)

Finally, even if you don't own PE data, you could download some on SRA and test the robustess of your method on various datasets.