I am updating the SICER algorithm ( https://github.com/endrebak/epic ) and have added paired end support. The original SICER does not support paired end reads, but for the single end case it reduces each read to a single coordinate, which is the start of the read plus half the fragment size.
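To make the single end reduction concrete, here is a minimal sketch of it. The function name and arguments are hypothetical, and the reverse-strand handling (mirroring the shift from the read end) is an assumption of mine, not something confirmed above.

```python
def single_end_point(start, end, strand, fragment_size):
    """Reduce a single end read to one coordinate.

    Forward reads: read start plus half the fragment size (as described above).
    Reverse reads: read end minus half the fragment size (assumed mirror case).
    """
    half = fragment_size // 2
    if strand == "+":
        return start + half
    return end - half


# Hypothetical 36 bp read with an assumed fragment size of 150 bp
print(single_end_point(100, 136, "+", 150))
print(single_end_point(100, 136, "-", 150))
```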
I have no paired end ChIP-Seq data of my own to test it on, so my paired end mode may be too naive. It reduces each read pair to a point by taking the leftmost and rightmost coordinates (across the starts and ends of both mates) and using their midpoint.
So for the (fake) read pair
chr7 20246668 20246669 chr7 20246693 20246694 U0 0 + +
the coordinate is
20246668 + (20246694 - 20246668)/2 = 20246681
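A minimal sketch of this midpoint rule, taking the midpoint as the leftmost coordinate plus half the span. It assumes the line is whitespace-separated with the columns chrom1, start1, end1, chrom2, start2, end2 followed by extra fields; the function name is hypothetical and this is not necessarily how epic implements it.

```python
def pair_midpoint(bedpe_line):
    """Return (chromosome, midpoint) for one read pair."""
    fields = bedpe_line.split()
    chrom1, start1, end1 = fields[0], int(fields[1]), int(fields[2])
    chrom2, start2, end2 = fields[3], int(fields[4]), int(fields[5])
    if chrom1 != chrom2:
        raise ValueError("mates map to different chromosomes")

    # Leftmost and rightmost of all four coordinates, regardless of mate order
    coords = (start1, end1, start2, end2)
    leftmost, rightmost = min(coords), max(coords)

    # Integer midpoint of the span covered by the pair
    return chrom1, leftmost + (rightmost - leftmost) // 2


line = "chr7 20246668 20246669 chr7 20246693 20246694 U0 0 + +"
print(pair_midpoint(line))  # ('chr7', 20246681)
```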
This might lead to a problem, though:
If the two mates are very far apart, the midpoint might fall in a bad (i.e. heterochromatic) or uninteresting (i.e. blacklisted) region.
What is the best way to solve this?
I can think of two solutions, but do not understand all their up- and downsides:
1) Discard read pairs more than, say, 100 bp apart
2) Treat each mate in a pair as an individual read
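The two options can be sketched as follows. This is only an illustration: pairs are represented as hypothetical tuples of (chrom, start1, end1, start2, end2), and "apart" is assumed to mean the distance between the leftmost and rightmost coordinate of the pair.

```python
def pair_span(pair):
    """Distance between the leftmost and rightmost coordinate of a pair."""
    _, start1, end1, start2, end2 = pair
    coords = (start1, end1, start2, end2)
    return max(coords) - min(coords)


def keep_close_pairs(pairs, max_span=100):
    """Option 1: discard pairs whose span exceeds max_span."""
    return [pair for pair in pairs if pair_span(pair) <= max_span]


def split_into_mates(pairs):
    """Option 2: treat each mate as an individual read (doubles the records)."""
    mates = []
    for chrom, start1, end1, start2, end2 in pairs:
        mates.append((chrom, start1, end1))
        mates.append((chrom, start2, end2))
    return mates


pairs = [
    ("chr7", 20246668, 20246669, 20246693, 20246694),  # span of 26 bp
    ("chr7", 1000, 1001, 5000, 5001),                  # span of 4001 bp
]
print(keep_close_pairs(pairs))       # only the first pair survives
print(len(split_into_mates(pairs)))  # 4 individual reads
```

Option 1 shrinks the input, while option 2 doubles the number of records to process.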
I lean towards the first solution, since the paired end libraries my users work with seem to contain much more data than a typical single end library, and doubling that amount would mean a lot of pain (waiting) for little gain (better results).
Good points. Perhaps I should let users choose between several options:
In addition, I'll have a cutoff that users can set themselves, or they can use Carlo Yague's suggestion.
Implementing all of them should be trivial, but most users will probably be confused about which one to choose.
That is always the question: how much flexibility should you add at the cost of user-friendliness? What is your target group? What level of experience do you expect them to have? How often do you believe they will run the software? And how central is that part of the workflow to the overall output quality?
A trained group of users running the software frequently would prefer having options, whereas non-experts using it a few times a year would have trouble remembering what the options mean.
Ideally, you would run some tests to see how much various size cutoffs matter for finding validated (or motif-containing) peaks, in terms of % true positives found and distance from peak to motif. If the cutoff is of minor importance or reduces output quality, then leave it out. But I know that this is a lot of work just to make design choices... :-P
For the sake of simplicity, I think I would leave out the options to use the mate coordinates, unless you have seen scenarios where the procedure would benefit from them.
I agree, IMO only the midpoints make sense.