Are there any chip-seq peak callers able to find cases where the region under a peak has been split by an indel between the reference and the chipped sample?
For example, suppose one has done ChIP-seq on individual 1. If we were to resequence individual 1, assemble its genome, and map the ChIP-seq reads straight to that assembly, there would be a peak like this:
Now the reference genome has an insertion with respect to the chipped sample, so when one maps chip-seq reads from individual 1 to the reference genome assembly, the peak looks like this:
Peak callers generally treat reads as tags. They know nothing about indels in the reads themselves. I might be thinking about this wrong, but for a regulatory region disrupted (biologically) by an indel, I would think that the most likely result that you would see in your data is a lack of reads (no peak).
@Aaron Statham: I would like to use a peak caller that identifies reads overlapping across indels, even if it is after a mapping process that doesn't take that into account.
@Sean Davis: any of these events is arguably interesting to study, since the regulatory region may have been disrupted by the indel in this individual but still be present in other individuals.
We wrote our own simplistic peak caller so that we could join neighbouring peaks together into a single region of interest. See my answer in this thread. Depending on the gap size in your individuals, and on whether you need to distinguish nearby peaks from each other, a similar approach might work.
Not specifically. We just defined a peak as a region that is above a defined background level and between a minimum and maximum length, and then merged peaks that lie within a certain range of each other.
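To make that description concrete, here is a minimal sketch of such a threshold-and-merge caller. It is not our actual code: the function name `call_peaks`, its parameters, the per-base coverage list as input, and the choice to filter by length before merging are all assumptions made for illustration.

```python
def call_peaks(coverage, background, min_len, max_len, merge_gap):
    """Hypothetical threshold-and-merge peak caller (illustration only).

    coverage   : list of per-base read depths for one chromosome
    background : depth that a position must exceed to count as signal
    min_len    : shortest run (in bases) accepted as a peak
    max_len    : longest run (in bases) accepted as a peak
    merge_gap  : peaks separated by at most this many bases are joined
    Returns a list of half-open (start, end) intervals.
    """
    # Step 1: collect contiguous runs of positions above background.
    runs = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth > background:
            if start is None:
                start = pos
        elif start is not None:
            runs.append((start, pos))
            start = None
    if start is not None:
        runs.append((start, len(coverage)))

    # Step 2: keep only runs whose length is within [min_len, max_len].
    peaks = [(s, e) for s, e in runs if min_len <= e - s <= max_len]

    # Step 3: merge peaks that lie within merge_gap of the previous one.
    merged = []
    for s, e in peaks:
        if merged and s - merged[-1][1] <= merge_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


coverage = [0, 0, 5, 6, 7, 0, 0, 6, 6, 6, 0, 0, 0, 0, 0, 0, 8, 8, 8, 0]
peaks = call_peaks(coverage, background=2, min_len=2, max_len=10, merge_gap=3)
print(peaks)
```

With a `merge_gap` larger than the expected indel, a peak split by an insertion in the reference would come out as one region, at the cost of also joining genuinely separate neighbouring peaks.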
Unless your mapper aligns reads across the indel (similarly to RNA splicing), I don't know how you would differentiate a deletion in your sample from two separate peaks... A couple of ideas off the top of my head:
Take all genomic regions where two (or more?) peaks are very close together, then use these regions as a reference against which you remap your unaligned reads with a more sensitive aligner, BLAT maybe?
I would think that the distribution of + and - strand reads at such indel peaks would look different to that of a normal peak; perhaps something could be exploited there (only if the indel is larger than the ChIP fragment size, I guess).
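The first idea starts from a list of regions where peaks sit suspiciously close together. A rough sketch of that selection step follows; the function name `candidate_split_regions`, the `(chrom, start, end)` tuple format, and the `max_gap` and `flank` parameters are assumptions for illustration, and the remapping itself is left to whatever aligner you choose.

```python
def candidate_split_regions(peaks, max_gap, flank=0):
    """Find pairs of adjacent peaks separated by a short gap (illustration only).

    peaks   : iterable of (chrom, start, end) tuples, half-open coordinates
    max_gap : largest gap (in bases) between two peaks to report
    flank   : extra bases added on each side of the reported region
    Returns (chrom, region_start, region_end, gap) tuples; each region
    spans both peaks and could be extracted as a small remapping reference.
    """
    # Group intervals by chromosome so only true neighbours are compared.
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))

    regions = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        # Compare each peak with the next one along the chromosome.
        for (s1, e1), (s2, e2) in zip(intervals, intervals[1:]):
            gap = s2 - e1
            if 0 < gap <= max_gap:
                regions.append((chrom, max(0, s1 - flank), e2 + flank, gap))
    return regions


example_peaks = [
    ("chr1", 100, 300),
    ("chr1", 350, 600),
    ("chr1", 5000, 5200),
    ("chr2", 10, 200),
]
print(candidate_split_regions(example_peaks, max_gap=100))
```

The reported gap is also a first guess at the size of the insertion in the reference, if the pair really is one split peak.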
If you are serious about finding these kinds of events (this is a rather esoteric problem), I would create a simulated data set and see which strategy performs best when you know for certain that there is an indel (ROC curves and the like).
For small indels such as those picked up by aligners, I don't think that most ChIP-seq software will even produce a split peak (because each read is treated as a simple tag). I think that indels large enough to create two peaks where there was only one are probably relatively uncommon, at least in humans (although indels, in general, are a significant source of variation). But if you are really interested in finding these things as an academic exercise, I would think that the best bet is to use paired-end sequencing for your ChIP-seq data generation. Your ChIP-seq peak-finding can then take advantage of both ends of the fragments (I'm not sure which ChIP-seq software can use paired ends, but I don't think it would be hard to patch something together). You can also apply typical structural variant software and ideas to find putative regions of structural variation. Layer your peak calls on your structural variation findings and you have your answers.
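As a rough illustration of the paired-end idea: if the reference carries an insertion relative to the sample, read pairs spanning it should map further apart than the bulk of the library. The sketch below flags such pairs from a plain list of template lengths (for example the TLEN column of a SAM file). The function name `discordant_pairs`, the median plus MAD cutoff, and the default of 5 MADs are assumptions for illustration, not what any particular structural variant caller does.

```python
import statistics


def discordant_pairs(template_lengths, n_mads=5.0):
    """Flag read pairs that map unusually far apart (illustration only).

    template_lengths : signed or unsigned template lengths, one per pair
    n_mads           : how many median absolute deviations above the
                       median a pair must be to get flagged
    Returns the indices of the flagged pairs.
    """
    sizes = [abs(t) for t in template_lengths]
    median = statistics.median(sizes)
    # Median absolute deviation: a spread estimate robust to outliers.
    mad = statistics.median([abs(s - median) for s in sizes])
    # Guard against a zero MAD when most pairs have identical lengths.
    cutoff = median + n_mads * max(mad, 1.0)
    return [i for i, s in enumerate(sizes) if s > cutoff]


lengths = [200, 210, 190, 205, 195, 200, 900, -880]
flagged = discordant_pairs(lengths)
print(flagged)
```

Clusters of flagged pairs that sit between two neighbouring peaks would be the regions to look at first.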
Do you expect a large portion of your peaks to have this problem or is this a question of a single region of interest?
Are you using a mapper which identifies/maps reads across these indels?
@Aaron Statham: I would like to use one that does that, even if it is after the mapping process.