In a recent publication we found an RNA binding site in an Alu repeat. The signal was lost when we turned off the multiple-mapping parameter of the mapping algorithm.
Thus, I recommend allowing multiple mappings when analyzing ChIPseq data. You might find fancy stuff! :)
I will admit to having only limited knowledge in this area. I have some concerns about allowing reads that map to multiple regions. Let's say we had two regions, A and B, with identical sequence. If we align and multi-mapped reads are placed randomly, these regions will get reads in a 50-50 ratio. Suppose that in the original data set only A has a TF binding site and B has none at all. In that case, we might create a signal for a false binding site at B.
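To make the concern concrete, here is a minimal sketch of random placement. It is not what any particular aligner does internally; the function name `assign_multireads` and the loci labels are made up for illustration, and the only assumption is that each multi-mapped read is placed on one candidate locus with equal probability.

```python
import random

def assign_multireads(n_reads, loci=("A", "B"), seed=0):
    """Place each multi-mapped read on one candidate locus, chosen uniformly."""
    rng = random.Random(seed)  # fixed seed so the toy example is repeatable
    counts = {locus: 0 for locus in loci}
    for _ in range(n_reads):
        counts[rng.choice(loci)] += 1
    return counts

# All 10,000 reads truly come from A, but the mapper cannot know that.
counts = assign_multireads(10_000)
print(counts)
```

Both loci end up with roughly half of the reads, so B shows about as much signal as A even though nothing binds there.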
One solution is to attribute the multi-mapped reads to A and B in a ratio consistent with the unique reads mapped to each. This means they won't be under- or over-populated with reads relative to their positions in the genome, but it still has the problem that one might be imposing a pattern that is not really there.
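A rough sketch of that proportional idea, under my own assumptions: `unique_counts` holds the number of uniquely mapped reads near each candidate locus, and the multi-mapped reads are split fractionally in that ratio. The function name and the fallback to an even split when there are no unique reads are illustrative choices, not taken from any specific tool.

```python
def allocate_multireads(unique_counts, n_multi):
    """Split n_multi reads across loci in proportion to their unique-read counts."""
    total = sum(unique_counts.values())
    if total == 0:
        # No unique evidence anywhere: fall back to an even split.
        share = n_multi / len(unique_counts)
        return {locus: share for locus in unique_counts}
    return {locus: n_multi * count / total
            for locus, count in unique_counts.items()}

# 90 unique reads near A, 10 near B, and 200 reads that map to both.
allocation = allocate_multireads({"A": 90, "B": 10}, 200)
print(allocation)  # {'A': 180.0, 'B': 20.0}
```

Note that B still receives reads here, and the split simply mirrors whatever the unique reads already showed, which is the "imposing a pattern" worry.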
Actually, that is exactly the problem. You cannot say where the 'real' origin of the read is, so most researchers delete these reads. But if the read really comes from a repeat-associated region, you will miss it for sure. You miss ALL reads coming from repeat-associated regions, even though they might be real ones, like in our study.
Your post holds a lot of interest for me as I am working on a project where we absolutely expect that the repetitive elements in the genome will have some biological function. I have gone back and forth on what to do about the repetitive elements. So, it's intriguing that you were able to tease out some signal from the noise in multiple mapped reads.
I haven't 100% decided yet, but I think based on your post, I might make some modifications to my pipeline.
To be honest with you, this is one of those things that never ceases to irk me: people simply ignoring repetitive regions in the genome when producing large-scale reports on the potential functionality of the genome. Glad to see papers that point this out.
We used the default -M parameter of segemehl. That means that all seeds with more than 100 matches were discarded. But of course that value can be adjusted if you want to find more hits.
By default we always map multi-reads. However, one should take the following considerations into account:
It is important to use an index that contains all sequences, even if
they are not properly placed on the assembly. For human and mouse,
those are the _random chromosomes. If not all sequences are included,
the random placement of multi-reads will be biased towards those in
the assembly, thus creating artificial enrichments. Because some
reads will be considered unique by the aligner if these extra
regions are not included in the index, it is generally good policy to
always include those regions.
Removal of duplicates is complicated. We don't do it. If you remove
duplicates, you are biasing your results towards repetitive regions.
Imagine a read that maps to two distinct positions and a read that is
unique. Now, assume that these two reads are duplicated. By random
placement of the multi-read, each of the repetitive regions will get
one copy. For the unique read, the two duplicates will map to the
same position. A duplicate-removal algorithm will flag only the
unique read as a duplicate, because such algorithms are based on
position. On the other hand, if reads with identical sequences are
removed before mapping, repetitive regions have a higher chance of
producing identical reads. A duplicate removal of this sort will
target repetitive regions preferentially, especially if they are
highly repetitive.
To reduce the duplication rates we use a low number of PCR cycles for
the sample preparation.
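The two scenarios above can be sketched with toy data. This is only an illustration: the position labels, the sequences, and both helper functions are invented, and real duplicate-removal tools also consider strand and mate position.

```python
def dedup_by_position(alignments):
    """Keep one read per mapped position (position-based duplicate removal)."""
    return sorted(set(alignments))

def dedup_by_sequence(reads):
    """Keep one read per distinct sequence (collapsing before mapping)."""
    return sorted(set(reads))

# Position-based: the unique read and its PCR duplicate both land on "U",
# while the multi-read and its duplicate were placed on the two repeat copies.
alignments = ["U", "U", "repeat_copy_1", "repeat_copy_2"]
kept_positions = dedup_by_position(alignments)
print(kept_positions)  # ['U', 'repeat_copy_1', 'repeat_copy_2']

# Sequence-based: two independent fragments from different repeat copies
# share the sequence "ACGT" and are collapsed as if they were duplicates.
reads = ["ACGT", "ACGT", "TTGA"]
kept_sequences = dedup_by_sequence(reads)
print(kept_sequences)  # ['ACGT', 'TTGA']
```

In the first case the unique locus loses its duplicate while the repeat keeps both copies; in the second case the repeat loses a genuinely independent read.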
Some tools, especially peak callers, use the so-called 'effective
genome size', which is the size of the genome after you remove
duplicated (non-uniquely mappable) regions. If you are planning to do
peak calling, you need to update these values rather than use the
default ones. Furthermore, it is important to tell the peak caller
not to remove duplicates.
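As a rough illustration of updating that value: when multi-reads are kept, one common approximation (my assumption here, not something stated above) is to take the number of non-N bases in every sequence of the index, including the _random ones. The helper name and the toy sequences below are invented.

```python
def count_non_n_bases(sequences):
    """Count bases that are not N across an iterable of sequence strings."""
    return sum(len(seq) - seq.upper().count("N") for seq in sequences)

# Toy index with one placed and one unplaced sequence.
toy_genome = {"chr1": "ACGTNNNNACGT", "chr1_random": "nnACGT"}
effective_size = count_non_n_bases(toy_genome.values())
print(effective_size)  # 12
```

For a real genome you would stream the FASTA file rather than hold it in a dict, and then pass the resulting number to the peak caller in place of its default.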
When looking at the mapping results, we always keep an eye on the
mappability track to detect spurious findings. For mouse and human
they are easily available from UCSC. For other species they are more
difficult to obtain. In our case the authors of the
Piero Carninci pointed me to an interesting publication he co-authored. I think it is of interest for this post, since it highlights the immense importance of repeat regions.
Their message: SINE elements have so many other important aspects. They identified a role of SINEB2 elements in the stimulation of protein translation.