Question

News: 'Potential RNA-DNA Binding Site in Alu Repeat', or 'Why Should I Allow Multiple Mapping Positions When Analyzing ChIP-seq Data?'

4


11.4 years ago

David Langenberger 11k

In a recent publication we found an RNA binding site in an Alu repeat. The signal was lost when we turned off the multiple-mapping parameter of the mapping algorithm.

Thus, I recommend allowing multiple mappings when analyzing ChIP-seq data. You might find fancy stuff! :)

Holdt LM, Hoffmann S, Sass K, Langenberger D, Scholz M, et al. (2013) Alu Elements in ANRIL Non-Coding RNA at Chromosome 9p21 Modulate Atherogenic Cell Functions through Trans-Regulation of Gene Networks. PLoS Genet 9(7): e1003588. doi:10.1371/journal.pgen.1003588

chipseq rna • 3.7k views

ADD COMMENT • link updated 21 months ago by Ram 44k • written 11.4 years ago by David Langenberger 11k

score 3 · Answer 1 · 2013-07-05

3


11.4 years ago

KCC ★ 4.1k

Congratulations on your interesting finding!

I will admit to having only limited knowledge in this area. I have some concern about allowing reads that map to multiple regions. Let's say we have two regions, A and B, with identical sequence. If multi-mapping reads are placed randomly during alignment, these regions will get reads in a 50-50 ratio. Suppose that in the original sample only A has a TF binding site and B has none at all. In that case, we might create a signal for a false binding site at B.

One solution is to attribute multi-mapped reads to A and B in a ratio consistent with the unique reads mapped to each. This means the regions won't be under- or over-populated with reads relative to their positions in the genome, but it still has the problem that one might be imposing a pattern that is not there.
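To make the proportional idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function name `allocate_multireads`, the locus labels, and the counts are all made up, and it is not taken from any particular aligner. Each ambiguous read is assigned to a locus with probability proportional to the number of unique reads seen there, falling back to uniform random placement when no locus has unique reads.

```python
import random

def allocate_multireads(n_multi, unique_counts, rng=None):
    """Distribute n_multi ambiguous reads across candidate loci.

    unique_counts maps a (hypothetical) locus name to the number of
    uniquely mapped reads observed there. Each multi-read goes to a
    locus with probability proportional to that count. If no locus has
    any unique reads, placement is uniform at random.
    """
    rng = rng or random.Random()
    loci = list(unique_counts)
    weights = [unique_counts[locus] for locus in loci]
    if sum(weights) == 0:
        weights = [1] * len(loci)  # no evidence either way: 50-50 style placement
    assigned = dict.fromkeys(loci, 0)
    for locus in rng.choices(loci, weights=weights, k=n_multi):
        assigned[locus] += 1
    return assigned

# Illustrative numbers only: A has far more unique reads than B.
print(allocate_multireads(1000, {"A": 90, "B": 10}, random.Random(0)))
# With equal unique support, the split is close to even.
print(allocate_multireads(1000, {"A": 1, "B": 1}, random.Random(0)))
```

Note that this makes the caveat visible: a locus with zero unique reads never receives a multi-read, so whatever pattern the unique reads show is imposed on the ambiguous ones.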

Any thoughts?

ADD COMMENT • link 11.4 years ago by KCC ★ 4.1k

0

Actually, that is exactly the problem. You cannot say where the 'real' origin of the read is, so most researchers discard these reads. But if the read really comes from a repeat-associated region, you will miss it for sure. You miss ALL reads coming from repeat-associated regions, even though they might be real ones, like in our study.

ADD REPLY • link 11.4 years ago by David Langenberger 11k

1


Your post holds a lot of interest for me as I am working on a project where we absolutely expect that the repetitive elements in the genome will have some biological function. I have gone back and forth on what to do about the repetitive elements. So, it's intriguing that you were able to tease out some signal from the noise in multiple mapped reads.

I haven't 100% decided yet, but I think based on your post, I might make some modifications to my pipeline.

ADD REPLY • link 11.4 years ago by KCC ★ 4.1k

score 2 · Answer 2 · 2013-07-05

2


11.4 years ago

Istvan Albert 101k

To be honest with you, this is one of those things that has never ceased to irk me: people simply ignoring repetitive regions of the genome when producing large-scale reports on the potential functionality of the genome. Glad to see papers that point this out.

ADD COMMENT • link 11.4 years ago by Istvan Albert 101k

score 1 · Answer 3 · 2013-07-08

1


11.4 years ago

daiefa123 ▴ 150

Congrats on this manuscript.

Didn't you set a maximum number of multiple hits? The output BAM file must have been huge.

ADD COMMENT • link 11.4 years ago by daiefa123 ▴ 150

0

We used the default -M parameter of segemehl. That means that all seeds with more than 100 matches were discarded. But, of course, that value can be adjusted if you want to find more hits.

ADD REPLY • link 11.4 years ago by David Langenberger 11k

score 1 · Answer 4 · 2013-07-08

By default we always map multi-reads. However, one should take the following considerations into account:

It is important to use an index that contains all sequences, even those that are not properly placed in the assembly. For human and mouse, those are the _random chromosomes. If some sequences are left out, the random placement of multi-reads will be biased towards those in the assembly, creating artificial enrichments. Because some reads will be considered unique by the aligner if these extra regions are missing from the index, it is good general policy to always include them.

Removal of duplicates is complicated. We don't do it. If you remove duplicates, you bias your results towards repetitive regions. Imagine a read that maps to two distinct positions and a read that is unique. Now assume that each of these two reads is duplicated. By random placement of the multi-read, each of the two repetitive positions can receive one copy. For the unique read, the two duplicates will map to the same position. A duplicate-removal algorithm will flag only the unique read as a duplicate, because such algorithms are based on position. On the other hand, if reads with identical sequences are removed before mapping, repetitive regions have a higher chance of producing identical reads. Duplicate removal of this sort will preferentially target repetitive regions, especially highly repetitive ones. To reduce duplication rates, we use a low number of PCR cycles during sample preparation.

Some tools, especially peak callers, use the so-called 'effective genome size', which is the size of the genome after duplicated (not uniquely mappable) sequence is removed. If you are planning to do peak calling, you need to update these values instead of using the defaults. Furthermore, it is important to tell the peak caller not to remove duplicates.

When looking at the mapping results, we always keep an eye on the mappability track to detect spurious findings. For mouse and human these tracks are easily available from UCSC. For other species they are more difficult to obtain. In our case the authors of the
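The duplicate-removal argument above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the coordinates are invented, each read has exactly two copies, multi-reads are placed uniformly at random among their candidate positions, and `survivors_after_dedup` is a hypothetical helper, not part of any real tool.

```python
import random

def survivors_after_dedup(candidates, copies, rng):
    """Place `copies` identical reads, then keep one read per position.

    `candidates` lists every position the read can map to. A unique
    read has one candidate; a multi-read has several and is placed at
    one of them at random, independently for each copy.
    """
    placed = [rng.choice(candidates) for _ in range(copies)]
    return len(set(placed))  # position-based duplicate removal

rng = random.Random(42)
trials = 10_000

# Unique read: both copies land on the same (made-up) position.
unique_kept = sum(
    survivors_after_dedup(["chr1:1000"], 2, rng) for _ in range(trials)
) / trials

# Multi-read: two candidate positions, so copies often land apart.
multi_kept = sum(
    survivors_after_dedup(["chr1:5000", "chr7:9000"], 2, rng) for _ in range(trials)
) / trials

print("mean reads kept, unique read:", unique_kept)
print("mean reads kept, multi-read: ", multi_kept)
```

The unique read always collapses to a single copy, while the duplicated multi-read keeps about 1.5 copies on average in this setup, which is the bias towards repetitive regions described above.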

score 0 · Answer 5 · 2013-07-15

Piero Carninci pointed me to an interesting publication he co-authored. I think it is of interest for this post, since it highlights the immense importance of repeat regions.

Carrieri, C., Cimatti, L., Biagioli, M., Beugnet, A., Zucchelli, S., Fedele, S., ... & Gustincich, S. (2012). Long non-coding antisense RNA controls Uchl1 translation through an embedded SINEB2 repeat. Nature, 491(7424), 454-457.

Their message: SINE elements have so many other important aspects. They identified a role of SINEB2 elements in the stimulation of protein translation.