Question

How to annotate repeats?

7.0 years ago

Zee_S ▴ 60

Hello Biostars community!

I have a database of consensus transposon sequences for a model organism. These sequences come from diverse families and range from very short SINE elements to longer LINEs and LTRs. I want to annotate these repeats on the reference genome and later use those coordinates to correlate repeat density with ChIP-seq signal.

I would be happy to get your suggestions on tools that I can use to annotate the repeats. Also, what kind of normalization must one do to account for the large sequence length variation between different repeat families? For example, a SINE element could give many hits on the genome just because it is relatively short, whereas an LTR may not. And how do you validate your annotations in the end?

Thank you very much for your guidance!

repeats annotation SINE LINE normalization • 4.7k views

updated 7.0 years ago by Beuss ▴ 140 • written 7.0 years ago by Zee_S ▴ 60

Hi! Have you tried RepeatMasker? http://www.repeatmasker.org/

7.0 years ago by alessandrotestori7 ▴ 420

score 5 · Answer 1 · 2018-04-17

Hi,

As usual, it all depends on which question you want to answer. If you want an idea of the global quantity of repeats in your genome, a quick annotation with RepeatMasker/Repbase can be enough. Or is there a public annotation layer already available?

But in your case, because you are looking for a link between binding sites and transposable element (TE) presence/absence, you should go for a deeper annotation of repeats. Maybe a TE database dedicated to your species exists and could be used with TEannot (REPET pipeline) or RepeatMasker to obtain a better annotation. If the available databases are too distant from the species you are analysing, or if no data are available, you should go for de novo detection/annotation of repeats. This is a big task, but if you want to be exhaustive, you have no choice.

"what kind of normalization one must do to account for the large sequence length variation between different repeat families. for example, a sine element could give many hits on the genome just because it is relatively short, whereas an LTR may not"

You are wrong on this one point. Unless your SINE copies are shorter than log4(N) + 1 base pairs (where N is your genome size in base pairs), these copies are real and do not arise by chance. So you should not underestimate their importance in your analysis. Moreover, if they were that short, it would mean the annotation is of very poor quality.
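To put a number on the log4(N) + 1 threshold, here is a minimal Python sketch. The function name and the two genome sizes are illustrative assumptions of mine, not values from your organism.

```python
import math


def min_nonrandom_length(genome_size_bp: int) -> float:
    """Return log4(N) + 1 for a genome of N base pairs."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    # log4(N) is computed as log2(N) / 2.
    return math.log2(genome_size_bp) / 2 + 1


# Hypothetical genome sizes, for illustration only.
for label, size in [("100 Mb", 100_000_000), ("3 Gb", 3_000_000_000)]:
    print(f"{label}: {min_nonrandom_length(size):.1f} bp")
```

For genomes in this size range the threshold comes out at roughly 14 to 17 bp, so any hit that is tens of base pairs long or more is already well above it.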

"so how do you validate your annotations in the end?"

You could validate your annotation by validating the consensus sequences: check that each consensus you used has at least 3 complete copies in the genome. But you also have to be aware that TEs can diverge very fast, so a lot of degraded copies of the original TE are also present in the genome. That is why I prefer to use several consensus sequences (one for each main group of degraded copies), i.e. TE models, to describe and annotate the whole diversity of TEs.
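The "at least 3 complete copies" check could be sketched like this in Python. Everything here is a hypothetical illustration: the `Hit` record, the function names, the example data, and especially the 95% coverage cutoff used to call a copy "complete" are my assumptions, so adapt them to whatever your annotation tool actually outputs.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One annotated genomic copy (hypothetical record layout)."""
    consensus: str   # name of the consensus sequence that matched
    cons_start: int  # 1-based start of the match on the consensus
    cons_end: int    # 1-based inclusive end of the match on the consensus


def count_complete_copies(hits: Iterable[Hit],
                          consensus_lengths: Dict[str, int],
                          min_fraction: float = 0.95) -> Dict[str, int]:
    """Count copies spanning at least min_fraction of their consensus."""
    counts = {name: 0 for name in consensus_lengths}
    for hit in hits:
        length = consensus_lengths[hit.consensus]
        covered = hit.cons_end - hit.cons_start + 1
        if covered / length >= min_fraction:
            counts[hit.consensus] += 1
    return counts


def poorly_supported(hits: Iterable[Hit],
                     consensus_lengths: Dict[str, int],
                     min_copies: int = 3,
                     min_fraction: float = 0.95) -> List[str]:
    """Return consensus names with fewer than min_copies complete copies."""
    counts = count_complete_copies(hits, consensus_lengths, min_fraction)
    return sorted(name for name, n in counts.items() if n < min_copies)


# Made-up example data: consensus lengths and hits on each consensus.
lengths = {"SINE_A": 150, "LTR_B": 5000}
hits = [
    Hit("SINE_A", 1, 150), Hit("SINE_A", 3, 150), Hit("SINE_A", 1, 148),
    Hit("SINE_A", 60, 150),  # truncated copy, not counted as complete
    Hit("LTR_B", 1, 5000), Hit("LTR_B", 1, 400),
]
print(poorly_supported(hits, lengths))
```

Note that this simple version treats every hit independently, so a copy split into several fragments (for example by a nested insertion) would not be counted as complete unless the fragments are joined first.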

Anyway, this is a very large subject with a lot of debate.

Here is a sample of publications for discovering the beautiful world of TEs and their annotation: