Question

Comparing repeatmasker outputs between species


2.9 years ago

Harper • 0

Hello,

I am totally new to TE analysis (and bioinformatics in general). I ran RepeatMasker on three closely related species, and now I would like to find all the repeated elements common to all three. My .out files look like this:

 SW   perc perc perc  query       position in query              matching           repeat                position in repeat
 score   div. del. ins.  sequence    begin    end          (left)   repeat             class/family      begin   end    (left)       ID

 11220    0.0  0.0  0.0  ptg000001l         1     9492 (18167109) + (CTAAC)n           Simple_repeat           1   9491     (0)       1  
  1524   25.8  1.9  2.1  ptg000001l      9493    10027 (18166574) + rnd-1_family-70    Unknown               110    643     (0)       2  
  6766    6.2  0.3  3.2  ptg000001l     10032    10922 (18165679) + rnd-5_family-818   Unknown                 1    866     (0)       3  
  5127    5.1  0.0  0.0  ptg000001l     10924    11546 (18165055) + rnd-1_family-464   Unknown                 1    623     (0)       4  
  2991   13.3  3.2  5.9  ptg000001l     11547    12175 (18164426) + rnd-6_family-2133  LINE/R1                 1    613  (2635)       5

I was planning to use bedtools to do the intersection of columns 9 and 10 for the three species, but I do not know if that would be the correct way to do it or if there are other tools that could be more convenient?

Thank you very much!

repeatmasker • 953 views



The solution you propose requires that the genomes have exact (or very similar) coordinates for the annotations. That requirement may or may not hold; it all depends on what "closely related" means in your context.

ADD REPLY • link 2.9 years ago by Istvan Albert 102k

score 1 · Accepted Answer · 2022-01-14

Closely related is not close enough. Even within the same species, assemblies and genome size estimates can differ considerably, so coordinates cannot be compared directly, even between two chromosome-scale assemblies. A single insertion shifts every downstream coordinate and renders direct comparison meaningless. You would need something like LiftOver, but that relies on mapping by sequence similarity, and repeats, being repetitive, map to multiple places. If you still want to compare regions between genomes, you could build a whole-genome alignment with Mauve and then compare repeat density in the aligned regions. However, I think this approach is not very straightforward and might cause considerable frustration without giving you much in return.

Let me propose something different: focus on global repeat composition. It is fairly standard to report the coarse composition of repeat classes, such as DNA transposons, retrotransposons, LTR, LINE and SINE elements, and simple repeats. This information is found in the RepeatMasker output. The repeat composition should be comparable across different assemblies of the same or closely related species, and it doesn't even require that the assembly is very good.
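
To make the composition idea concrete, here is a minimal Python sketch that sums annotated base pairs per top-level repeat class from a .out file like the one in the question. The function names (`class_bp`, `composition`) are my own and not part of any tool. It assumes whitespace-separated columns as shown above, with the class/family in the 11th field, and it simply adds up interval lengths, so overlapping annotations would be counted twice. Treat it as a rough starting point rather than a replacement for RepeatMasker's own summary.

```python
import io
from collections import defaultdict


def class_bp(handle):
    """Sum annotated bp per top-level repeat class from a RepeatMasker .out file."""
    totals = defaultdict(int)
    for line in handle:
        fields = line.split()
        # Data rows start with an integer SW score; this skips headers and blank lines.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        top_class = fields[10].split("/")[0]  # e.g. "LINE/R1" -> "LINE"
        totals[top_class] += end - begin + 1  # overlaps are NOT merged (simplification)
    return dict(totals)


def composition(totals, genome_size):
    """Convert bp totals to percent of a user-supplied genome (or assembly) size."""
    return {cls: 100.0 * bp / genome_size for cls, bp in totals.items()}


# The five rows from the question, used here as a tiny test input.
SAMPLE = """\
 SW   perc perc perc  query       position in query    matching  repeat   position in repeat
 score   div. del. ins.  sequence    begin    end   (left)   repeat   class/family   begin   end   (left)   ID

 11220    0.0  0.0  0.0  ptg000001l         1     9492 (18167109) + (CTAAC)n           Simple_repeat           1   9491     (0)       1
  1524   25.8  1.9  2.1  ptg000001l      9493    10027 (18166574) + rnd-1_family-70    Unknown               110    643     (0)       2
  6766    6.2  0.3  3.2  ptg000001l     10032    10922 (18165679) + rnd-5_family-818   Unknown                 1    866     (0)       3
  5127    5.1  0.0  0.0  ptg000001l     10924    11546 (18165055) + rnd-1_family-464   Unknown                 1    623     (0)       4
  2991   13.3  3.2  5.9  ptg000001l     11547    12175 (18164426) + rnd-6_family-2133  LINE/R1                 1    613  (2635)       5
"""

totals = class_bp(io.StringIO(SAMPLE))
print(totals)  # {'Simple_repeat': 9492, 'Unknown': 2049, 'LINE': 629}

# Contig length taken from the first row: end + (left) = 9492 + 18167109.
percent = composition(totals, genome_size=18176601)
```

For real data you would call `class_bp(open("speciesA.out"))` once per species (the file name is hypothetical) and put the resulting percentages side by side. Comparing at the class level sidesteps coordinates entirely, which is the point of the suggestion above.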