Question

Should I use Dfam or a custom repetitive elements library (from PlantRep) as the repeat database when running RepeatMasker on a Linux machine?

0

8 months ago

Vijith ▴ 90

Recently, I completed the assembly of a plant genome; the species is a monocot angiosperm. After reading the informative responses by @SES and @Andrzej Zielezinski in a post, I was convinced of the importance of masking repetitive elements before predicting genes. I have decided to use RepeatMasker for this purpose, as mentioned in that post. Now, to the question:

I read that a repeat database needs to be installed and that Dfam is an open database of transposable elements (TEs): "a minimal version of Dfam 3.8 ( root partition ) can be downloaded automatically by the configure script. Additional taxa partitions may be downloaded and configured at any time." Incidentally, I came across PlantRep, a database of plant repetitive elements with annotated repeats from 459 plant genomes, 70 of which are monocots. So, which is better: downloading Dfam 3.8, or specifying a custom library that contains the repetitive elements from PlantRep in FASTA format?

Any input is highly appreciated.

genome sequence repeatmasker blast • 1.1k views

ADD COMMENT • link updated 8 months ago by b.contreras.moreira ▴ 310 • written 8 months ago by Vijith ▴ 90

score 1 · Answer 1 · 2024-03-26

1

8 months ago

b.contreras.moreira ▴ 310

In https://doi.org/10.1002/tpg2.20143 we found that RepeatMasker underestimated repeat content in plants when using REdat as the repeat database. The database was the limiting factor, as results improved with our own custom library, nrTEplants. More generally, I would expect RepeatMasker to work well if your genome of interest contains repeats similar to those in your reference database, be it PlantRep or another. Note that we did not test RepBase, which was the recommended database, because it required a subscription at the time. The current academic use agreement might be an option for you.
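To make the two options from the question concrete, here is a minimal Python sketch that only builds the RepeatMasker command line, either for a custom FASTA library (`-lib`) or for a taxon from the installed Dfam partitions (`-species`). The file names, the output directory and the `liliopsida` taxon are placeholders made up for illustration, not values from our study, so check `RepeatMasker -help` for your installed version before running anything.

```python
import shlex


def repeatmasker_cmd(genome, lib=None, species=None, threads=4, out_dir="rm_out"):
    """Build a RepeatMasker argument list for one library source.

    Exactly one of `lib` (custom FASTA library) or `species` (taxon name
    looked up in the installed Dfam partitions) must be given.
    """
    if (lib is None) == (species is None):
        raise ValueError("give exactly one of lib= or species=")
    cmd = [
        "RepeatMasker",
        "-pa", str(threads),   # parallel jobs
        "-xsmall",             # soft-mask (lowercase) instead of N
        "-gff",                # also write a GFF file of the hits
        "-dir", str(out_dir),  # output directory
    ]
    if lib is not None:
        cmd += ["-lib", str(lib)]
    else:
        cmd += ["-species", species]
    cmd.append(str(genome))
    return cmd


# Hypothetical inputs: replace with your own assembly and library.
custom = repeatmasker_cmd("assembly.fa", lib="plantrep_monocots.fa")
dfam = repeatmasker_cmd("assembly.fa", species="liliopsida")
print(shlex.join(custom))
print(shlex.join(dfam))
```

The printed commands can be pasted into a shell or passed to `subprocess.run(custom, check=True)` once RepeatMasker is installed. Soft-masking is chosen here because gene predictors generally accept lowercase-masked input, but that is a preference rather than a requirement.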

Anyway, in that study we concluded that repeat masking by k-mer analysis works well in plants: it is very fast and does not require a database. You can save a lot of time if you don't need to annotate the repeats and masking them to guide gene annotation is sufficient. You can still annotate them by sequence similarity afterwards, but this is optional. See our protocol at https://github.com/Ensembl/plant-scripts/tree/master/repeats
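For intuition only, the idea behind database-free k-mer masking can be sketched in a few lines of Python: count every k-mer in the sequence and soft-mask the bases covered by k-mers that occur many times. This is a toy illustration under assumptions of my own (the function name, `k` and `min_count` are made up), not the tool used in the linked protocol, and it ignores reverse complements and memory limits that matter on a real genome.

```python
from collections import Counter


def kmer_softmask(seq, k=13, min_count=3):
    """Lowercase every base covered by a k-mer seen at least min_count times."""
    seq = seq.upper()
    n = len(seq) - k + 1
    if n <= 0:
        return seq  # sequence shorter than k: nothing to count
    counts = Counter(seq[i:i + k] for i in range(n))
    masked = [False] * len(seq)
    for i in range(n):
        if counts[seq[i:i + k]] >= min_count:
            for j in range(i, i + k):
                masked[j] = True
    return "".join(b.lower() if m else b for b, m in zip(seq, masked))


# Toy sequence: unique flanks around four tandem copies of a 7-mer.
demo = "ACGTTGCT" + "GATTACA" * 4 + "CCGTAGCT"
print(kmer_softmask(demo, k=7, min_count=3))
```

In the toy sequence the four tandem copies come out in lowercase while the unique flanks stay uppercase, which is the soft-masked form gene predictors usually expect.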

A highly cited protocol for annotation of plant repeats is https://github.com/oushujun/EDTA, which was described at https://doi.org/10.1186/s13059-019-1905-y

Hope this helps

ADD COMMENT • link 8 months ago by b.contreras.moreira ▴ 310

0

b.contreras.moreira, thank you for the detailed answer. Could you elaborate a bit on nrTEplants, mentioned in your answer? Is it a database that I could use in RepeatMasker as is? Also, going through the PlantRep database, I understood that it was constructed using Dfam and RepBase as references. In that case, can I use the Dfam database along with nrTEplants (as a -lib argument)? Thank you!

ADD REPLY • link 8 months ago by Vijith ▴ 90

1

If PlantRep contains species related to yours, I would recommend downloading their repeat sequences and using them to annotate your own repeats.

Regarding nrTEplants, you can also give it a try, but it will only perform well if the repeat collections used to compile it (listed at https://github.com/Ensembl/plant_tools/tree/master/bench/repeat_libs) contain sequences close to those in your genome. Given the number of species in PlantRep, that's probably the safer bet.
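On combining sources: as far as I know, RepeatMasker searches a single library per run, and a file given with `-lib` is used instead of the taxon-based selection rather than in addition to it. If you want several collections in one pass, one option is to concatenate them into a single FASTA file first. Below is a minimal Python sketch of that step; the function names and file names are hypothetical, and it only drops exact duplicate sequences, so near-identical families would still need a proper clustering step.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_libraries(paths, out_path):
    """Concatenate FASTA libraries, skipping exact duplicate sequences.

    Returns the number of records written. The first header seen for a
    given sequence is the one that is kept.
    """
    seen = set()
    kept = 0
    with open(out_path, "w") as out:
        for path in paths:
            for header, seq in read_fasta(path):
                key = seq.upper()
                if not seq or key in seen:
                    continue
                seen.add(key)
                out.write(f">{header}\n{seq}\n")
                kept += 1
    return kept
```

A call such as `merge_libraries(["plantrep_monocots.fa", "nrTEplants.fa"], "combined_lib.fa")` (file names are placeholders) would then give you one file to pass as the `-lib` argument.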

ADD REPLY • link 8 months ago by b.contreras.moreira ▴ 310