I recently completed the assembly of a plant genome; the species is a monocot angiosperm. After reading the informative responses by @SES and @Andrzej Zielezinski in a post, I was convinced of the importance of masking repetitive elements before predicting genes. I have decided to use RepeatMasker for this purpose, as mentioned in that post. Now, coming to the question:
- I read that a repeat database needs to be installed, and that Dfam is an open database of transposable elements (TEs).
"a minimal version of Dfam 3.8 ( root partition ) can be downloaded automatically by the configure script. Additional taxa partitions may be downloaded and configured at any time."
Incidentally, I came across PlantRep, a database of plant repetitive elements with annotated repeats from 459 plant genomes (70 of them monocots). So, which is better: downloading Dfam 3.8, or specifying a custom library that contains the repetitive elements from PlantRep in FASTA format?
Any input is highly appreciated.
b.contreras.moreira, thank you for the detailed answer. Could you elaborate a bit on the nrTEplants library mentioned in your answer? Is it a database that I could use in RepeatMasker as is? Also, going through the PlantRep database, I understood that it was constructed using Dfam and RepBase as references. In that case, can I use the Dfam database along with nrTEplants (as a -lib argument)? Thank you!
If PlantRep includes species related to yours, I would recommend downloading their repeat sequences and using them to annotate the repeats in your own genome.
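As a rough illustration of that suggestion, here is a minimal Python sketch that assembles a RepeatMasker command line using a custom FASTA library. The file names `my_genome.fa` and `plantrep_related.fa`, the thread count and the output directory are placeholders I made up; only the RepeatMasker options `-lib`, `-pa`, `-xsmall`, `-gff` and `-dir` are taken from the tool itself, and you may well want different settings.

```python
def build_repeatmasker_cmd(genome_fasta, repeat_lib, threads=4, out_dir="rm_out"):
    """Return a RepeatMasker command (as a list) that masks with a custom library.

    All paths are hypothetical placeholders; adjust them to your own files.
    """
    return [
        "RepeatMasker",
        "-lib", repeat_lib,    # custom repeat library in FASTA format
        "-pa", str(threads),   # number of parallel jobs
        "-xsmall",             # soft-mask: repeats in lowercase instead of N
        "-gff",                # also write the annotation as GFF
        "-dir", out_dir,       # output directory
        genome_fasta,
    ]


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    cmd = build_repeatmasker_cmd("my_genome.fa", "plantrep_related.fa")
    print(" ".join(cmd))
```

The printed command can be pasted into a shell, or the list can be handed to `subprocess.run(cmd, check=True)` once RepeatMasker is installed. Soft-masking (`-xsmall`) is my assumption of what is convenient before gene prediction; check what your gene predictor expects.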
Regarding nrTEplants, you can also give it a try, but it will only perform well if the repeat collections used to compile it (listed at https://github.com/Ensembl/plant_tools/tree/master/bench/repeat_libs) contain sequences close to those in your genome. Given the number of species in PlantRep, PlantRep is probably the safer bet.
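If you want to try more than one collection at once (say, PlantRep sequences from related species plus nrTEplants), one option is to concatenate the FASTA files into a single library and pass that file to `-lib`. The sketch below does this in plain Python and skips records whose identifier was already seen. The function name and file names are hypothetical, and dropping duplicates by identifier is my own precaution rather than a documented RepeatMasker requirement.

```python
def merge_fasta_libraries(input_paths, output_path):
    """Concatenate FASTA libraries into one file.

    A record is skipped if its ID (first word of the header) was already
    written. Returns the number of records kept.
    """
    seen_ids = set()
    kept = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            keep_record = False
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        fields = line[1:].split()
                        seq_id = fields[0] if fields else ""
                        keep_record = seq_id not in seen_ids
                        if keep_record:
                            seen_ids.add(seq_id)
                            kept += 1
                    # Write header and sequence lines of kept records,
                    # dropping blank lines and ensuring a trailing newline.
                    if keep_record and line.strip():
                        out.write(line if line.endswith("\n") else line + "\n")
    return kept


if __name__ == "__main__":
    import os

    # Placeholder file names, for illustration only.
    libs = ["plantrep_related.fa", "nrTEplants.fa"]
    if all(os.path.exists(p) for p in libs):
        n = merge_fasta_libraries(libs, "combined_repeats.fa")
        print(f"wrote {n} records to combined_repeats.fa")
```

Note that this only removes records with identical identifiers; redundant sequences under different names would still be present, so clustering the merged library may be worth considering.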