We have a de novo genome assembly and are looking for repetitive elements (ideally transposons) to submit to NCBI and RepBase. So far, the plan is:
- Mask known repeats in the genome with RepeatMasker and the RepBase libraries
- De novo repeat finding on the masked genome with RepeatScout, including filtering out low-complexity regions that RepeatMasker didn't pick up.
- Filter out repeats that have matches in gene regions (the sequences are likely to belong to a gene family, or be part of a conserved domain)
- BLAST each of the repeat sequences identified by RepeatScout against NR, discarding sequences that match genes or previously identified transposons.
- Submit remaining sequences to RepBase and NCBI as unclassified repeats.
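
To make the BLAST filtering step concrete, here is a minimal sketch of how I imagine it, not a tested pipeline. It assumes the BLAST results are in the standard 12-column tabular format (`-outfmt 6`, with the e-value in column 11) and that any query with a hit at or below an e-value cutoff gets discarded. The function names, the `1e-5` cutoff, and the sequence IDs `rep1`/`rep2` are all made up for illustration.

```python
import csv
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token as the ID,
            # which is what BLAST reports as the query ID.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def queries_with_hits(blast_handle, max_evalue=1e-5):
    """Return the set of query IDs that have at least one hit
    with e-value <= max_evalue (hypothetical cutoff)."""
    hits = set()
    for row in csv.reader(blast_handle, delimiter="\t"):
        # Skip blank lines, comment lines, and malformed rows.
        if len(row) < 12 or row[0].startswith("#"):
            continue
        if float(row[10]) <= max_evalue:  # column 11 is the e-value
            hits.add(row[0])
    return hits


def filter_repeats(fasta_handle, blast_handle, max_evalue=1e-5):
    """Keep only the repeat sequences that have no significant BLAST hit."""
    hits = queries_with_hits(blast_handle, max_evalue)
    return [(n, s) for n, s in read_fasta(fasta_handle) if n not in hits]


# Toy data: rep1 has a strong hit and is dropped, rep2 only a weak one.
fasta = io.StringIO(">rep1\nACGTACGT\n>rep2 some description\nTTTTGGGG\n")
blast = io.StringIO(
    "rep1\tsubjA\t95.0\t8\t0\t0\t1\t8\t1\t8\t1e-20\t50\n"
    "rep2\tsubjB\t80.0\t8\t1\t0\t1\t8\t1\t8\t0.5\t12\n"
)
kept = filter_repeats(fasta, blast)
print(kept)  # [('rep2', 'TTTTGGGG')]
```

In practice the discard decision would also have to look at what the hit is (gene vs. known transposon), which this sketch ignores; it only shows the bookkeeping.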
This process feels incomplete to me, and it doesn't include any classification. Is there a formal process for identifying and classifying repetitive elements in de novo genome assemblies?
The descriptions there don't go much further than what I had already outlined, but they did link me to the very comprehensive list of tools at the Bergman Lab, which led me to their review article. I might sketch out an answer based on the article later.