Hi everyone,
I am running RepeatModeler2 to create a de novo TE library for a PacBio bird genome. My goal is to curate the library and produce a high-quality transposable element annotation of the genome, and also to use the repeat library to mask the genome before gene annotation.
According to the RepeatModeler2 paper, the default is to sample a total of 363 Mbp from the genome, which works out to less than 20% of most vertebrate genomes. However, there is an option to change the sample size, including sampling the entire genome.
I want to make sure I understand the tradeoff here. Is the rationale that you are likely to find most high-copy-number repeats within a small sample of the genome, so that a larger sample size leads to diminishing returns? If I have the server time, would it be ideal to run RepeatModeler with complete coverage? Or is there some downside to covering the whole genome that I am unaware of?
If you have experience with this and a moment to respond, I would appreciate it greatly!
I'm not sure, but I also think the rationale is that you'll probably find representatives of most repeat families in those 363 Mb, so it isn't necessary to use the whole genome. Using only 363 Mb saves time. That said, the results may be somewhat more accurate if you use the whole genome.
Yeah, that's what I was thinking! I don't see people talk about their sample size often in papers, so I may try a few different ones and see how different the results are.
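For a rough sense of the numbers discussed above, here is a quick Python sketch of how much of a genome a 363 Mbp sample could cover. The genome sizes are illustrative assumptions, not values from the thread (about 1,200 Mbp for a typical bird assembly, about 3,100 Mbp for a human-sized genome). The function name is made up for this example, and the result is an upper bound that assumes the sampled regions do not overlap.

```python
# Back-of-the-envelope: what fraction of a genome does a fixed-size sample cover?
# 363 Mbp is the default total quoted in the thread; everything else is assumed.

DEFAULT_SAMPLE_MBP = 363


def sampled_fraction(genome_mbp, sample_mbp=DEFAULT_SAMPLE_MBP):
    """Return the fraction of the genome a sample could cover, capped at 1.0.

    Assumes sampled regions do not overlap, so this is an upper bound.
    """
    if genome_mbp <= 0:
        raise ValueError("genome_mbp must be positive")
    return min(sample_mbp / genome_mbp, 1.0)


if __name__ == "__main__":
    # Hypothetical genome sizes in Mbp, chosen only for illustration.
    examples = [
        ("bird-sized genome (assumed 1200 Mbp)", 1200),
        ("human-sized genome (assumed 3100 Mbp)", 3100),
    ]
    for label, size_mbp in examples:
        frac = sampled_fraction(size_mbp)
        print(f"{label}: {frac:.1%} sampled")
```

Under these assumed sizes, the default sample is a noticeably larger share of a bird-sized genome than of a mammal-sized one, which may be worth keeping in mind when comparing runs at different sample sizes.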