Hi, I am currently using Maker to annotate a de novo genome, and I'm using RepeatModeler to find repeats in it. There is, however, very little information about the inner workings of RepeatModeler, and I was wondering if someone could help me clear something up.
Specifically, I was wondering how the sampling works in RepeatModeler. On their webpage they give the following example data:
Genome       DB Size (bp)   Sample Size (bp)***   Run Time (hh:mm)*   Models Built   Models Classified   % Sample Masked**
-----------  -------------  --------------------  ------------------  -------------  ------------------  -----------------
Human HG18   3.1 Bbp        238 Mbp               46:36               614            611                 35.66
Zebrafinch   1.3 Bbp        220 Mbp               63:57               233            104                 9.41
Sea Urchin   867 Mbp        220 Mbp               40:03               1830           360                 33.85
diatom       32,930,227     32,930,227            4:41                128            35                  2.86
Rabbit       11,770,949     11,770,949            3:14                83             72                  31.30
*** Sample size does not include 40 Mbp used in the RepeatScout analysis. This 40 Mbp is randomly chosen and may overlap 0-100% of the sample used in the RECON analysis.
So it's clear that they use 40 Mbp for RepeatScout, but does RepeatModeler take that from the biggest contig, or does it sample multiple contigs until it reaches 40 Mbp?
They then reach roughly 220 Mbp in most of the tests, but after looking at RepeatModeler's code I found some static limits that seem to contradict these sample sizes. Below are the lines I mean (lines 255 through 263):
my $rsSampleSize               = 40000000;   # The size of the sequence given to RepeatScout. ( round #1 )
my $fragmentSize               = 40000;      # The size of the batches for all-vs-other search.
my $genomeSampleSizeStart      = 3000000;    # The initial sample size for RECON analysis
my $genomeSampleSizeMax        = 100000000;  # The max sample size for RECON analysis
my $genomeSampleSizeMultiplier = 3;          # The multiplier for sample size between rounds
These limits suggest that 40 Mbp is sent to RepeatScout, and that 3 Mbp (round 2), 9 Mbp (round 3), 27 Mbp (round 4), and 81 Mbp (round 5) are sent to RECON (because the max sample size is 100 Mbp, a 6th round of 243 Mbp can't start). In my train of thought this gives a total of 40 + 3 + 9 + 27 + 81 = 160 Mbp sampled. So what am I missing in this process?
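To make my arithmetic explicit, here is a minimal Perl sketch of the sampling schedule as I understand it. This is my own reconstruction from the constants above, not RepeatModeler's actual loop, and it assumes each RECON round simply multiplies the previous sample size by $genomeSampleSizeMultiplier until $genomeSampleSizeMax would be exceeded:

use strict;
use warnings;

# Constants copied from the RepeatModeler snippet above.
my $rsSampleSize               = 40000000;   # round 1: given to RepeatScout
my $genomeSampleSizeStart      = 3000000;    # round 2: first RECON sample
my $genomeSampleSizeMax        = 100000000;  # cap on a single RECON sample
my $genomeSampleSizeMultiplier = 3;          # growth factor between rounds

my $total = $rsSampleSize;
my $size  = $genomeSampleSizeStart;
my $round = 2;
while ( $size <= $genomeSampleSizeMax ) {
    printf "round %d: %d bp sent to RECON\n", $round, $size;
    $total += $size;
    $size  *= $genomeSampleSizeMultiplier;
    $round++;
}
printf "total sampled: %d bp\n", $total;

Under these assumptions this prints rounds 2 through 5 (3 Mbp, 9 Mbp, 27 Mbp, 81 Mbp) and a total of 160 Mbp, which is how I arrive at a number that doesn't match the ~220 Mbp in the table.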
Thanks in advance for any information regarding this subject :D
Thanks, that seems to explain why they end up with the 220 Mbp on two of their runs.
But then how do they reach the 238 Mbp in their human test set? It seems strange that there is no explanation of this crucial part of their program on the website.