I am working on a (plant) genome annotation pipeline and would like some advice regarding repeat masking. My pipeline combines several ab initio gene prediction tools (Augustus, GlimmerHMM and SNAP) with transcript alignment evidence (PASA), protein alignment evidence (GenomeThreader) and gene liftover (Liftoff), and finally generates gene models using EVidenceModeler.
I am wondering about the best way to go about repeat masking within this pipeline. Specifically, my questions are:
- When should I do it? Should the masking be done right at the beginning, before running any ab initio or alignment tool? Or should I generate gene models on the unmasked genome, intersect them with repeat annotations only at the end, and filter using a more sophisticated method?
- Should I apply hard or soft masking?
- What software should I use? I see for instance that EDTA can be used for TE detection, but should I also use a tool like RepeatMasker for other types of repetitive elements, or is this redundant in some way?
I should mention that my main focus is protein coding genes, and I'm not so interested in TE annotation and classification at this point.
Any suggestion or advice is welcome. Thank you!
@liorglic Curious whether you have been able to settle on tools at this point?
Still not much of an expert, but I think masking at the beginning is the way to go. Running EDTA and RepeatMasker should do the trick, but honestly I'm not sure my advice is very reliable...
Thanks, I appreciate your feedback. For GenomeThreader (GT), I was wondering how you handled its speed. It seems very slow. Is there a way to speed it up? I couldn't find much information online about speeding up GT.
I think the best you can do is slice the genome into windows of fixed size, let GT work on each of them separately, and then combine everything at the end. This way you can parallelize the work. There may also be newer/better alternatives to GT, but I am not aware of them.
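To make the windowing idea concrete, here is a minimal Python sketch of the slicing and the coordinate bookkeeping. It is only an illustration under some assumptions: the function names (`iter_windows`, `slice_record`, `to_genome_coords`) are made up for this example, coordinates are 0-based and half-open, and the optional `overlap` parameter is my own addition, on the assumption that a gene straddling a window boundary would otherwise be cut in two. It does not call GT itself; each window would be written to its own FASTA file and passed to a separate GT job.

```python
from typing import Iterator, Tuple


def iter_windows(seq_len: int, size: int, overlap: int = 0) -> Iterator[Tuple[int, int]]:
    """Yield 0-based, half-open (start, end) windows covering [0, seq_len)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    start = 0
    while start < seq_len:
        end = min(start + size, seq_len)
        yield start, end
        if end == seq_len:  # last window reached the end of the sequence
            break
        start += step


def slice_record(name: str, seq: str, size: int, overlap: int = 0) -> Iterator[Tuple[str, int, str]]:
    """Yield (window_id, window_start, subsequence) for one sequence.

    The window start is kept so that hits can be shifted back later.
    """
    for start, end in iter_windows(len(seq), size, overlap):
        yield f"{name}:{start}-{end}", start, seq[start:end]


def to_genome_coords(window_start: int, local_start: int, local_end: int) -> Tuple[int, int]:
    """Shift a hit reported relative to a window back to genome coordinates."""
    return window_start + local_start, window_start + local_end


if __name__ == "__main__":
    # Toy sequence; real windows would be far larger (hypothetical sizes).
    for window_id, start, subseq in slice_record("chr1", "ACGTACGTAC", size=4, overlap=1):
        print(window_id, start, subseq)
```

When combining results, hits found in overlapping regions will appear twice (once per window), so the merged output needs a de-duplication step after the coordinates are shifted back.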