I want to know whether repeat-masking a genome or genome fragment before gene annotation is a good idea. The literature (http://www.nature.com/nrg/journal/v1...l/nrg3174.html, http://onlinelibrary.wiley.com/doi/1...eva.12178/full) and software such as MAKER and PASA advocate repeat-masking before annotating genes. Others say that gene annotation pipelines should be run twice, with and without repeat-masking. I understand the logic of repeat-masking first and then annotating. What I don't understand is the point of annotating an unmasked genome, especially when repeat-related genes are not my concern. So I am looking for answers to the following questions.
1) What are the chances that a non-repeat-related gene contains a repetitive region (let's say part of a gag-pol domain is present in a gene, or an exon contains a satellite repeat)? Have any such cases been reported?
2) For reference-guided transcriptome assembly, is it recommended to use a masked genome? I agree that for expression quantification, masking may lead to overestimation or under-representation in some cases.
Not sure whether you have already got answers elsewhere. I just want to share some points; correct me if I am wrong.
Repetitive sequences make up 40~60% of many eukaryotic genomes, especially the complex ones, so an unmasked genome is very likely to yield too many gene models. Yes, some repeats may look like exons, and may even look like coding exons. Some repeats, the transposable elements (TEs), really are protein-coding, but those proteins only serve to move the TEs around within the genome.
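If you want to see how often your own gene models run into repeats, one rough way is to measure how much of each exon is soft-masked. The sketch below is only an illustration: it assumes the genome has been soft-masked (repeats in lowercase) and that exon coordinates are 0-based and half-open. The function name masked_fraction, the toy sequence and the exon list are all made up for the example.

```python
def masked_fraction(seq, start, end):
    """Return the fraction of bases in seq[start:end] that are lowercase.

    Assumption: lowercase means "soft-masked repeat", as written by
    repeat-masking tools when soft-masking is requested.
    """
    region = seq[start:end]
    if not region:
        return 0.0
    masked = sum(1 for base in region if base.islower())
    return masked / len(region)


# Toy sequence: positions 4..9 are soft-masked (hypothetical data).
seq = "ACGTacgtacGTAC"

# Hypothetical exon coordinates, 0-based, end-exclusive.
exons = [(0, 4), (2, 8), (4, 10)]

for start, end in exons:
    frac = masked_fraction(seq, start, end)
    print(f"exon {start}-{end}: {frac:.2f} of bases in repeats")
```

Exons with a high masked fraction are the ones worth a second look before deciding whether they are real genes or TE-derived predictions.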
For reference-guided transcriptome assembly, masking probably has less effect, but a soft-masked reference would still be better. Filtering out counts for repeats such as rRNA regions may change the expression matrix a little, as those regions are usually highly expressed even with rRNA depletion or polyA enrichment protocols. The normalization step of some DE methods is based on sets of highly expressed genes.
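To make the soft-masked vs. hard-masked difference concrete: soft-masking keeps the repeat bases but writes them in lowercase, while hard-masking replaces them with N, so the sequence itself is lost. Here is a minimal sketch that turns soft-masked FASTA lines into hard-masked ones, assuming lowercase marks repeats. The function name hard_mask_fasta and the toy record are hypothetical, not part of any tool.

```python
def hard_mask_fasta(lines):
    """Yield FASTA lines with soft-masked (lowercase) bases replaced by N.

    Header lines (starting with '>') are passed through unchanged.
    Assumption: lowercase bases are soft-masked repeats.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line
        else:
            yield "".join("N" if base.islower() else base for base in line)


# Toy soft-masked record (hypothetical data).
soft_masked = [">chr1 demo\n", "ACgtAC\n", "ttGGcc\n"]

for out_line in hard_mask_fasta(soft_masked):
    print(out_line)
```

Going the other way is not possible, which is one reason to keep the soft-masked file as the working reference and derive a hard-masked copy only when a tool needs it.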
Hi:
Masking the genome before annotation is standard procedure; otherwise annotation becomes difficult because of the repeat regions. You can use the software RepeatModeler and RepeatMasker to identify and mask the repeat regions.