masked vs unmasked genome annotation comparison
2.6 years ago
mthm ▴ 50

I have created two genome annotations with BRAKER: one using the soft-masked version of the genome and one using the unmasked genome. As expected, the unmasked genome yields many more annotated regions, but I noticed that the regions found in both files have the same coordinates and are comparable. How can I compare these two files, and how can I know which one is more accurate?

annotation masked • 2.5k views

Soft-masked regions are regions where the base letters are converted to lowercase, whereas hard masking converts the bases to N. Neither type of masking changes sequence length, so feature coordinates are unaffected. Whether soft masking is better than no masking will depend on your use case: some software ignores soft masking, while other software takes it into account.
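To make the difference concrete, here is a minimal sketch of both masking styles applied to a toy sequence. The helper names `soft_mask` and `hard_mask` and the 0-based, half-open repeat interval are assumptions for illustration only; they are not how any particular masking tool is implemented.

```python
def soft_mask(seq, intervals):
    """Lowercase the bases inside each 0-based, half-open (start, end) interval."""
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(start, end):
            chars[i] = chars[i].lower()
    return "".join(chars)


def hard_mask(seq, intervals):
    """Replace the bases inside each 0-based, half-open (start, end) interval with N."""
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(start, end):
            chars[i] = "N"
    return "".join(chars)


# Toy sequence with one hypothetical repeat at positions 4..7 (0-based).
seq = "ACGTACGTACGT"
repeats = [(4, 8)]

soft = soft_mask(seq, repeats)
hard = hard_mask(seq, repeats)

print(soft)  # ACGTacgtACGT
print(hard)  # ACGTNNNNACGT
```

Both outputs have the same length as the input, which is why coordinates of features found in both annotations line up.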


rpolicastro, so basically, if a repeat element falls inside an intron or exon, it won't be removed during gene annotation even though it is soft-masked, right? Does it only mean that, by ignoring them, the annotator can detect the region boundaries more accurately?

2.6 years ago

Hi, is there a reference genome with annotations? If yes, compare it with the reference annotation. If not, you are facing the problem that you do not know if your annotation is accurate or not (independent of which tool or if it was soft-masked or not). That means you can only estimate accuracy.

There are multiple ways to proceed. For example, you can check the coding sequences with BUSCO and see whether all expected genes are present for your taxonomic lineage (or in which version more of them are present). Another option is to perform functional annotation with strict criteria (to reduce random hits) and see statistically whether you catch more genes.

As a side bonus, you can perform single genome annotations with MOSGA (with BRAKER), with and without a masking tool, and, in a second step, run MOSGA's gene comparison (in the comparative genomics mode) on these annotations. That will give you a value for how similar the independent gene predictions are.
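For a quick first look before running a full comparison tool, a rough sketch like the following can count how many gene features have identical coordinates in the two annotations. This is not MOSGA's method; it is a simple exact-match comparison that assumes tab-separated GTF/GFF-style lines with a `gene` feature type in column 3. The toy records and the function names are hypothetical.

```python
def read_features(lines, feature_type="gene"):
    """Collect (seqid, start, end, strand) for one feature type from GTF/GFF lines."""
    feats = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        feats.add((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return feats


def compare(a, b):
    """Summarise exact-coordinate agreement between two feature sets."""
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Toy records standing in for the two annotation files (hypothetical data).
masked_lines = [
    "chr1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tg1",
    "chr1\tAUGUSTUS\tgene\t2000\t3500\t.\t-\t.\tg2",
]
unmasked_lines = [
    "chr1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tg1",
    "chr1\tAUGUSTUS\tgene\t2000\t3500\t.\t-\t.\tg2",
    "chr1\tAUGUSTUS\tgene\t5000\t5600\t.\t+\t.\tg3",
]

result = compare(read_features(masked_lines), read_features(unmasked_lines))
print(result)
```

With real files, pass an open file handle instead of the toy lists, e.g. `read_features(open(path))`. Exact matching is strict; genes whose boundaries differ by a few bases count as different, so this gives a lower bound on agreement.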

Usually, a masked genome only means that specific regions will be ignored. Because of the random seed, there is a chance that BRAKER will produce similar but not identical results per run. Are you sure that your differences derive only from the masking? Did you try running BRAKER multiple times?


Thanks for your thorough explanation. I have annotated the genome using an RNAseq reference. I ran BRAKER with and without the --softmasking argument, independently, depending on the type of input genome. My goal is to find the distribution of transposable elements throughout the genome based on the annotated regions. I have the TE coordinates and intersected them with both genome annotations and, of course, got more hits for the unmasked genome: e.g. exon 3815, intron 35147 TEs for the soft-masked genome and exon 15025, intron 46277 TEs for the unmasked genome. But I have no idea which one is closer to reality. I think comparing the two annotations using MOSGA's gene comparison is a good idea; I will try that, thanks.
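The intersection described above can be sketched as follows. This is a minimal illustration, not the tool actually used in the thread: it assumes 1-based, inclusive coordinates (as in GFF/GTF), uses hypothetical toy intervals, and counts a TE as a hit if it overlaps at least one exon by one base or more.

```python
from collections import defaultdict


def group_by_chrom(intervals):
    """Group (chrom, start, end) tuples into {chrom: [(start, end), ...]}."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    return by_chrom


def count_overlapping(tes, features):
    """Count TEs that overlap at least one feature (1-based, inclusive coordinates)."""
    feats = group_by_chrom(features)
    hits = 0
    for chrom, t_start, t_end in tes:
        if any(t_start <= f_end and f_start <= t_end
               for f_start, f_end in feats.get(chrom, [])):
            hits += 1
    return hits


# Hypothetical toy data: three TEs and the exons from each annotation.
tes = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
exons_masked = [("chr1", 150, 300)]
exons_unmasked = [("chr1", 150, 300), ("chr1", 550, 700)]

hits_masked = count_overlapping(tes, exons_masked)
hits_unmasked = count_overlapping(tes, exons_unmasked)

print(hits_masked, hits_unmasked)  # 1 2
```

Because the unmasked annotation contains more exons, raw hit counts will be higher almost by construction, so the counts alone do not say which annotation is closer to reality.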
