Question

masked vs unmasked genome annotation comparison


2.6 years ago

mthm ▴ 50

I have created two genome annotations with BRAKER, one from the soft-masked version of the genome and one from the unmasked genome. As expected, the unmasked genome yields many more annotated regions, but I noticed that the regions found in both files have the same coordinates and are directly comparable. How can I compare these two files, and how can I tell which one is more accurate?

annotation masked • 2.5k views



Soft-masked regions are regions where the bases are converted to lowercase, whereas hard masking converts the bases to N. Neither type changes the sequence length, so masking has no effect on feature coordinates. Whether soft masking is better than no masking depends on your use case: some software ignores soft masking, while other software takes it into account.
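To make the point about coordinates concrete, here is a minimal sketch in Python. The toy sequence and the helper names `masking_summary` and `hard_mask` are made up for illustration and are not part of BRAKER or any masking tool.

```python
def masking_summary(seq: str) -> dict:
    """Count soft-masked (lowercase) and hard-masked (N) bases in a sequence."""
    hard = sum(1 for base in seq if base in "Nn")
    soft = sum(1 for base in seq if base.islower() and base != "n")
    return {"length": len(seq), "soft_masked": soft, "hard_masked": hard}


def hard_mask(seq: str) -> str:
    """Turn a soft-masked sequence into a hard-masked one (lowercase -> N)."""
    return "".join("N" if base.islower() else base for base in seq)


soft_masked_seq = "ACGTacgtacgtACGT"  # made-up toy sequence, not real data
hard_masked_seq = hard_mask(soft_masked_seq)

print(masking_summary(soft_masked_seq))
print(masking_summary(hard_masked_seq))
# Both versions have the same length, so a feature at positions 5-12
# points at the same stretch of the genome in either one.
```

Running the same summary over each record of a real FASTA file would show how much of the genome the masking step actually touched.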

2.6 years ago by rpolicastro 13k


rpolicastro so basically, if a repeat element falls inside an intron or exon, it won't be removed during gene annotation even though it is soft-masked, right? Does it only mean that, by ignoring these regions, the annotator can detect the feature boundaries more accurately?

2.6 years ago by mthm ▴ 50

score 4 · Accepted Answer · 2022-04-29

Hi, is there a reference genome with annotations? If so, compare your annotations with the reference annotation. If not, you face the problem that you cannot know whether your annotation is accurate (regardless of the tool used or whether the genome was soft-masked). In that case you can only estimate accuracy.

There are multiple ways to proceed. For example, you can check the coding sequences with BUSCO to see whether all expected genes are present for your taxonomic lineage (or which version contains more of them). Another option is to perform functional annotation with stringent criteria (to reduce random hits) and test statistically whether one version captures more genes.

As a side bonus, you can run single-genome annotations with MOSGA (with BRAKER), with and without a masking tool, and then, in a second step, run MOSGA's gene comparison (in the comparative genomics mode) on these annotations. That will give you a measure of how similar the independent gene predictions are.
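For a quick first look before running a full comparison, a rough sketch like the following can count which gene coordinates the two annotation files share. It is not MOSGA's method. It assumes tab-separated GFF/GTF lines with `gene` in the third column, and it only counts exact coordinate matches, so genes that overlap with shifted boundaries are reported as different. The function names and the sample lines are made up for illustration.

```python
from typing import Iterable, Set, Tuple

GeneKey = Tuple[str, int, int, str]  # (sequence id, start, end, strand)


def gene_coordinates(lines: Iterable[str], feature: str = "gene") -> Set[GeneKey]:
    """Collect the coordinates of all features of one type from GFF/GTF lines."""
    genes = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue  # skip malformed lines and other feature types
        genes.add((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return genes


def compare_annotations(first: Set[GeneKey], second: Set[GeneKey]) -> dict:
    """Count genes with identical coordinates and genes unique to each set."""
    return {
        "shared": len(first & second),
        "only_first": len(first - second),
        "only_second": len(second - first),
    }


# Made-up sample lines standing in for the masked and unmasked annotations.
masked_lines = [
    "chr1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tg1",
    "chr1\tAUGUSTUS\tgene\t2000\t3500\t.\t-\t.\tg2",
]
unmasked_lines = [
    "chr1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tg1",
    "chr1\tAUGUSTUS\ttranscript\t100\t900\t.\t+\t.\tg1.t1",
    "chr1\tAUGUSTUS\tgene\t2000\t3500\t.\t-\t.\tg2",
    "chr1\tAUGUSTUS\tgene\t5000\t5600\t.\t+\t.\tg3",
]

result = compare_annotations(
    gene_coordinates(masked_lines), gene_coordinates(unmasked_lines)
)
print(result)
```

With real data, pass open file handles instead of the lists, for example `gene_coordinates(open("masked.gtf"))`, where the file name is a placeholder. The genes found only in the unmasked set are the ones worth checking against the repeat annotation.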

Usually, a masked genome only means that specific regions will be ignored. Depending on the random seed, BRAKER may produce similar but not identical results from run to run. Are you sure that your differences derive only from the masking? Did you try running BRAKER multiple times?