Question

How to compare gene prediction

0

Entering edit mode

11 weeks ago

Acervo • 0

Hello! I am working on a gene prediction without evidence (protein, RNA-seq). I used three tools for this: BRAKER3, FUNANNOTATE and Helixer. How can I evaluate which prediction was better? Just by viewing the results and comparing them in JBrowse? Or are there metrics that support the decision?

I am a beginner and did not find any useful information in the research I did.

gene-prediction • 699 views

ADD COMMENT • link updated 10 weeks ago by lieven.sterck 15k • written 11 weeks ago by Acervo • 0

0

Entering edit mode

Before we start given technical (hands-on) advise: How would you consider "which prediction was better" ? on what level do you want to evaluate the predictions? ...

It's quite a daunting task for a beginner indeed. You could have a look at some publications of annotation tools. They often present why they are doing better than others and you can sort of use the some logic/approaches to compare your predictions.

ADD REPLY • link 10 weeks ago by lieven.sterck 15k

score 1 · Answer 1 · 2025-01-20

1

Entering edit mode

11 weeks ago

shelkmike ★ 1.5k

First, you can make FASTA files with proteins and analyse the proteins by BUSCO. Larger "C" ("completeness") means that an annotator missed less genes.
Besides missing genes, there may be an opposite problem: false genes. Sometimes an annotator calls a piece of noncoding DNA a gene. This problem is, as far as I know, harder to control for. One possible solution is to align all proteins to a large protein database (for example, NCBI nr or Uniref90) by DIAMOND and see what percentage of your proteins don't have matches with e-value <= 10^-5. A typical eukaryotic genome has less than 100,000 genes, hence the threshold of 10^-5.

ADD COMMENT • link 11 weeks ago by shelkmike ★ 1.5k

0

Entering edit mode

I will do it! Thank you so much for your help!!

ADD REPLY • link 11 weeks ago by Acervo • 0

0

Entering edit mode

I think the first part of your advice is on point. However, I don't see the logic with this part:

One possible solution is to align all proteins to a large protein database (for example, NCBI nr or Uniref90) by DIAMOND and see what percentage of your proteins don't have matches with e-value <= 10^-5. A typical eukaryotic genome has less than 100,000 genes, hence the threshold of 10^-5.

E-values are calculated based on the sequence number of the target database (nr or UniRef90, per your suggestion). As these databases differ in size (nr > UniRef90), by searching them we will get non-identical subsets of sequences clearing the 10^-5 threshold. It has nothing to do with the original genome size.

ADD REPLY • link 11 weeks ago by Mensur Dlakic ★ 29k

1

Entering edit mode

I choose an e-value such that it results in a reasonably small number of false matches. If the annotation produced less than 100,000 proteins and we align them to some reference database and set the e-value threshold to 10^-5, then (if we assume that e-values are calculated precisely), if I understand correctly, we guarantee that there will be less than 1 false match. This is the reason why I recommended the e-value threshold of 10^-5.

ADD REPLY • link 10 weeks ago by shelkmike ★ 1.5k

1

Entering edit mode

I kinda understand what you're going for but unfortunately your reasoning is off I'm afraid.

As Mensur Dlakic already indicated the E-value is a function of the database size and the query sequence length, so changing any of these two parameters will influence the resulting E-value. So it has nothing to do with the number of genes potentially in a genome.

Moreover, you will have a substantial amount of true/real genes that will never get to the 10^-5 E-values, simply due to the fact they are small genes/proteins.

In essence you're suggestion is fine, I would just not pin myself to such a threshold (and certainly not based on the reasoning used)

ADD REPLY • link 10 weeks ago by lieven.sterck 15k