How to compare gene prediction
1
0
Entering edit mode
11 weeks ago
Acervo • 0

Hello! I am working on a gene prediction without evidence (protein, RNA-seq). I used three tools for this: BRAKER3, FUNANNOTATE and Helixer. How can I evaluate which prediction was better? Just by viewing the results and comparing them in JBrowse? Or are there metrics that support the decision?

I am a beginner and did not find any useful information in the research I did.

gene-prediction • 699 views
ADD COMMENT
0
Entering edit mode

Before we start given technical (hands-on) advise: How would you consider "which prediction was better" ? on what level do you want to evaluate the predictions? ...

It's quite a daunting task for a beginner indeed. You could have a look at some publications of annotation tools. They often present why they are doing better than others and you can sort of use the some logic/approaches to compare your predictions.

ADD REPLY
1
Entering edit mode
11 weeks ago
shelkmike ★ 1.5k

First, you can make FASTA files with proteins and analyse the proteins by BUSCO. Larger "C" ("completeness") means that an annotator missed less genes.
Besides missing genes, there may be an opposite problem: false genes. Sometimes an annotator calls a piece of noncoding DNA a gene. This problem is, as far as I know, harder to control for. One possible solution is to align all proteins to a large protein database (for example, NCBI nr or Uniref90) by DIAMOND and see what percentage of your proteins don't have matches with e-value <= 10^-5. A typical eukaryotic genome has less than 100,000 genes, hence the threshold of 10^-5.

ADD COMMENT
0
Entering edit mode

I will do it! Thank you so much for your help!!

ADD REPLY
0
Entering edit mode

I think the first part of your advice is on point. However, I don't see the logic with this part:

One possible solution is to align all proteins to a large protein database (for example, NCBI nr or Uniref90) by DIAMOND and see what percentage of your proteins don't have matches with e-value <= 10^-5. A typical eukaryotic genome has less than 100,000 genes, hence the threshold of 10^-5.

E-values are calculated based on the sequence number of the target database (nr or UniRef90, per your suggestion). As these databases differ in size (nr > UniRef90), by searching them we will get non-identical subsets of sequences clearing the 10^-5 threshold. It has nothing to do with the original genome size.

ADD REPLY
1
Entering edit mode

I choose an e-value such that it results in a reasonably small number of false matches. If the annotation produced less than 100,000 proteins and we align them to some reference database and set the e-value threshold to 10^-5, then (if we assume that e-values are calculated precisely), if I understand correctly, we guarantee that there will be less than 1 false match. This is the reason why I recommended the e-value threshold of 10^-5.

ADD REPLY
1
Entering edit mode

I kinda understand what you're going for but unfortunately your reasoning is off I'm afraid.

As Mensur Dlakic already indicated the E-value is a function of the database size and the query sequence length, so changing any of these two parameters will influence the resulting E-value. So it has nothing to do with the number of genes potentially in a genome.

Moreover, you will have a substantial amount of true/real genes that will never get to the 10^-5 E-values, simply due to the fact they are small genes/proteins.

In essence you're suggestion is fine, I would just not pin myself to such a threshold (and certainly not based on the reasoning used)

ADD REPLY

Login before adding your answer.

Traffic: 1538 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6