Question

Fixing Gene Models in Funannotate

3 months ago

SomeOne ▴ 240

Hi,

I have been annotating some fungal genomes with funannotate. I have de novo assembled genomes and related RNA-seq data, which were used together.

So far I have completed the following steps of the funannotate pipeline for "Genome and RNA sequencing data":

funannotate train (using the RNA-seq datasets)
funannotate predict
funannotate update (to add UTRs and refine predictions)

After the update step completed, the log file reported that 52 gene models need to be fixed, followed by a list that looks like this:

NP02_012546 Feature overlapped by 2 identical-length genes but has no cross-reference
NP02_012547 Feature overlapped by 2 identical-length genes but has no cross-reference

NP02_scf_1: Feature overlapped by 2 identical-length genes but has no cross-reference
NP02_scf_2: Feature overlapped by 2 identical-length genes but has no cross-reference

Here NP02_01254* are the locus tags and NP02_scf_* are the FASTA headers in my assembly.
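To pull the flagged identifiers out of the log for later inspection, something like the following could be used. It is only a sketch: it assumes every flagged line has the shape shown in the excerpt above (an identifier, an optional colon, then the message), and the name `flagged_ids` is made up for illustration.

```python
import re

# Assumed log layout, taken from the excerpt above:
#   "NP02_012546 Feature overlapped by 2 identical-length genes ..."
#   "NP02_scf_1: Feature overlapped by 2 identical-length genes ..."
FLAGGED = re.compile(r"^(\S+?)(:?)\s+Feature overlapped by \d+ identical-length genes")


def flagged_ids(log_lines):
    """Return (locus_tags, contigs) found in the flagged log lines.

    Assumption: entries followed by a colon are contig names and the
    others are locus tags, as in the excerpt.
    """
    locus_tags, contigs = [], []
    for line in log_lines:
        match = FLAGGED.match(line.strip())
        if not match:
            continue  # not one of the "identical-length genes" messages
        target = contigs if match.group(2) else locus_tags
        target.append(match.group(1))
    return locus_tags, contigs
```

The returned locus tags are the ones to look up in the .tbl file.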

I looked into the .tbl file. For example, searching for NP02_012546 gives only these five hits:

733438  735242  gene
            locus_tag   NP02_012546
733438  735242  mRNA
            product hypothetical protein
            transcript_id   gnl|ncbi|NP02_012546-T1_mrna
            protein_id  gnl|ncbi|NP02_012546-T1
733736  734863  CDS
            codon_start 1
            product hypothetical protein
            transcript_id   gnl|ncbi|NP02_012546-T1_mrna
            protein_id  gnl|ncbi|NP02_012546-T1

and NP02_012547 gives nine hits:

733438  735242  gene
            locus_tag   NP02_012547
733438  733583  mRNA
733647  733665
733757  735242
            product hypothetical protein
            transcript_id   gnl|ncbi|NP02_012547-T1_mrna
            protein_id  gnl|ncbi|NP02_012547-T1
733537  733583  CDS
733647  733665
733757  734863
            codon_start 1
            product hypothetical protein
            transcript_id   gnl|ncbi|NP02_012547-T1_mrna
            protein_id  gnl|ncbi|NP02_012547-T1
733438  733665  mRNA
733757  735242
            product hypothetical protein
            transcript_id   gnl|ncbi|NP02_012547-T2_mrna
            protein_id  gnl|ncbi|NP02_012547-T2
733793  734863  CDS
            codon_start 1
            product hypothetical protein
            transcript_id   gnl|ncbi|NP02_012547-T2_mrna
            protein_id  gnl|ncbi|NP02_012547-T2

I do not understand what I am supposed to fix here. I am getting multiple transcript hits, which I think is because I am using three different RNA-seq datasets: infection assay, fungal 25-degree, and fungal 37-degree (four replicates each).

Kindly guide me in solving this problem. Thank you.

genomics funannotate fungus annotation • 712 views

score 1 · Accepted Answer · 2025-04-01

3 months ago

lieven.sterck 15k

Well, it is making you aware that you have two separate "genes" that span the exact same positions on the genome. In the basic form, this is not possible. You can have isoforms (multiple mRNAs from the same gene locus), but you can't have two distinct genes (in name, that is!) at the exact same genomic position.

So you either annotate them as isoforms of the same gene locus (and thus remove one gene entry from the annotation file but keep the mRNAs of both), if that is possible, or you simply remove one of the two genes together with its accompanying mRNAs (perhaps one of them is a clear false-positive (over)prediction).

It likely has nothing to do with your input RNA-seq data per se. I think it is rather a problem of your gene annotation software, which either is not capable of recognizing (or predicting) isoforms or failed to prioritize one gene model over the other.
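To see which locus tags share a span, a small script can group the gene features of the .tbl file by their coordinates. This is a rough sketch, not funannotate functionality: the function name `find_identical_span_genes` is hypothetical, and it assumes that each `gene` line is followed by its `locus_tag` qualifier and that contigs are introduced by `>Feature` lines.

```python
from collections import defaultdict


def find_identical_span_genes(tbl_lines):
    """Map (contig, start, stop) to the locus tags of genes sharing that span."""
    spans = defaultdict(list)
    contig = None
    pending = None  # span of the latest gene line still waiting for its locus_tag
    for line in tbl_lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith(">Feature"):
            # Contig header, e.g. ">Feature <contig name>"
            contig = fields[1] if len(fields) > 1 else None
            pending = None
        elif len(fields) == 3 and fields[2] == "gene":
            # "<" and ">" mark partial features in .tbl coordinates
            start = int(fields[0].lstrip("<>"))
            stop = int(fields[1].lstrip("<>"))
            pending = (contig, start, stop)
        elif fields[0] == "locus_tag" and pending is not None and len(fields) > 1:
            spans[pending].append(fields[1])
            pending = None
    # Keep only spans claimed by more than one gene
    return {span: tags for span, tags in spans.items() if len(tags) > 1}
```

It can be called on an open file handle, for example `find_identical_span_genes(open("genome.tbl"))`, where the file name is a placeholder. Each key in the result is a span claimed by two or more genes, and the value lists their locus tags.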

Hi lieven.sterck. Thank you for your response. I just checked the coordinates of the genes, and they are two consecutive ones that have the same start and stop positions in the .tbl file.

For example:

733438  735242  **gene**
            locus_tag   NP02_012546
733438  735242  mRNA
            product hypothetical protein
            transcript_id   gnl|ncbi|NP02_012546-T1_mrna
            protein_id  gnl|ncbi|NP02_012546-T1
733736  734863  CDS
            codon_start 1
            product hypothetical protein
            transcript_id   gnl|ncbi|NP02_012546-T1_mrna
            protein_id  gnl|ncbi|NP02_012546-T1

733438  735242  **gene**
            locus_tag   NP02_012547
733438  733583  mRNA
733647  733665
733757  735242
            product hypothetical protein
            transcript_id   gnl|ncbi|NP02_012547-T1_mrna
            protein_id  gnl|ncbi|NP02_012547-T1
733537  733583  CDS
733647  733665
733757  734863
            codon_start 1
            product hypothetical protein
            transcript_id   gnl|ncbi|NP02_012547-T1_mrna
            protein_id  gnl|ncbi|NP02_012547-T1
733438  733665  mRNA
733757  735242
            product hypothetical protein
            transcript_id   gnl|ncbi|NP02_012547-T2_mrna
            protein_id  gnl|ncbi|NP02_012547-T2
733793  734863  CDS
            codon_start 1
            product hypothetical protein
            transcript_id   gnl|ncbi|NP02_012547-T2_mrna
            protein_id  gnl|ncbi|NP02_012547-T2

From this example, can you suggest how I should merge them? These annotations are important and I don't want to mess them up.

3 months ago by SomeOne ▴ 240

Yes, exactly, that is the issue: you can't have that in a 'normal' gene annotation (though I know of cases where it is biologically possible but the formats can't handle it; that's a side note ;) ).

As said: you either delete one of the two (the one that seems least likely to be correct) or you slightly modify the file so that they are recognized as isoform transcripts of the same gene (usually by removing one of the two gene entries but keeping all mRNA entries and pointing them to the same parent).

[Sorry, I'm not familiar enough with the tbl format to know this off the top of my head.]
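For the isoform route, here is one way the edit could be scripted; treat it as an untested-on-real-data sketch and work on a copy of the file. It rests on several assumptions: the two genes have the identical span, the gene to keep comes first in the file, transcript names follow the `<locus_tag>-T<n>` pattern seen in the excerpt, and the mRNA/CDS features get attached to a gene by position (as far as I can tell the .tbl excerpt has no explicit parent field). `merge_gene_into` and `_starts_feature` are made-up names.

```python
import re

COORD = re.compile(r"^[<>]?\d+$")


def _starts_feature(line):
    """True for a contig header or any line that begins with two coordinates."""
    fields = line.split()
    if not fields:
        return False
    if fields[0].startswith(">Feature"):
        return True
    return len(fields) >= 2 and bool(COORD.match(fields[0])) and bool(COORD.match(fields[1]))


def merge_gene_into(tbl_text, keep_tag, drop_tag):
    """Remove drop_tag's gene feature and renumber its transcripts under keep_tag."""
    lines = tbl_text.splitlines()
    # Transcripts already on the kept gene; new isoform numbers continue after them.
    n_keep = len(set(re.findall(rf"{re.escape(keep_tag)}-T(\d+)_mrna", tbl_text)))
    out, i = [], 0
    while i < len(lines):
        fields = lines[i].split()
        if len(fields) == 3 and fields[2] == "gene":
            j = i + 1
            # The gene's qualifier lines run until the next feature or header.
            while j < len(lines) and not _starts_feature(lines[j]):
                j += 1
            block = lines[i:j]
            if any(l.split()[:2] == ["locus_tag", drop_tag] for l in block):
                i = j  # skip the whole gene block of the dropped locus tag
                continue
        out.append(lines[i])
        i += 1
    text = "\n".join(out)
    if tbl_text.endswith("\n"):
        text += "\n"

    def renumber(match):
        return f"{keep_tag}-T{int(match.group(1)) + n_keep}"

    # Rewrites both transcript_id and protein_id values of the dropped gene.
    return re.sub(rf"{re.escape(drop_tag)}-T(\d+)", renumber, text)
```

With `keep_tag="NP02_012546"` and `drop_tag="NP02_012547"`, the second gene line disappears and its two transcripts become NP02_012546-T2 and NP02_012546-T3. The result should be validated again before it is used for anything downstream.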

You could take the protein/CDS sequences of both 'genes', BLAST them against nr_prot or nr_dna, and see which one makes more sense, then chuck the other one out of the annotation.
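Before BLASTing, it may help to see how the competing models differ. The sketch below sums the CDS intervals per `protein_id` straight from the .tbl lines; `cds_lengths` is a hypothetical helper, and it assumes the layout of the excerpts in this thread (a feature line, optional continuation lines with two coordinates, then qualifiers including `protein_id`).

```python
import re

COORD = re.compile(r"^[<>]?(\d+)$")


def cds_lengths(tbl_lines):
    """Return {protein_id: total CDS length in nt} for a .tbl excerpt."""
    lengths = {}
    in_cds = False
    current = 0
    for line in tbl_lines:
        fields = line.split()
        first = COORD.match(fields[0]) if len(fields) >= 2 else None
        second = COORD.match(fields[1]) if len(fields) >= 2 else None
        if first and second:
            if len(fields) >= 3:
                # A feature key is present, so a new feature starts here.
                in_cds = fields[2] == "CDS"
                current = 0
            if in_cds:
                start, stop = int(first.group(1)), int(second.group(1))
                current += abs(stop - start) + 1  # abs() covers minus-strand intervals
        elif in_cds and len(fields) >= 2 and fields[0] == "protein_id":
            lengths[fields[1]] = current
            in_cds = False
    return lengths
```

For the three transcripts posted in this thread it gives 1128 nt for NP02_012546-T1, 1173 nt for NP02_012547-T1 and 1071 nt for NP02_012547-T2. All three end at 734863 and differ only at the 5' end, which points towards alternative models of one locus rather than two separate genes.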