Hi,
I have been annotating some fungal genomes with funannotate. I have de novo assembled genomes and related RNA-seq data, which were used together.
So far I have completed the following steps of the funannotate "Genome and RNA sequencing data" pipeline:
- Funannotate Train (using RNASeq datasets)
- Funannotate Predict
- Funannotate Update (to add UTRs and Refine predictions)
After the Update step completed, the log file reported that there are 52 gene models that need to be fixed, followed by a list that looks like this:
NP02_012546 Feature overlapped by 2 identical-length genes but has no cross-reference
NP02_012547 Feature overlapped by 2 identical-length genes but has no cross-reference
NP02_scf_1: Feature overlapped by 2 identical-length genes but has no cross-reference
NP02_scf_2: Feature overlapped by 2 identical-length genes but has no cross-reference
Here NP02_01254* are the locus tags and NP02_scf_* are the FASTA headers in my assembly.
I looked into the .tbl file. For example, searching for NP02_012546 gives only these five hits:
733438 735242 gene
      locus_tag NP02_012546
733438 735242 mRNA
      product hypothetical protein
      transcript_id gnl|ncbi|NP02_012546-T1_mrna
      protein_id gnl|ncbi|NP02_012546-T1
733736 734863 CDS
      codon_start 1
      product hypothetical protein
      transcript_id gnl|ncbi|NP02_012546-T1_mrna
      protein_id gnl|ncbi|NP02_012546-T1
and searching for NP02_012547 gives 9 hits:
733438 735242 gene
      locus_tag NP02_012547
733438 733583 mRNA
733647 733665
733757 735242
      product hypothetical protein
      transcript_id gnl|ncbi|NP02_012547-T1_mrna
      protein_id gnl|ncbi|NP02_012547-T1
733537 733583 CDS
733647 733665
733757 734863
      codon_start 1
      product hypothetical protein
      transcript_id gnl|ncbi|NP02_012547-T1_mrna
      protein_id gnl|ncbi|NP02_012547-T1
733438 733665 mRNA
733757 735242
      product hypothetical protein
      transcript_id gnl|ncbi|NP02_012547-T2_mrna
      protein_id gnl|ncbi|NP02_012547-T2
733793 734863 CDS
      codon_start 1
      product hypothetical protein
      transcript_id gnl|ncbi|NP02_012547-T2_mrna
      protein_id gnl|ncbi|NP02_012547-T2
I am unable to understand what I am supposed to fix here. I am getting multiple transcript hits, which I think is because I am using 3 different RNA-seq datasets (infection assay, fungal 25-degree and fungal 37-degree, 4 replicates each).
Kindly guide me in solving this problem. Thank you.
Hi lieven.sterck. Thank you for your response. I just checked the coordinates of the genes, and they are two consecutive ones which have the same start and stop positions in the .tbl file.
For example, NP02_012546 and NP02_012547 in the listing above both span 733438 to 735242.
From this example, can you suggest how I should merge them? These annotations are important and I don't want to mess them up.
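For reference, here is a minimal sketch of how gene entries sharing identical coordinates could be listed from a .tbl file. It is not part of funannotate; the function name find_identical_span_genes is made up for illustration, the contig name NP02_scf_1 in the sample is only a placeholder, and it assumes each gene line is followed by its locus_tag qualifier, as in the excerpts above.

```python
from collections import defaultdict


def _coord(token):
    """Return a coordinate as int, tolerating the partial markers '<' and '>'."""
    token = token.lstrip("<>")
    return int(token) if token.isdigit() else None


def find_identical_span_genes(tbl_text):
    """Map (contig, start, stop) to locus_tags for spans shared by more than one gene."""
    spans = defaultdict(list)
    contig = None
    pending = None  # span of the latest gene line, waiting for its locus_tag
    for line in tbl_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == ">Feature":
            contig = parts[1] if len(parts) > 1 else None
            pending = None
            continue
        a = _coord(parts[0])
        b = _coord(parts[1]) if len(parts) > 1 else None
        if a is not None and b is not None and len(parts) >= 3:
            # Feature line: remember the span only for gene features.
            pending = (min(a, b), max(a, b)) if parts[2] == "gene" else None
        elif parts[0] == "locus_tag" and pending is not None and len(parts) > 1:
            spans[(contig,) + pending].append(parts[1])
            pending = None
    return {span: tags for span, tags in spans.items() if len(tags) > 1}


# Sample input modelled on the excerpts above (contig name is a placeholder).
sample_tbl = (
    ">Feature NP02_scf_1\n"
    "733438\t735242\tgene\n"
    "\t\t\tlocus_tag\tNP02_012546\n"
    "733438\t735242\tmRNA\n"
    "\t\t\tproduct\thypothetical protein\n"
    "733438\t735242\tgene\n"
    "\t\t\tlocus_tag\tNP02_012547\n"
    "733438\t733583\tmRNA\n"
    "733647\t733665\n"
    "733757\t735242\n"
)

duplicates = find_identical_span_genes(sample_tbl)
for span, tags in duplicates.items():
    print(span, tags)
```

Each reported group is a pair (or more) of gene entries that would need to be merged or reduced to one.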
Yes, exactly, that is the issue: you can't have two genes with identical coordinates in a 'normal' gene annotation (though, as a side note, I know of cases where it is biologically possible but the formats can't handle it ;) ).
As said: you either delete one of the two (the one that seems least likely to be correct) or you slightly modify the file so that they are recognized as isoform transcripts of the same gene (usually by removing one of the two gene entries but keeping all mRNA entries and pointing them to the same parent).
[Sorry, I am not familiar enough with the tbl format to know how to do that off the top of my head.]
You could take the protein/CDS sequences of both 'genes', BLAST them against nr_prot or nr_dna, see which one makes more sense, and chuck the other one out of the annotation.
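To make the second option (one gene entry, all transcripts pointing to the same parent) more concrete, here is a rough sketch in GFF3 terms rather than tbl. It assumes a GFF3 file where genes carry ID=<locus_tag> and mRNAs carry Parent=<locus_tag>; the function name merge_gene_into, the source column value and the example lines are made up for illustration, so check them against your own file before relying on this.

```python
def merge_gene_into(gff3_lines, drop_gene, keep_gene):
    """Drop the gene line of drop_gene and re-parent its mRNAs to keep_gene."""
    merged = []
    for line in gff3_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9:
            merged.append(line)  # comments and non-feature lines pass through
            continue
        attrs = cols[8].split(";")
        if cols[2] == "gene" and f"ID={drop_gene}" in attrs:
            continue  # remove the redundant gene entry
        # Point any child of the dropped gene at the gene that is kept.
        attrs = [
            f"Parent={keep_gene}" if attr == f"Parent={drop_gene}" else attr
            for attr in attrs
        ]
        cols[8] = ";".join(attrs)
        merged.append("\t".join(cols))
    return merged


# Hypothetical GFF3 lines modelled on the two genes discussed in this thread.
example = [
    "##gff-version 3",
    "NP02_scf_1\tsrc\tgene\t733438\t735242\t.\t+\t.\tID=NP02_012546",
    "NP02_scf_1\tsrc\tmRNA\t733438\t735242\t.\t+\t.\tID=NP02_012546-T1;Parent=NP02_012546",
    "NP02_scf_1\tsrc\texon\t733438\t735242\t.\t+\t.\tID=NP02_012546-T1.exon1;Parent=NP02_012546-T1",
    "NP02_scf_1\tsrc\tgene\t733438\t735242\t.\t+\t.\tID=NP02_012547",
    "NP02_scf_1\tsrc\tmRNA\t733438\t735242\t.\t+\t.\tID=NP02_012547-T1;Parent=NP02_012547",
]

merged = merge_gene_into(example, drop_gene="NP02_012546", keep_gene="NP02_012547")
for row in merged:
    print(row)
```

The sketch leaves transcript IDs untouched, so the re-parented mRNA keeps its old name (NP02_012546-T1 here); you would probably want to rename it to follow the kept gene's numbering and then re-validate the file.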
OK. Thank you for the assistance. :)