I used the BRAKER2 pipeline to annotate my draft genomes and then converted the annotations into EMBL files with the sequences included. I am stuck on how to get from these predicted gene sequences to actual gene labels (for example, identifying a sequence as a DNA helicase) and then to UniProt IDs. I assume I need to do some form of functional annotation of these predicted gene sequences?
Currently my EMBL files look like this:
FH Key Location/Qualifiers
FH
FT source 1..965
FT /mol_type="genomic DNA"
FT /organism="Termitomyces"
FT gene complement(1..206)
FT /locus_tag="XXX_LOCUS1061"
FT /note="ID:file_1_file_1_g478"
FT /note="source:AUGUSTUS"
FT mRNA complement(1..206)
FT /locus_tag="XXX_LOCUS1061"
FT /note="ID:file_1_file_1_g478.t1"
FT /note="source:AUGUSTUS"
FT exon complement(1..206)
FT /locus_tag="XXX_LOCUS1061"
FT /note="ID:exon-48328"
FT /note="source:AUGUSTUS"
FT CDS complement(<1..206)
FT /codon_start=1
FT /locus_tag="XXX_LOCUS1061"
FT /note="ID:cds-48331"
FT /note="source:AUGUSTUS"
FT /transl_table=1
FT gene 298..965
FT /locus_tag="XXX_LOCUS1062"
FT /note="ID:file_1_file_1_g479"
FT /note="source:AUGUSTUS"
FT mRNA join(298..497,549..655,706..755,808..965)
FT /locus_tag="XXX_LOCUS1062"
FT /note="ID:file_1_file_1_g479.t1"
FT /note="source:AUGUSTUS"
FT exon 298..497
FT /locus_tag="XXX_LOCUS1062"
FT /note="ID:exon-20562"
FT /note="source:AUGUSTUS"
FT exon 549..655
FT /locus_tag="XXX_LOCUS1062"
FT /note="ID:exon-20563"
FT /note="source:AUGUSTUS"
FT exon 706..755
FT /locus_tag="XXX_LOCUS1062"
FT /note="ID:exon-20564"
FT /note="source:AUGUSTUS"
FT exon 808..965
FT /locus_tag="XXX_LOCUS1062"
FT /note="ID:exon-20565"
FT /note="source:AUGUSTUS"
FT CDS join(298..497,549..655,706..755,808..>965)
FT /codon_start=1
FT /locus_tag="XXX_LOCUS1062"
FT /note="ID:cds-20563"
FT /note="ID:cds-20564"
FT /note="ID:cds-20565"
FT /note="ID:cds-20566"
FT /note="source:AUGUSTUS"
FT /transl_table=1
FT intron 498..548
FT /locus_tag="XXX_LOCUS1062"
FT /note="ID:intron-15475"
FT /note="source:AUGUSTUS"
FT intron 656..705
FT /locus_tag="XXX_LOCUS1062"
FT /note="ID:intron-15476"
FT /note="source:AUGUSTUS"
FT intron 756..807
FT /locus_tag="XXX_LOCUS1062"
FT /note="ID:intron-15477"
FT /note="source:AUGUSTUS"
XX
SQ Sequence 965 BP; 213 A; 302 C; 219 G; 231 T; 0 other;
GCCCCCTCGT TGCTAGCTTT GATGAGCTCA TGTGTGGTCT TGGGACCAAG ATCGACTGTA 60
TCGCCATTGA CCGTGGTAGA GTCGATTGAG TCTGGGGCGG GAGCGGTGGG GTCTGCATCT 120
GGCCCAAGGG GGTCCTCAGT GAAGGCGACG GTCTTTTTCT TTCTCTTCTT GAGGGACGGG 180
TCGAACAGCG GTTCTTCTGA GGCCATCGTC GTGGTTGGTG ACTCCAAAAT GTGCGGGTGT 240
GACGGCGGTG TGACGCTGGA TCGGACGCGT GACGGCATAC TTTAGGTGAT AACCACGATG 300
GCGCCGTCCG AGTCGCTGGA CACGATCCTC AACCAAATCA CGACTTCCAA CAATGCTCAA 360
GCCCTCAACC ACACTCTACG AACAAATCTT CCCAAGGAAT CGCGCGACAT TATCCTCGCA 420
AGCACTCTTT CCAGCGGCCA GGACCCGTTG ACTGTGCTAG ACATGAGGGA GAACACTCTA 480
GGAGTGCTGT GGATTCTGTG AGATTCAGTA CAGGTTTTTT TCAAACATGG TCCTGACTCC 540
ACTCACAGTG CAGCGCGATT GACCTTGCAG ACAGCAACGC CACCGCCGTG GCCCCTTGTC 600
CAAGAGTTTT GCCACACTTT TATTCCAGAG CATGCGCGCC TCGCTCCCGA TCGTAGTACG 660
TTCCTCCAAA TCACTACCAT TACAAAAACT GAATACGGTT ACCAGTGACC GCCGTTGCAC 720
GAGGGATCTC TGCATATGCC AATGCCTCGC CAAATGTAAT ACCGTCTCAC TTCTTTCATC 780
CATCTAATCT GACACATCTC TTCCCAGCCG AAAGCCGCCA TCCTGCCCCT GTTCGACCTC 840
ATACGGCGCT ATCCACCCAA TCTCTCCTAC CTTACCTCCA TCCACACTAT ATTCGCCCTC 900
GTACGTCCCA CAACTCCCCA TCACATCCCA CTAACACCCC AAAACAGTCC TGCGTATCCA 960
CCCAA 965
Any help would be appreciated.
Should I BLAST just the CDS regions or the whole predicted gene? Either way I keep getting a lot of "No significant similarity found" results.
If you search with only the CDS, you will get fewer sequences with no hits than if you search with the whole genes, because the CDS are the coding sequences. The complete gene set, however, will include coding as well as non-coding genes.
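To make the CDS-only approach concrete, here is a minimal sketch of how a CDS location such as `join(298..497,549..655,706..755,808..>965)` or `complement(<1..206)` could be spliced out of the record's sequence and translated, so that the protein rather than the raw gene region is what gets searched (for example with blastp against a UniProt/Swiss-Prot database). It uses only the Python standard library. The function names (`parse_location`, `extract_cds`, `translate`) and the toy sequence are made up for illustration and are not part of BRAKER2 or any EMBL tooling. It assumes the standard genetic code (`/transl_table=1`, as in the file above) and simple `join(...)` / `complement(...)` locations. A library such as Biopython would be the more robust choice for real data.

```python
import re
from itertools import product

# Standard genetic code (transl_table=1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def parse_location(location):
    """Return ([(start, end), ...], is_complement) for an EMBL location string.

    Coordinates are 1-based and inclusive. The partial markers '<' and '>'
    are ignored. Assumption: a complement, if present, wraps the whole
    location (complement(join(...))), not individual spans.
    """
    is_complement = location.startswith("complement(")
    spans = [(int(a), int(b))
             for a, b in re.findall(r"<?(\d+)\.\.>?(\d+)", location)]
    return spans, is_complement


def extract_cds(sequence, location):
    """Splice the exonic spans together, reverse-complementing if needed."""
    spans, is_complement = parse_location(location)
    spliced = "".join(sequence[start - 1:end] for start, end in spans)
    if is_complement:
        spliced = spliced.translate(COMPLEMENT)[::-1]
    return spliced


def translate(cds, codon_start=1):
    """Translate a CDS, honouring /codon_start; unknown codons become 'X'."""
    cds = cds.upper()[codon_start - 1:]
    codons = (cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Toy example (not the Termitomyces sequence): two exons around one intron.
toy_sequence = "ATGGCC" + "gtaagt" + "AAATAA" + "cc"
toy_cds = extract_cds(toy_sequence, "join(1..6,13..18)")
print(toy_cds)             # ATGGCCAAATAA
print(translate(toy_cds))  # MAK*

# The same gene on the reverse strand.
toy_reverse = "TTATTTGGCCAT"
print(translate(extract_cds(toy_reverse, "complement(1..12)")))  # MAK*
```

Note that both CDS features in the record above are partial (`<1` and `>965`), so their translations would be protein fragments. Short fragments like these are more likely to return no significant hits than complete proteins.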