Question

How to get from annotated sequences to UniProt IDs


3.9 years ago

robert.murphy ▴ 110

I have used the Braker2 pipeline to annotate my draft genomes and then converted the annotations into EMBL files with the sequences included. I am stuck on how to get from these predicted gene sequences to actual gene labels (for example, identifying a sequence as a DNA helicase) and then to UniProt IDs. I assume I need to do some form of functional annotation of these predicted gene sequences?

Currently my EMBL files look like this:

FH   Key             Location/Qualifiers
FH
FT   source          1..965
FT                   /mol_type="genomic DNA"
FT                   /organism="Termitomyces"
FT   gene            complement(1..206)
FT                   /locus_tag="XXX_LOCUS1061"
FT                   /note="ID:file_1_file_1_g478"
FT                   /note="source:AUGUSTUS"
FT   mRNA            complement(1..206)
FT                   /locus_tag="XXX_LOCUS1061"
FT                   /note="ID:file_1_file_1_g478.t1"
FT                   /note="source:AUGUSTUS"
FT   exon            complement(1..206)
FT                   /locus_tag="XXX_LOCUS1061"
FT                   /note="ID:exon-48328"
FT                   /note="source:AUGUSTUS"
FT   CDS             complement(<1..206)
FT                   /codon_start=1
FT                   /locus_tag="XXX_LOCUS1061"
FT                   /note="ID:cds-48331"
FT                   /note="source:AUGUSTUS"
FT                   /transl_table=1
FT   gene            298..965
FT                   /locus_tag="XXX_LOCUS1062"
FT                   /note="ID:file_1_file_1_g479"
FT                   /note="source:AUGUSTUS"
FT   mRNA            join(298..497,549..655,706..755,808..965)
FT                   /locus_tag="XXX_LOCUS1062"
FT                   /note="ID:file_1_file_1_g479.t1"
FT                   /note="source:AUGUSTUS"
FT   exon            298..497
FT                   /locus_tag="XXX_LOCUS1062"
FT                   /note="ID:exon-20562"
FT                   /note="source:AUGUSTUS"
FT   exon            549..655
FT                   /locus_tag="XXX_LOCUS1062"
FT                   /note="ID:exon-20563"
FT                   /note="source:AUGUSTUS"
FT   exon            706..755
FT                   /locus_tag="XXX_LOCUS1062"
FT                   /note="ID:exon-20564"
FT                   /note="source:AUGUSTUS"
FT   exon            808..965
FT                   /locus_tag="XXX_LOCUS1062"
FT                   /note="ID:exon-20565"
FT                   /note="source:AUGUSTUS"
FT   CDS             join(298..497,549..655,706..755,808..>965)
FT                   /codon_start=1
FT                   /locus_tag="XXX_LOCUS1062"
FT                   /note="ID:cds-20563"
FT                   /note="ID:cds-20564"
FT                   /note="ID:cds-20565"
FT                   /note="ID:cds-20566"
FT                   /note="source:AUGUSTUS"
FT                   /transl_table=1
FT   intron          498..548
FT                   /locus_tag="XXX_LOCUS1062"
FT                   /note="ID:intron-15475"
FT                   /note="source:AUGUSTUS"
FT   intron          656..705
FT                   /locus_tag="XXX_LOCUS1062"
FT                   /note="ID:intron-15476"
FT                   /note="source:AUGUSTUS"
FT   intron          756..807
FT                   /locus_tag="XXX_LOCUS1062"
FT                   /note="ID:intron-15477"
FT                   /note="source:AUGUSTUS"
XX
SQ   Sequence 965 BP; 213 A; 302 C; 219 G; 231 T; 0 other;
     GCCCCCTCGT TGCTAGCTTT GATGAGCTCA TGTGTGGTCT TGGGACCAAG ATCGACTGTA        60
     TCGCCATTGA CCGTGGTAGA GTCGATTGAG TCTGGGGCGG GAGCGGTGGG GTCTGCATCT       120
     GGCCCAAGGG GGTCCTCAGT GAAGGCGACG GTCTTTTTCT TTCTCTTCTT GAGGGACGGG       180
     TCGAACAGCG GTTCTTCTGA GGCCATCGTC GTGGTTGGTG ACTCCAAAAT GTGCGGGTGT       240
     GACGGCGGTG TGACGCTGGA TCGGACGCGT GACGGCATAC TTTAGGTGAT AACCACGATG       300
     GCGCCGTCCG AGTCGCTGGA CACGATCCTC AACCAAATCA CGACTTCCAA CAATGCTCAA       360
     GCCCTCAACC ACACTCTACG AACAAATCTT CCCAAGGAAT CGCGCGACAT TATCCTCGCA       420
     AGCACTCTTT CCAGCGGCCA GGACCCGTTG ACTGTGCTAG ACATGAGGGA GAACACTCTA       480
     GGAGTGCTGT GGATTCTGTG AGATTCAGTA CAGGTTTTTT TCAAACATGG TCCTGACTCC       540
     ACTCACAGTG CAGCGCGATT GACCTTGCAG ACAGCAACGC CACCGCCGTG GCCCCTTGTC       600
     CAAGAGTTTT GCCACACTTT TATTCCAGAG CATGCGCGCC TCGCTCCCGA TCGTAGTACG       660
     TTCCTCCAAA TCACTACCAT TACAAAAACT GAATACGGTT ACCAGTGACC GCCGTTGCAC       720
     GAGGGATCTC TGCATATGCC AATGCCTCGC CAAATGTAAT ACCGTCTCAC TTCTTTCATC       780
     CATCTAATCT GACACATCTC TTCCCAGCCG AAAGCCGCCA TCCTGCCCCT GTTCGACCTC       840
     ATACGGCGCT ATCCACCCAA TCTCTCCTAC CTTACCTCCA TCCACACTAT ATTCGCCCTC       900
     GTACGTCCCA CAACTCCCCA TCACATCCCA CTAACACCCC AAAACAGTCC TGCGTATCCA       960
     CCCAA                                                                   965

Any help would be appreciated.

annotation • 1.3k views


score 1 · Answer 1 · 2021-07-28


3.9 years ago

Tm ★ 1.1k

You can take the FASTA sequences of the genes predicted by the Braker2 pipeline and BLAST them against NCBI's nr protein database or the UniProt database to get UniProt IDs.
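Once the BLAST search has run, the hits still have to be turned into UniProt IDs. The following is a minimal sketch of that step, assuming the results were saved in the default 12-column tabular format (`-outfmt 6` in BLAST+) and that the subject IDs keep the UniProt FASTA style `sp|ACCESSION|ENTRY_NAME` (this depends on how the database was built). The function names, the e-value cutoff of 1e-5, and the example rows are illustrative assumptions, not part of the original answer.

```python
def uniprot_accession(sseqid):
    """Return the accession from an 'sp|ACC|NAME' or 'tr|ACC|NAME' subject ID.

    Any other ID style is returned unchanged.
    """
    parts = sseqid.split("|")
    if len(parts) >= 3 and parts[0] in ("sp", "tr"):
        return parts[1]
    return sseqid


def best_uniprot_hits(blast_lines, max_evalue=1e-5):
    """Map each query ID to the accession of its highest-bitscore hit.

    Assumes tab-separated columns in the default order:
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])
        if evalue > max_evalue:
            continue  # hit too weak under the assumed cutoff
        if query not in best or bitscore > best[query][1]:
            best[query] = (uniprot_accession(subject), bitscore)
    return {query: acc for query, (acc, _) in best.items()}


# Made-up rows for illustration only; the accessions are placeholders.
example_lines = [
    "g479.t1\tsp|P0FAKE1|EXMPL1_FAKE\t45.0\t200\t108\t2\t1\t200\t5\t204\t1e-40\t150.0",
    "g479.t1\ttr|A0FAKE2|EXMPL2_FAKE\t60.0\t200\t78\t2\t1\t200\t5\t204\t1e-60\t210.0",
    "g478.t1\tsp|P0FAKE3|EXMPL3_FAKE\t30.0\t50\t35\t0\t1\t50\t10\t59\t0.5\t25.0",
]
print(best_uniprot_hits(example_lines))
```

Queries missing from the returned dictionary are the ones with no hit under the chosen cutoff, which makes it easy to count how many predicted genes remain unlabelled.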



Should I BLAST just the CDS regions or the whole predicted gene? Either way I am getting a lot of "No significant similarity found" errors.

Reply • 3.9 years ago by robert.murphy ▴ 110


If you use only the CDS, you will get fewer sequences with no hits than with the total gene set, because the CDS are coding sequences. The complete gene set, however, includes coding as well as non-coding genes.
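To search with the CDS only, the exons listed in each `CDS` feature first have to be spliced together (and reverse-complemented for `complement(...)` locations); the result can then be translated and used as a protein query, for example with blastp. Below is a minimal sketch using only the standard library. It assumes the sequence has already been read into a plain string without spaces or position numbers, handles only the location styles shown in the question (`a..b`, `join(...)`, `complement(...)`, with optional `<` or `>` markers), starts at the first base (`/codon_start=1`), and uses the standard genetic code (`/transl_table=1`). The function names and the toy sequence are illustrative assumptions.

```python
import re

# Standard genetic code (transl_table=1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_cds(sequence, location):
    """Splice the 1-based, inclusive ranges of an EMBL location string."""
    ranges = re.findall(r"[<>]?(\d+)\.\.[<>]?(\d+)", location)
    spliced = "".join(sequence[int(start) - 1:int(end)] for start, end in ranges)
    if location.startswith("complement("):
        # Feature is on the reverse strand: reverse-complement the spliced span.
        spliced = spliced.translate(COMPLEMENT)[::-1]
    return spliced.upper()


def translate(cds):
    """Translate from the first base; unknown codons become 'X'.

    A trailing partial codon is ignored.
    """
    codons = (cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Toy example: two exons separated by a 6 bp intron.
toy = "ATGGCC" + "GTAAGT" + "AAATGA"
protein = translate(extract_cds(toy, "join(1..6,13..18)"))
print(protein)
```

For the file in the question, the same call would take the full 965 bp sequence and the location string copied from the `CDS` line, such as `join(298..497,549..655,706..755,808..>965)`. Note that both CDS features there are marked partial (`<` and `>`), so short or truncated proteins are one plausible reason for missing hits.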

Reply • 3.9 years ago by Tm ★ 1.1k
