I am trying to annotate a genome for which I have a close reference. I annotated it with DFAST and ended up with a GenBank file like the one below. As you can see, the first CDS is annotated as a "hypothetical protein" and lacks a /gene qualifier, whereas the second CDS is annotated as "putative mobilization protein" and has been given a /gene name (BT_4758). I would like gene names for these hypothetical proteins, as they make up about a third of the genome and I know that some of them match the reference at 100% identity. I therefore used blastp to search all proteins from my new genome against the reference and created a feature table like the one below.
For each /locus_tag in the GenBank file, I would like to first check whether that feature already has a corresponding /gene. If /gene is present, do nothing. If not, look up the corresponding gene name in the feature table and add it to the GenBank file right after that /locus_tag. I have been looking for ways to do this, but with limited success. Any pointers would be great.
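The check-and-insert logic described above can be sketched in plain Python. This is only a minimal sketch under some assumptions: the file follows the standard GenBank flat-file layout (feature keys starting in column 6, qualifiers indented beneath them), the file has already been read into a list of lines without trailing newlines, and `gene_map` is a dictionary from locus tag to gene name. The names `add_missing_genes` and `gene_map` are hypothetical and chosen for illustration. A dedicated GenBank parser such as Biopython's SeqIO would likely be more robust than text matching.

```python
import re

# Assumption: standard GenBank layout, feature key starts in column 6.
FEATURE_START = re.compile(r"^ {5}\S")
LOCUS_TAG = re.compile(r'^(\s+)/locus_tag="([^"]+)"')
GENE = re.compile(r"^\s+/gene=")


def add_missing_genes(lines, gene_map):
    """Return a new list of lines in which every feature that has a
    /locus_tag but no /gene gets a /gene line right after the /locus_tag,
    provided the locus tag is a key of gene_map (hypothetical helper)."""
    out = []
    block = []  # lines of the feature currently being collected

    def flush():
        # Decide per feature: does it already carry a /gene qualifier?
        has_gene = any(GENE.match(ln) for ln in block)
        for ln in block:
            out.append(ln)
            m = LOCUS_TAG.match(ln)
            if m and not has_gene and m.group(2) in gene_map:
                indent, tag = m.groups()
                out.append(f'{indent}/gene="{gene_map[tag]}"')
        block.clear()

    for line in lines:
        if FEATURE_START.match(line):
            flush()  # a new feature begins, so finish the previous one
        block.append(line)
    flush()
    return out
```

Features that already have a /gene are passed through unchanged, and locus tags missing from `gene_map` are left without a /gene rather than raising an error.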
GenBank
CDS 3205047..3205778
/product="hypothetical protein"
/inference="COORDINATES:ab initio
prediction:Prodigal:2.6.3"
/inference="similar to AA sequence:RefSeq:WP_011109414.1"
/transl_table=11
/codon_start=1
/translation="MTKIFGIYPTDRQESITFLNRINTYLCRKLDNQWHCYKIKYSNAD
HESCIKKAIDSNAKFILFMGHGRSDCLFGSCNKKSQDFIAEDAVIENPEFYRNEHFIHS
DNISKFKGKIFFSLSCLSNRNDTKSLARSAINNGVISFVGFGDIPTDYIVGKNIPLKAI
AIYKGIISKVIKISISISIQNNYTVEEMVSLIKVLTTKEIQKIILSPYKNRHKEIIVKN
LFLFKQEIMIFGNRYERLLYE"
/locus_tag="LOCUS_23770"
/note="WP_011109414.1 hypothetical protein (Bacteroides
thetaiotaomicron VPI-5482) [pid:95.1%, q_cov:100.0%,
s_cov:100.0%, Eval:1.2e-130]"
/note="OrthoSearch:AAO79862.1 hypothetical protein
(Bacteroides thetaiotaomicron VPI-5482) [pid:95.1%,
q_cov:100.0%, s_cov:100.0%, Eval:2.6e-132, RBH]"
/note="Prodigal_2381"
CDS complement(3205909..3207444)
/product="putative mobilization protein"
/inference="COORDINATES:ab initio
prediction:Prodigal:2.6.3"
/inference="similar to AA sequence:INSD:AAO79863.1"
/transl_table=11
/codon_start=1
/translation="MQETRLMENEYSINLPTRFWYRKKEWKGWINVVNPFRASMILGTP
GSGKSYAVVNNYIKQAIEKSYALYIYDFKFDDLSVIAYNHLIKYRHRYKIPPKFYVINF
DNPRKSHRCNPLAPELMTDISDAYESSYTIMLNLNKSWVQKQGDFFVESPIVLFTAIIW
FLKIYEGGKYCTFPHAIELLNKRYEDVFTILTSYPDLENYLSPFIDAWKGGASEQLQGQ
IASAKIPLSRLISPQLYWVMSGSDFTLDINNPKEPKVLCVGNNPDRISIYGAALGLYNS
RIVKLINKKKQLKSCVIIDELPTIFFKGLDNLIATARSNKVAVVLGFQDFSQLKRDYGD
KEAAVIMSTVGNVFSGQVVGETAKTLSERFGKILQKRESMSINRNDTSTSISTQLDSLI
PASKISTLSQGMFVGAVTDNFGETIDQKVFHAQIVVDNDAVQKETTSYQPIPEISSFLD
ENGNDTMEQQIQANYQQIKQDIVELVENELIRIENDPELKHLLGGDEGARAQA"
/locus_tag="LOCUS_23780"
/gene="BT_4758"
/note="OrthoSearch:AAO79863.1 putative mobilization protein
(Bacteroides thetaiotaomicron VPI-5482) [pid:99.6%,
q_cov:100.0%, s_cov:100.0%, Eval:7.0e-299, RBH]"
/note="COG:COG3505:VirD4 Type IV secretory pathway, VirD4
component, TraG/TraD family ATPase [Category:U,
Aligned:90-394, Eval:5.6e-11, score:62.0, N-term missing]"
/note="Prodigal_2382"
Feature table (query is my new assembly; subject is the reference genome)
Query ID Subject ID
LOCUS_00010 BT_4578
LOCUS_00020 BT_4577
LOCUS_00030 BT_2429
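For the lookup, a table like this can be turned into a dictionary keyed by query ID. The following is a minimal sketch, assuming the table is whitespace- or tab-separated with exactly two fields per data row, as in the excerpt above. The function name `read_gene_map` is hypothetical, and keeping only the first row seen for each query is an assumption that matters only if the table contains several hits per protein.

```python
def read_gene_map(lines):
    """Build {query_id: subject_id} from two-column table lines
    (hypothetical helper; assumes one hit per row)."""
    gene_map = {}
    for line in lines:
        fields = line.split()
        # Skips blank lines and the "Query ID Subject ID" header,
        # which splits into four tokens rather than two.
        if len(fields) != 2:
            continue
        query, subject = fields
        # Assumption: if a query appears more than once, keep the first row.
        gene_map.setdefault(query, subject)
    return gene_map


table = [
    "Query ID Subject ID",
    "LOCUS_00010 BT_4578",
    "LOCUS_00020 BT_4577",
    "LOCUS_00030 BT_2429",
]
gene_map = read_gene_map(table)
```

With a real file, the list of lines could come from `open(path).read().splitlines()`, where `path` stands in for wherever the table was saved.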