Howto transfer all 'product' qualifiers from a genbank file to another one
1
0
Entering edit mode
3.9 years ago
a99942 • 0

Hi,

I have 2 genbank files of almost identical genome assemblies with the same number of predicted genes with similar AA sequences, one genome per file:

file1: with assembly gaps but manually curated annotation, i.e. custom 'product' qualifiers

 CDS             complement(551913..552146)
                     /locus_tag="KLOVNAE_00475"
                     /inference="ab initio prediction:Prodigal:002006"
                     /codon_start=1
                     /product="Ubiquitin-like homologous superfamily
                     domain-containing protein"
                     ...

file2: a correct assembly with automated prokka annotation only, that has some other qualifiers I would like to keep in the final genbank file, i. e. 'gene', 'inference' ...

 CDS             complement(551913..552146)
                     /gene="vUbi_1"
                     /locus_tag="KLOVNAE_00475"
                     /inference="ab initio prediction:Prodigal:002006"
                     /inference="similar to AA sequence:UniProtKB:P16709"
                     /codon_start=1
                     /product="viral Ubiquitin"
                     ...

Therefore I would like to replace the 'product' qualifier of all CDS features in file2 with the respective ones from the manually curated file 1.

How could I achieve that eg. in Biopython? Hints/Help would be highly appreciated!

thx atb flo

genbank biopython • 1.0k views
ADD COMMENT
0
Entering edit mode

do both files use the same geneIDs ? (== can you link on gene name or such?)

ADD REPLY
0
Entering edit mode

all genes of both files have the same locus_tags: KLOVNAE_00001 ... 01000

I have included examples of the CDS features...

ADD REPLY
2
Entering edit mode
3.9 years ago
Joe 21k

This can be achieved with biopython I think, though may not be super straightforward. I don't have time at the moment to throw together actual code, but off the top of my head this is the approach you could use (assume the features correspond very well (identical order, locus tags etc).

from Bio import SeqIO

source_gb = SeqIO.read("genbank1.gbk", "genbank")   # Note: 'read' assumes each file isn't a multi-genbank else this gets more complicated
dest_gb = SeqIO.read("genbank2.gbk", "genbank")

#iterate the list of features in each genbank:
for source_feat, dest_feat in zip((s_feature for s_feature in source_gb.features), (d_feature for d_feature in dest.features)):
    #sub the relevant qualified from one record in to the other. This is where I forget the exact syntax, but something like:
    dest_feat.qualifiers['product'] = source_feat.qualifiers['product']

You might want to add extra checks such as source_feat.qualifiers['locus_tag'] == dest_feat.qualifiers['locus_tag'] or something.

ADD COMMENT
0
Entering edit mode

it worked after checking for the presence of 'product'! since there were 2 repeat regions annotated a crispr regions without the qualifier 'product'... many thx!

from Bio import SeqIO

source_gb = SeqIO.read("file1.gbk", "genbank")
dest_gb = SeqIO.read("file2.gbk", "genbank")

#iterate the list of features in each genbank:
for source_feat, dest_feat in zip((s_feature for s_feature in source_gb.features), (d_feature for d_feature in dest_gb.features)):
    if (source_feat.qualifiers.get("product")):
      dest_feat.qualifiers["product"] = source_feat.qualifiers["product"]

SeqIO.write(dest_gb, "dest.gbf", "genbank")
ADD REPLY
0
Entering edit mode

Great! Glad it worked :)

moved to an answer so you can go ahead and accept if you like.

ADD REPLY

Login before adding your answer.

Traffic: 2193 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6