Question

Howto transfer all 'product' qualifiers from a genbank file to another one

0

Entering edit mode

3.9 years ago

a99942 • 0

Hi,

I have 2 genbank files of almost identical genome assemblies with the same number of predicted genes with similar AA sequences, one genome per file:

file1: with assembly gaps but manually curated annotation, i.e. custom 'product' qualifiers

 CDS             complement(551913..552146)
                     /locus_tag="KLOVNAE_00475"
                     /inference="ab initio prediction:Prodigal:002006"
                     /codon_start=1
                     /product="Ubiquitin-like homologous superfamily
                     domain-containing protein"
                     ...

file2: a correct assembly with automated prokka annotation only, that has some other qualifiers I would like to keep in the final genbank file, i. e. 'gene', 'inference' ...

 CDS             complement(551913..552146)
                     /gene="vUbi_1"
                     /locus_tag="KLOVNAE_00475"
                     /inference="ab initio prediction:Prodigal:002006"
                     /inference="similar to AA sequence:UniProtKB:P16709"
                     /codon_start=1
                     /product="viral Ubiquitin"
                     ...

Therefore I would like to replace the 'product' qualifier of all CDS features in file2 with the respective ones from the manually curated file 1.

How could I achieve that eg. in Biopython? Hints/Help would be highly appreciated!

thx atb flo

genbank biopython • 1.0k views

ADD COMMENT • link 3.9 years ago by a99942 • 0

0

Entering edit mode

do both files use the same geneIDs ? (== can you link on gene name or such?)

ADD REPLY • link 3.9 years ago by lieven.sterck 15k

0

Entering edit mode

all genes of both files have the same locus_tags: KLOVNAE_00001 ... 01000

I have included examples of the CDS features...

ADD REPLY • link 3.9 years ago by a99942 • 0

score 2 · Accepted Answer · 2021-02-03

2

Entering edit mode

3.9 years ago

Joe 21k

This can be achieved with biopython I think, though may not be super straightforward. I don't have time at the moment to throw together actual code, but off the top of my head this is the approach you could use (assume the features correspond very well (identical order, locus tags etc).

from Bio import SeqIO

source_gb = SeqIO.read("genbank1.gbk", "genbank")   # Note: 'read' assumes each file isn't a multi-genbank else this gets more complicated
dest_gb = SeqIO.read("genbank2.gbk", "genbank")

#iterate the list of features in each genbank:
for source_feat, dest_feat in zip((s_feature for s_feature in source_gb.features), (d_feature for d_feature in dest.features)):
    #sub the relevant qualified from one record in to the other. This is where I forget the exact syntax, but something like:
    dest_feat.qualifiers['product'] = source_feat.qualifiers['product']

You might want to add extra checks such as source_feat.qualifiers['locus_tag'] == dest_feat.qualifiers['locus_tag'] or something.

ADD COMMENT • link 3.9 years ago by Joe 21k

0

Entering edit mode

it worked after checking for the presence of 'product'! since there were 2 repeat regions annotated a crispr regions without the qualifier 'product'... many thx!

from Bio import SeqIO

source_gb = SeqIO.read("file1.gbk", "genbank")
dest_gb = SeqIO.read("file2.gbk", "genbank")

#iterate the list of features in each genbank:
for source_feat, dest_feat in zip((s_feature for s_feature in source_gb.features), (d_feature for d_feature in dest_gb.features)):
    if (source_feat.qualifiers.get("product")):
      dest_feat.qualifiers["product"] = source_feat.qualifiers["product"]

SeqIO.write(dest_gb, "dest.gbf", "genbank")

ADD REPLY • link 3.9 years ago by a99942 • 0

0

Entering edit mode

Great! Glad it worked :)

moved to an answer so you can go ahead and accept if you like.

ADD REPLY • link 3.9 years ago by Joe 21k