Modifying a GenBank file
4.3 years ago
matt81rd ▴ 10

Hi, I am trying to search through a file for a specific list of words. If one of those words is found, I want to add a new line underneath it containing the phrase /colour = 1 (I don't want to remove the original word I am searching for).

An extract of the file for context and format:

> LOCUS       contig_2_pilon_pilon 5558986 bp    DNA     linear   BCT
> 16-JUN-2020 DEFINITION  Escherichia coli O157:H7 strain (270078)
> ACCESSION    VERSION KEYWORDS    . SOURCE      Escherichia coli 270078
> ORGANISM  Escherichia coli 270078
>             Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
>             Escherichia. COMMENT     Annotated using prokka 1.14.6 from
>             https://github.com/tseemann/prokka. FEATURES             Location/Qualifiers
>      source          1..5558986
>                      /organism="Escherichia coli 270078"
>                      /mol_type="genomic DNA"
>                      /strain="strain"
>                      /db_xref="taxon:562"
>      CDS             61523..61744
>                      /gene="pspD"
>                      /locus_tag="JCCJNNLA_00057"
>                      /inference="ab initio prediction:Prodigal:002006"
>                      /inference="similar to AA sequence:RefSeq:EG10779-MONOMER"
>                      /codon_start=1
>                      /transl_table=11
>                      /product="peripheral inner membrane heat-shock protein"
>                      /translation="MNTRWQQAGQKVKPGFKLAGKLVLLTALRYGPAGVAGWAIKSVA
>                      RRPLKMLLAVALEPLLSRAANKLAQRYKR"

Here is one of the lists of words I am looking for throughout the file:

regulation_list=["anti-repressor","anti-termination","antirepressor","antitermination","antiterminator","anti-terminator","cold-shock","cold shock","heat-shock","heat shock","regulation","regulator","regulatory","helicase","antibiotic resistance","repressor","zinc","sensor","dipeptidase","deacetylase","5-dehydrogenase","glucosamine kinase","glucosamine-kinase","dna-binding","dna binding","methylase","sulfurtransferase","acetyltransferase","control","ATP-binding","ATP binding","Cro","Ren protein","CII","inhibitor","activator","derepression","protein Sxy","sensing","sensor","Tir chaperone","Tir-cytoskeleton","Tir cytoskeleton","Tir protein","EspD"]

As you can see, that extract contains one of the phrases I am looking for, and I want to add a new line underneath it with the phrase /colour = 1

Any help would be great!

python genbank • 904 views

If there are not too many files to process, you can open them in a genome browser (Apollo, Artemis, GenomeView, ...) and change the colour of the feature there. Afterwards you can save the file again.

4.3 years ago
JC 13k

Perl solution:

#!/usr/bin/perl
use strict;
use warnings;

# Keywords to look for in /product qualifiers (matched case-insensitively).
my @regulation_list = ("anti-repressor","anti-termination","antirepressor","antitermination","antiterminator","anti-terminator","cold-shock","cold shock","heat-shock","heat shock","regulation","regulator","regulatory","helicase","antibiotic resistance","repressor","zinc","sensor","dipeptidase","deacetylase","5-dehydrogenase","glucosamine kinase","glucosamine-kinase","dna-binding","dna binding","methylase","sulfurtransferase","acetyltransferase","control","ATP-binding","ATP binding","Cro","Ren protein","CII","inhibitor","activator","derepression","protein Sxy","sensing","sensor","Tir chaperone","Tir-cytoskeleton","Tir cytoskeleton","Tir protein","EspD");

# Join the keywords into one alternation; quotemeta keeps each keyword literal.
my $list_regex = join "|", map { quotemeta } @regulation_list;

while (<>) {
    print;    # always echo the original line unchanged
    # On a /product line containing a keyword, add /colour with the same indentation.
    if (m|(\s+)/product=.*($list_regex)|i) {
        my $pre = $1;
        print "$pre/colour = 1\n";
    }
}

Testing it:

$ perl checkWords.pl < file.gbk
LOCUS       contig_2_pilon_pilon 5558986 bp    DNA     linear   BCT
16-JUN-2020 DEFINITION  Escherichia coli O157:H7 strain (270078)
ACCESSION    VERSION KEYWORDS    . SOURCE      Escherichia coli 270078
ORGANISM  Escherichia coli 270078
            Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
            Escherichia. COMMENT     Annotated using prokka 1.14.6 from
            https://github.com/tseemann/prokka. FEATURES             Location/Qualifiers
     source          1..5558986
                     /organism="Escherichia coli 270078"
                     /mol_type="genomic DNA"
                     /strain="strain"
                     /db_xref="taxon:562"
     CDS             61523..61744
                     /gene="pspD"
                     /locus_tag="JCCJNNLA_00057"
                     /inference="ab initio prediction:Prodigal:002006"
                     /inference="similar to AA sequence:RefSeq:EG10779-MONOMER"
                     /codon_start=1
                     /transl_table=11
                     /product="peripheral inner membrane heat-shock protein"
                     /colour = 1
                     /translation="MNTRWQQAGQKVKPGFKLAGKLVLLTALRYGPAGVAGWAIKSVA
                     RRPLKMLLAVALEPLLSRAANKLAQRYKR"
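
Since the question is tagged python, here is a rough Python sketch of the same idea. It is only an illustration, not a tested tool: the function name `add_colour`, the shortened keyword list, and the script name in the usage comment are all made up for this example. Like the Perl version it works on plain text lines, but it also assumes that a long /product value may wrap onto continuation lines and waits for the closing quote before adding the new qualifier.

```python
import re
import sys

# Shortened keyword list for illustration; substitute the full regulation_list.
regulation_list = ["heat-shock", "heat shock", "regulator", "repressor", "helicase"]

# One case-insensitive alternation; re.escape keeps every keyword literal.
KEYWORDS = re.compile("|".join(re.escape(w) for w in regulation_list), re.IGNORECASE)
# Captures the indentation in front of a /product qualifier.
PRODUCT = re.compile(r'^(\s+)/product="')


def add_colour(lines, tag="/colour = 1"):
    """Yield every input line, plus `tag` after each /product that contains a keyword."""
    lines = iter(lines)
    for line in lines:
        yield line
        m = PRODUCT.match(line)
        if not m:
            continue
        indent = m.group(1)
        value = line[m.end():]
        # Assumption: a wrapped value ends on the first line that closes the quote.
        while not value.rstrip().endswith('"'):
            nxt = next(lines, None)
            if nxt is None:
                break
            yield nxt
            value = value.rstrip("\n") + " " + nxt.strip()
        if KEYWORDS.search(value):
            yield f"{indent}{tag}\n"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python add_colour.py file.gbk > coloured.gbk
    with open(sys.argv[1]) as handle:
        sys.stdout.writelines(add_colour(handle))
```

One thing to watch with either version: the keywords are matched as case-insensitive substrings, so a short entry such as "Cro" also hits products like "microcin". If that is not wanted, wrapping each keyword in `\b` word boundaries is one option.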