Question

genbank parsing using perl

0

Entering edit mode

7.2 years ago

sharmatina189059 ▴ 110

I have genbank file. and want to retrieve relevant information. the problem is when the code fetches data related to product, it considers only one line information. since the while loop read data line by line.

#! /usr/local/bin/perl -w

open (GB,"$ARGV[0]");
open (AC, ">$ARGV[1]");


print AC "Locus_Tag\tProtein_ID\tProduct\tDB_Xref\n";

#print "Locus\tLength\tSequence name\tGI\tOrganism\tGene\tCDS\tTranslation\n";
while (<GB>)
{
    if (/(\/locus_tag=")(.*)(")/)
    {
        print AC "\n$2\t";
    }
  elsif (/(\/protein_id=")(.*)(")/)
   {
      print AC "$2\t";
   }   
    #elsif (/(\/strain=")(.*)(")/)
    #{
    #  print AC "$2\t";
    # }
    elsif (/(\/product=")(.*?)(")/)
    {
        print AC "$2\t";
    }
    elsif (/(\/db_xref=")(.*)(")/)
    {
        print AC "$2";
    }
    # elsif (/(\/translation=")(.*)(")/)
    # {
    # print AC "$2\t";
    #}

    # elsif (/(\/db_xref="UniProtKB)(.*)(")/)
    # {
    #print "$3\t";
    #}
   else
   {
      # Skip this data
   }
}
close AC;


close GB;

and if the product is mention in multiple line like this.

gene            171455..172330
                     /locus_tag="A1S_0148"
     CDS             171455..172330
                     /locus_tag="A1S_0148"
                     /codon_start=1
                     /transl_table=11
                     /product="membrane-bound ATP synthase F0 sector, subunit
                     a"

how to retrieve multiline data.

genome parse genbank file • 3.3k views

ADD COMMENT • link updated 21 months ago by Michael 55k • written 7.2 years ago by sharmatina189059 ▴ 110

score 0 · Answer 1 · 2017-09-07

Don't use a handwritten parser, use Bio::DB::Genbank instead.

The following code works to retrieve multi-line product tags, I cannot tell if it would have worked with the code-base 5 years ago though. In case somebody stumbles upon this.

#!/usr/bin/env perl
use strict;
use warnings;
use Bio::SeqIO;

my $in = Bio::SeqIO->new(-file=>"test.gbk", -format=>"genbank" );

while ( my $seq = $in->next_seq() ) {
       print "Sequence ",$seq->id, "\n";      
       map {print $_->get_tag_values("product"),"\n" if $_->has_tag('product')}
         $seq->get_all_SeqFeatures();        
}

Input:

LOCUS       AF177870     3123 bp    DNA             INV       31-OCT-1999
DEFINITION  Caenorhabditis remanei putative PP2C protein phosphatase FEM-2
            (fem-2) gene, complete cds.
ACCESSION   AF177870
NID         g6164828
VERSION     AF177870.1  GI:6164828
KEYWORDS    .
SOURCE      Caenorhabditis remanei.
  ORGANISM  Caenorhabditis remanei
            Eukaryota; Metazoa; Nematoda; Secernentea; Rhabditia; Rhabditida;
            Rhabditina; Rhabditoidea; Rhabditidae; Peloderinae; Caenorhabditis.
REFERENCE   1  (bases 1 to 3123)
  AUTHORS   Stothard,P.M., Hansen,D. and Pilgrim,D.
  TITLE     Isolation of PP2C sequences using degenerate-oligo PCR
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 3123)
  AUTHORS   Stothard,P.M., Hansen,D. and Pilgrim,D.
  TITLE     Direct Submission
  JOURNAL   Submitted (17-AUG-1999) Biological Sciences, University of Alberta,
            Edmonton, AB T6G-2E9, Canada
FEATURES             Location/Qualifiers
     source          1..3123
                     /organism="Caenorhabditis remanei"
                     /db_xref="taxon:31234"
     mRNA            join(<265..402,673..781,911..1007,1088..1215,
                     1377..1573,1866..2146,2306..2634,2683..>2855)
                     /gene="fem-2"
                     /product="putative FEM-2 protein 
                     phosphatase type 2C"
     gene            <265..>2855
                     /gene="fem-2"
     CDS             join(265..402,673..781,911..1007,1088..1215,1377..1573,
                     1866..2146,2306..2634,2683..2855)
                     /gene="fem-2"
                     /note="possible sex-determining protein"
                     /codon_start=1
                     /product="putative PP2C protein 
                     phosphatase FEM-2"
                     /protein_id="AAF04557.1"
                     /db_xref="PID:g6164829"
                     /db_xref="GI:6164829"
                     /translation="MSDSLNHPSSSTVHADDGFEPPTSPEDNNKKPSLEQIKQEREAL
                     FTDLFADRRRSARSVIEEAFQNELMSAEPVQPNVPNPHSIPIRFRHQPVAGPAHDVFG
                     DAVHSIFQKIMSRGVNADYSHWMSYWIALGIDKKTQMNYHMKPFCKDTYATEGSLEAK
                     QTFTDKIRSAVEEIIWKSAEYCDILSEKWTGIHVSADQLKGQRNKQEDRFVAYPNGQY
                     MNRGQSDISLLAVFDGHGGHECSQYAAAHFWEAWSDAQHHHSQDMKLDELLEKALETL
                     DERMTVRSVRESWKGGTTAVCCAVDLNTNQIAFAWLGDSPGYIMSNLEFRKFTTEHSP
                     SDPEECRRVEEVGGQIFVIGGELRVNGVLNLTRALGDVPGRPMISNKPDTLLKTIEPA
                     DYLVLLACDGISDVFNTSDLYNLVQAFVNEYDVEDYHELARYICNQAVSAGSADNVTV
                     VIGFLRPPEDVWRVMKTDSDDEESELEEEDDNE"
ORIGIN      
        1 gaacgcgaat gcctctctct ctttcgatgg gtatgccaat tgtccacatt cactcgtgtt

//

Output:

% perl test_genbank.pl
Sequence AF177870
putative FEM-2 protein phosphatase type 2C
putative PP2C protein phosphatase FEM-2