genbank parsing using perl
1
0
Entering edit mode
7.2 years ago

I have genbank file. and want to retrieve relevant information. the problem is when the code fetches data related to product, it considers only one line information. since the while loop read data line by line.

#! /usr/local/bin/perl -w

open (GB,"$ARGV[0]");
open (AC, ">$ARGV[1]");


print AC "Locus_Tag\tProtein_ID\tProduct\tDB_Xref\n";

#print "Locus\tLength\tSequence name\tGI\tOrganism\tGene\tCDS\tTranslation\n";
while (<GB>)
{
    if (/(\/locus_tag=")(.*)(")/)
    {
        print AC "\n$2\t";
    }
  elsif (/(\/protein_id=")(.*)(")/)
   {
      print AC "$2\t";
   }   
    #elsif (/(\/strain=")(.*)(")/)
    #{
    #  print AC "$2\t";
    # }
    elsif (/(\/product=")(.*?)(")/)
    {
        print AC "$2\t";
    }
    elsif (/(\/db_xref=")(.*)(")/)
    {
        print AC "$2";
    }
    # elsif (/(\/translation=")(.*)(")/)
    # {
    # print AC "$2\t";
    #}

    # elsif (/(\/db_xref="UniProtKB)(.*)(")/)
    # {
    #print "$3\t";
    #}
   else
   {
      # Skip this data
   }
}
close AC;


close GB;

and if the product is mention in multiple line like this.

gene            171455..172330
                     /locus_tag="A1S_0148"
     CDS             171455..172330
                     /locus_tag="A1S_0148"
                     /codon_start=1
                     /transl_table=11
                     /product="membrane-bound ATP synthase F0 sector, subunit
                     a"

how to retrieve multiline data.

genome parse genbank file • 3.3k views
ADD COMMENT
0
Entering edit mode
7.2 years ago
Michael 55k

Don't use a handwritten parser, use Bio::DB::Genbank instead.

The following code works to retrieve multi-line product tags, I cannot tell if it would have worked with the code-base 5 years ago though. In case somebody stumbles upon this.


#!/usr/bin/env perl
use strict;
use warnings;
use Bio::SeqIO;

my $in = Bio::SeqIO->new(-file=>"test.gbk", -format=>"genbank" );

while ( my $seq = $in->next_seq() ) {
       print "Sequence ",$seq->id, "\n";      
       map {print $_->get_tag_values("product"),"\n" if $_->has_tag('product')}
         $seq->get_all_SeqFeatures();        
}

Input:

LOCUS       AF177870     3123 bp    DNA             INV       31-OCT-1999
DEFINITION  Caenorhabditis remanei putative PP2C protein phosphatase FEM-2
            (fem-2) gene, complete cds.
ACCESSION   AF177870
NID         g6164828
VERSION     AF177870.1  GI:6164828
KEYWORDS    .
SOURCE      Caenorhabditis remanei.
  ORGANISM  Caenorhabditis remanei
            Eukaryota; Metazoa; Nematoda; Secernentea; Rhabditia; Rhabditida;
            Rhabditina; Rhabditoidea; Rhabditidae; Peloderinae; Caenorhabditis.
REFERENCE   1  (bases 1 to 3123)
  AUTHORS   Stothard,P.M., Hansen,D. and Pilgrim,D.
  TITLE     Isolation of PP2C sequences using degenerate-oligo PCR
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 3123)
  AUTHORS   Stothard,P.M., Hansen,D. and Pilgrim,D.
  TITLE     Direct Submission
  JOURNAL   Submitted (17-AUG-1999) Biological Sciences, University of Alberta,
            Edmonton, AB T6G-2E9, Canada
FEATURES             Location/Qualifiers
     source          1..3123
                     /organism="Caenorhabditis remanei"
                     /db_xref="taxon:31234"
     mRNA            join(<265..402,673..781,911..1007,1088..1215,
                     1377..1573,1866..2146,2306..2634,2683..>2855)
                     /gene="fem-2"
                     /product="putative FEM-2 protein 
                     phosphatase type 2C"
     gene            <265..>2855
                     /gene="fem-2"
     CDS             join(265..402,673..781,911..1007,1088..1215,1377..1573,
                     1866..2146,2306..2634,2683..2855)
                     /gene="fem-2"
                     /note="possible sex-determining protein"
                     /codon_start=1
                     /product="putative PP2C protein 
                     phosphatase FEM-2"
                     /protein_id="AAF04557.1"
                     /db_xref="PID:g6164829"
                     /db_xref="GI:6164829"
                     /translation="MSDSLNHPSSSTVHADDGFEPPTSPEDNNKKPSLEQIKQEREAL
                     FTDLFADRRRSARSVIEEAFQNELMSAEPVQPNVPNPHSIPIRFRHQPVAGPAHDVFG
                     DAVHSIFQKIMSRGVNADYSHWMSYWIALGIDKKTQMNYHMKPFCKDTYATEGSLEAK
                     QTFTDKIRSAVEEIIWKSAEYCDILSEKWTGIHVSADQLKGQRNKQEDRFVAYPNGQY
                     MNRGQSDISLLAVFDGHGGHECSQYAAAHFWEAWSDAQHHHSQDMKLDELLEKALETL
                     DERMTVRSVRESWKGGTTAVCCAVDLNTNQIAFAWLGDSPGYIMSNLEFRKFTTEHSP
                     SDPEECRRVEEVGGQIFVIGGELRVNGVLNLTRALGDVPGRPMISNKPDTLLKTIEPA
                     DYLVLLACDGISDVFNTSDLYNLVQAFVNEYDVEDYHELARYICNQAVSAGSADNVTV
                     VIGFLRPPEDVWRVMKTDSDDEESELEEEDDNE"
ORIGIN      
        1 gaacgcgaat gcctctctct ctttcgatgg gtatgccaat tgtccacatt cactcgtgtt

//

Output:

% perl test_genbank.pl
Sequence AF177870
putative FEM-2 protein phosphatase type 2C
putative PP2C protein phosphatase FEM-2
ADD COMMENT
0
Entering edit mode

i want to use this program. Just wanted to get complete product name but could not fetch with this regex.

ADD REPLY
0
Entering edit mode

Genbank is a notoriously difficult format to handle computationally. BioPerl has all of the tools you need to retrieve the product/feature fully. I don't know perl well, but it will probably take about 10 lines of code.

ADD REPLY
0
Entering edit mode

But my code works well with the single line data of genbank file. Anyways, thanks for reply.

ADD REPLY
0
Entering edit mode

Your code does not work correctly and causes problems or the format does, as you have experienced, therefore there is no good reason of using it. 'I want' is not a good argument in this case. If you want to look how to parse it correctly the only thing I could recommend is to look into the code of the perl package.

ADD REPLY
1
Entering edit mode

This was an unhelpful comment if there ever was one. Github indicates that the bioperl converter has had many bugs patched since the time this comment was posted and still fails to properly handle chromosomes with repeated genes nearly six years later (which is how I came across this thread). Even the tools NCBI uses to convert their genbanks to GFF work better than the bioperl script, a shame we cannot access them.

ADD REPLY
0
Entering edit mode

I agree that this answer wasn't really helpful and I apologize for not meeting the user's expectations and for any confusion caused. I should have at least provided a working code example and also should have asked for a reproducible input example. For the sake of completeness, I will add a code example to the original answer.

ADD REPLY
0
Entering edit mode

Also, it should have been mentioned that the reason for the OP's program not working as intended is that it parses the input statelessly in a line-by-line mode. That way, it is difficult to parse multi-line blocks, because a line will not contain a complete block. An option would be to read complete records into memory and then use regex modifier /m or /s to parse multi-line regex, e.g.: /(\/product=")(.*?)"/m. Another option would be to implement a simple state automaton and read ahead upon entering a multi-line block.

ADD REPLY

Login before adding your answer.

Traffic: 1663 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6