How To Scan Genbank Records And Extract Information From A File Using Perl
11.8 years ago
wendy • 0

Hi everyone,

I have a file containing many GenBank records. I want to scan each record and, if it contains certain keywords, print its accession number. I looked at an example and tried to modify it as follows:

#!/usr/bin/perl 
use strict;
use warnings;
use BeginPerlBioinfo;

my $annotation;
my %fields;
my @genbank =();
my $locus = '';
my $accession = '';
my $reference = '';
my @features = ();
@genbank = get_file_data ('all_2.txt');

for my $line (@genbank){
    if($line =~/^LOCUS/){
         $line =~ s/^LOCUS\s*//;
         $locus = $line ;
         print $locus ;
   }elsif ($line =~/^ACCESSION/){
         $line =~ s/^ACCESSION\s*//;
         $accession = $line ;
    }elsif ($line =~/^REFERENCE/){
         $line =~ s/^REFERENCE\s*//m;
         $reference = $line;
         print $reference;
   }elsif ($line =~/^FEATURES/){
         %fields = parse_annotation($annotation);
         @features = parse_features($fields {'FEATURES'});
  foreach my $feature (@features) {
         my ($featurename) = ($feature =~ /^{5}(\S+)/);
         print $feature;
  if ($locus=~ /keywords/i) || ($reference=~ /keywords/i) || ($feature=~ /keywords/i){
     print $accession;
      }
    } 
  }
}

sub parse_features {
my ($features) =@_;
my (@features) = ();
while ($features =~/^{5}\S.*\n(^{21}\S.*\n)*/gm){
  my $feature = $&;
push (@features, $feature);
  }
return @features;
}
exit;

I can print the locus and the accession number, but for the reference and the features I can only print the first line. I also get these errors:


Quantifier unexpected on zero-length expression in regex; marked by <-- HERE in m/^{5}(\S+)/ <-- HERE

Quantifier unexpected on zero-length expression in regex; marked by <-- HERE in m/^{5}\S.*\n(^{21}\S.*\n)*/ <-- HERE

I know it is related to the regular expressions, but I do not know how to fix it as I am just a beginner. How can I solve this so that I can print the whole reference and feature sections instead of just their first lines? Thanks.

genbank perl • 6.2k views

Use BioPerl! It already has excellent libraries for parsing GenBank files that extract all the features for you, so you don't have to worry about low-level text parsing (which is always a good thing to avoid): http://www.bioperl.org/wiki/HOWTO:SeqIO
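As a rough sketch of what that could look like, assuming BioPerl is installed: the script below reads every record with Bio::SeqIO and prints the accession number when a keyword appears in the description, a reference title, or any feature qualifier value. The keyword pattern and the helper names has_keyword and scan_genbank are made up for illustration. Which fields to search is also an assumption, so adjust it to your data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical keyword pattern; replace with the terms you are looking for.
my $keyword_re = qr/kinase|transporter/i;

# Return the number of text fragments that match the pattern (0 if none).
sub has_keyword {
    my ($re, @texts) = @_;
    return scalar(grep { defined($_) && $_ =~ $re } @texts);
}

# Print the accession of every record whose description, reference titles,
# or feature qualifier values mention a keyword.
sub scan_genbank {
    my ($file, $re) = @_;
    require Bio::SeqIO;    # loaded at call time; dies here if BioPerl is missing
    my $in = Bio::SeqIO->new(-file => $file, -format => 'genbank');
    while (my $seq = $in->next_seq) {
        my @texts = ($seq->desc);
        # REFERENCE blocks are exposed as annotation objects with a title.
        push @texts, map { $_->title }
            $seq->annotation->get_Annotations('reference');
        # Each feature carries its qualifiers as tag/value pairs.
        for my $feat ($seq->get_SeqFeatures) {
            for my $tag ($feat->get_all_tags) {
                push @texts, $feat->get_tag_values($tag);
            }
        }
        print $seq->accession_number, "\n" if has_keyword($re, @texts);
    }
}

# Usage: perl scan.pl all_2.txt
scan_genbank($ARGV[0], $keyword_re) if @ARGV;
```

Because the parser hands back whole multi-line references and features, the "only the first line" problem does not arise here.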


The {n} quantifier says how many times to match the preceding item. In your patterns it directly follows ^, which is an anchor that matches a position rather than a character, so there is nothing of non-zero length for it to repeat ("zero-length"); hence both errors. If you have some data to share that you're trying to match, I'm sure someone here will assist. In the meantime, try regex101.com, as it's a good place to hone your regex skills.

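BTW: avoid $& where you can. On Perl versions before 5.20 it slows down every regex match in the program. Use ${^MATCH} together with the /p modifier instead (available since Perl 5.10).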
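A minimal sketch of a likely fix, on the assumption that the patterns were meant to have a literal space before each quantifier (" {5}" and " {21}"). That would fit the usual GenBank feature-table layout, where feature keys are indented by 5 spaces and their continuation lines by 21. The sample feature text below is made up, and the sub is a reworked version of parse_features from the question using /p and ${^MATCH}.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical feature table, built with explicit indents so the
# column layout is unambiguous.
my $key_indent  = ' ' x 5;
my $qual_indent = ' ' x 21;
my $features_text =
      $key_indent  . "source          1..120\n"
    . $qual_indent . "/organism=\"Homo sapiens\"\n"
    . $key_indent  . "gene            10..90\n"
    . $qual_indent . "/gene=\"ABC1\"\n"
    . $qual_indent . "/note=\"example only\"\n";

# Split a feature table into one multi-line string per feature.
sub parse_features {
    my ($text) = @_;
    my @features;
    # ' {5}' repeats a literal space; '^{5}' would try to repeat the anchor.
    # /m lets ^ match at every line start; /p makes ${^MATCH} available.
    while ($text =~ /^ {5}\S.*\n(?:^ {21}\S.*\n)*/gmp) {
        push @features, ${^MATCH};
    }
    return @features;
}

for my $feature (parse_features($features_text)) {
    my ($name) = $feature =~ /^ {5}(\S+)/;   # feature key, e.g. "gene"
    my $lines  = ($feature =~ tr/\n//);      # count newlines in this feature
    print "$name: $lines line(s)\n";
}
```

Each element returned by parse_features holds the whole feature, including its continuation lines, so a keyword test against it sees more than the first line.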

11.8 years ago

As mentioned by Micheal, you should consider using BioPerl.

Here is some example code: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

Here is a chapter providing an intro to BioPerl: Perl Programming for Bioinformatics - Chapter 9

Chapter 10 of Beginning Perl for Bioinformatics also has relevant example code (although it does not use BioPerl).
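If you do want to stay with plain Perl, one possible approach (a sketch, not the book's code) is to read the file one record at a time rather than one line at a time, by setting the input record separator to the "//" line that ends each GenBank record. The whole annotation, including multi-line references and features, can then be searched in one go. The two records and the sub name matching_accessions are invented for the demo. A keyword phrase that wraps across two lines would still be missed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two tiny, made-up records standing in for a real multi-record file.
my $data = <<'END';
LOCUS       AB000001                  20 bp    DNA     linear   PRI 01-JAN-2000
DEFINITION  Hypothetical kinase gene, partial cds.
ACCESSION   AB000001
REFERENCE   1  (bases 1 to 20)
  TITLE     A made-up study of a kinase
FEATURES             Location/Qualifiers
     gene            1..20
                     /gene="KIN1"
ORIGIN
        1 acgtacgtac gtacgtacgt
//
LOCUS       AB000002                  20 bp    DNA     linear   PRI 01-JAN-2000
DEFINITION  Hypothetical ribosomal protein gene, partial cds.
ACCESSION   AB000002
FEATURES             Location/Qualifiers
     gene            1..20
                     /gene="RPX1"
ORIGIN
        1 ttttacgtac gtacgtaaaa
//
END

# Return the accession of every record whose annotation matches $re.
sub matching_accessions {
    my ($fh, $re) = @_;
    local $/ = "//\n";    # read one whole GenBank record per <$fh>
    my @hits;
    while (my $record = <$fh>) {
        # Drop the sequence so hits come from the annotation only.
        (my $annotation = $record) =~ s/^ORIGIN.*//ms;
        next unless $annotation =~ $re;
        # /m lets ^ match at each line start inside the record.
        push @hits, $1 if $annotation =~ /^ACCESSION\s+(\S+)/m;
    }
    return @hits;
}

# An in-memory filehandle keeps the demo self-contained; for a real
# file, open 'all_2.txt' instead.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
print "$_\n" for matching_accessions($fh, qr/kinase/i);
close $fh;
```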


+1 for the feature annotation link.
