Hi everyone,
I have a file containing many GenBank records. I want to scan each GenBank record and, if the record contains certain keywords, print its accession number. I looked at an example and tried to modify it as follows:
#!/usr/bin/perl
use strict;
use warnings;
use BeginPerlBioinfo;
my $annotation;
my %fields;
my @genbank =();
my $locus = '';
my $accession = '';
my $reference = '';
my @features = ();
@genbank = get_file_data ('all_2.txt');
for my $line (@genbank){
    if($line =~/^LOCUS/){
        $line =~ s/^LOCUS\s*//;
        $locus = $line ;
        print $locus ;
    }elsif ($line =~/^ACCESSION/){
        $line =~ s/^ACCESSION\s*//;
        $accession = $line ;
    }elsif ($line =~/^REFERENCE/){
        $line =~ s/^REFERENCE\s*//m;
        $reference = $line;
        print $reference;
    }elsif ($line =~/^FEATURES/){
        %fields = parse_annotation($annotation);
        @features = parse_features($fields {'FEATURES'});
        foreach my $feature (@features) {
            my ($featurename) = ($feature =~ /^{5}(\S+)/);
            print $feature;
            if (($locus=~ /keywords/i) || ($reference=~ /keywords/i) || ($feature=~ /keywords/i)){
                print $accession;
            }
        }
    }
}
sub parse_features {
    my ($features) =@_;
    my (@features) = ();
    while ($features =~/^{5}\S.*\n(^{21}\S.*\n)*/gm){
        my $feature = $&;
        push (@features, $feature);
    }
    return @features;
}
exit;
I can print the locus and the accession number, but for the reference and feature parts I can only print the first line. I also get these errors:
Quantifier unexpected on zero-length expression in regex; marked by <-- HERE in m/ ^{5}(\S+)/ <-- HERE
Quantifier unexpected on zero-length expression in regex; marked by <-- HERE in /^{5}\S.*\n(^{21}\S.*\n)* <-- HERE
I know it is related to the regular expressions, but I do not know how to fix it, as I am just a beginner. How can I print the whole reference and feature sections instead of only their first lines? Thanks.
Use BioPerl! It already has excellent modules for parsing GenBank files that extract all the features for you, so you don't have to worry about low-level text parsing (which is always a good thing to avoid): http://www.bioperl.org/wiki/HOWTO:SeqIO
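To give an idea of what that could look like, here is a rough sketch using Bio::SeqIO. The sub names record_text and print_matching_accessions and the keyword "kinase" are made up for illustration, and I am assuming the file is plain GenBank flat-file format with BioPerl installed, so treat it as a starting point rather than a finished script.

```perl
use strict;
use warnings;

# Gather the searchable text of one record: its ID, its description,
# its REFERENCE entries and every feature qualifier value.
# $seq is expected to behave like the sequence objects returned by Bio::SeqIO.
sub record_text {
    my ($seq) = @_;
    my @parts = ($seq->display_id, $seq->desc);

    # REFERENCE blocks are exposed as annotation objects.
    for my $ref ($seq->annotation->get_Annotations('reference')) {
        push @parts, $ref->authors, $ref->title, $ref->location;
    }

    # Each feature carries a key (primary tag) and tag/value qualifiers.
    for my $feat ($seq->get_SeqFeatures) {
        push @parts, $feat->primary_tag;
        for my $tag ($feat->get_all_tags) {
            push @parts, $feat->get_tag_values($tag);
        }
    }

    # Some fields may be undefined for a given record; skip those.
    return join "\n", grep { defined } @parts;
}

# Print the accession of every record in $file whose text matches $pattern.
sub print_matching_accessions {
    my ($file, $pattern) = @_;
    require Bio::SeqIO;    # loaded on demand; needs BioPerl installed
    my $in = Bio::SeqIO->new(-file => $file, -format => 'genbank');
    while (my $seq = $in->next_seq) {
        print $seq->accession_number, "\n"
            if record_text($seq) =~ $pattern;
    }
}

# Runs only when a file name is given, e.g.: perl scan.pl all_2.txt
# "kinase" is a placeholder keyword (an assumption, not from the question).
print_matching_accessions($ARGV[0], qr/kinase/i) if @ARGV;
```

Because BioPerl hands back whole records, multi-line REFERENCE and FEATURES sections come through complete, which sidesteps the "only the first line" problem of reading the file line by line.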
The {n} notation says how many times to match the preceding character. However, the only thing before those quantifiers in your patterns is the ^ anchor, which is zero-length, hence the errors of that type. If you have some data to share that you're trying to match, I'm sure someone here will assist. In the meantime, try regex101.com, as it's a good place to hone your regex skills. BTW, avoid using $& as it's costly; use ${^MATCH} (with the /p modifier, Perl 5.10+) instead.
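As a hedged illustration of both points, here is a small self-contained sketch. I am assuming the patterns were meant to have a literal space before each quantifier (five spaces before a feature key, 21 spaces before a continuation line, as in the standard GenBank layout); the sample feature table below is invented for the demo.

```perl
use strict;
use warnings;
use 5.010;    # the /p modifier and ${^MATCH} need Perl 5.10 or later

# Split a FEATURES table into one string per feature.
# Assumed layout: a feature starts with exactly 5 spaces followed by its key;
# its continuation lines start with 21 spaces.
sub parse_features {
    my ($table) = @_;
    my @features;
    # " {5}" quantifies the space character, so the quantifier has something
    # to repeat. /p makes the matched text available in ${^MATCH}.
    while ($table =~ /^ {5}\S.*\n(?:^ {21}\S.*\n)*/gmp) {
        push @features, ${^MATCH};
    }
    return @features;
}

# Invented sample data, for illustration only.
my $table = <<'END';
     source          1..120
                     /organism="Homo sapiens"
     gene            10..90
                     /gene="ABC1"
END

my @features = parse_features($table);
for my $feature (@features) {
    # Capture the feature key that follows the 5 leading spaces.
    my ($name) = $feature =~ /^ {5}(\S+)/;
    print "$name\n";
}
```

Each element of @features holds a complete feature, continuation lines included, so a keyword search on it sees more than the first line.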