Perl. How to retrieve data until a keyword in GenBank?
1
0
Entering edit mode
17 months ago

Hello! I'm trying to write a script that retrieves data from GenBank files. I only need the info until the COMMENT part of the annotation.

This is my INPUT:

LOCUS       mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION  Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
            sequence 1..19524 reannotated via EnsEMBL
ACCESSION   primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION     mitochondrion_genomeBDGP6.32
KEYWORDS    .
SOURCE      fruit fly
  ORGANISM  Drosophila melanogaster
            Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
            Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
            Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
            Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
            Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
            Drosophilidae; Drosophilinae.
COMMENT     This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the
            Ensembl or EnsemblGenomes web site, http://www.ensembl.org/ or
            http://www.ensemblgenomes.org/ for more information.

The current script:

$genbank = <STDIN>;
chomp ($genbank);
open (READ, "<$genbank") or die;
@data = <READ>;
close READ;

$end= $#data;
for ($line= 0; $line<= $end; $line++){
    if ($data[$rand] =~ /LOCUS/){
        @annotation = (@annotation, $data[$line]);
        until ($data[$line] =~ /COMMENT/){
            $line++;
            @annotation = (@annotation, $data[$line]);
}}}
print @annotation;

And its OUTPUT:

LOCUS       mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION  Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
            sequence 1..19524 reannotated via EnsEMBL
ACCESSION   primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION     mitochondrion_genomeBDGP6.32
KEYWORDS    .
SOURCE      fruit fly
  ORGANISM  Drosophila melanogaster
            Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
            Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
            Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
            Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
            Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
            Drosophilidae; Drosophilinae.
COMMENT     This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the

As you can see, there's an issue with this method.

How can I modify the code so it retrieves the data but stops at COMMENT and doesn't retrieve the entire line with it?

The first line of all GenBank files starts with LOCUS and I suppose this can be used to write a better code (without the need of a regex match with LOCUS). I'm clueless on how it can be done though. I will really appreciate your input!!

fasta sequence GenBank perl • 1.0k views
ADD COMMENT
1
Entering edit mode
ADD REPLY
2
Entering edit mode
17 months ago
JC 13k

The trick over here is to start capturing when you found "LOCUS" and stop when you found "COMMENT", something like:

#!/usr/bin/perl

use strict;
use warnings;

my $data = '';
my $rec = 0; # simple flag to know when to capture and stop
while (<>) {  # don't read the full file, you can just read line by line

    $rec = 1 if /(/^LOCUS/); # switch the flag on LOCUS 

    if (/^COMMENT/) {
        # turn off the switch, print and clean
        $rec = 0; 
        print $data; 
        $data = '';
    }

    $data .= $_ if ($rec == 1); # append data
}

Then you can use it as

perl parser.pl < genbank_file > output

ADD COMMENT
0
Entering edit mode

THANK YOU!! Your method works!!

And by removing:

$data = '';

I figured out how to retrieve the annotation info with a prompt:

print "Name your new file (.txt):\n";
$file = <STDIN>;
open (WRITE, ">$file") or die "$!\n";
print WRITE "$data";
close (WRITE);
close (READ);

This was very useful!

ADD REPLY

Login before adding your answer.

Traffic: 2500 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6