Hello! I'm trying to write a script that retrieves data from GenBank files. I only need the info until the COMMENT part of the annotation.
This is my INPUT:
LOCUS mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
sequence 1..19524 reannotated via EnsEMBL
ACCESSION primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION mitochondrion_genomeBDGP6.32
KEYWORDS .
SOURCE fruit fly
ORGANISM Drosophila melanogaster
Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
Drosophilidae; Drosophilinae.
COMMENT This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the
Ensembl or EnsemblGenomes web site, http://www.ensembl.org/ or
http://www.ensemblgenomes.org/ for more information.
The current script:
$genbank = <STDIN>;
chomp ($genbank);
open (READ, "<$genbank") or die;
@data = <READ>;
close READ;
$end= $#data;
for ($line= 0; $line<= $end; $line++){
if ($data[$rand] =~ /LOCUS/){
@annotation = (@annotation, $data[$line]);
until ($data[$line] =~ /COMMENT/){
$line++;
@annotation = (@annotation, $data[$line]);
}}}
print @annotation;
And its OUTPUT:
LOCUS mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
sequence 1..19524 reannotated via EnsEMBL
ACCESSION primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION mitochondrion_genomeBDGP6.32
KEYWORDS .
SOURCE fruit fly
ORGANISM Drosophila melanogaster
Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
Drosophilidae; Drosophilinae.
COMMENT This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the
As you can see, there's an issue with this method.
How can I modify the code so it retrieves the data but stops at COMMENT and doesn't retrieve the entire line with it?
The first line of all GenBank files starts with LOCUS and I suppose this can be used to write a better code (without the need of a regex match with LOCUS). I'm clueless on how it can be done though. I will really appreciate your input!!
cross posted: https://stackoverflow.com/questions/76405846