Hello!
I am parsing a Genebank flat file from NCBI (the .gbff.gz found in here). To do this I'm using Bio::SeqIO and I am finding a problem in the very first CDS which looks like this
source 1..4480937
/organism="Alteromonas macleodii str. 'Deep ecotype'"
/mol_type="genomic DNA"
/strain="Deep ecotype"
/db_xref="taxon:314275"
gene join(4480899..4480937,1..1626)
/locus_tag="MADE_000001020675"
CDS join(4480899..4480937,1..1626)
/locus_tag="MADE_000001020675"
/inference="EXISTENCE: similar to AA
sequence:RefSeq:YP_006797084.1"
/note="Derived by automated computational analysis using
gene prediction method: Protein Homology."
/codon_start=1
/transl_table=11
/product="chromosomal replication initiator protein DnaA"
/protein_id="AGV54147.1"
/db_xref="GI:543176245"
/translation="MSLWN...TLSS"
My concern is with this line:
CDS join(4480899..4480937,1..1626)
However, when I look at the nucleotide sequence that is outputed by Bio::SeqIO, it is outputting ~4.5 million nucleotides whereas if you look at the highlighted area it clearly shows that the coding sequence has ~1600ish nucleotides.What am I doing wrong?
Below you can find a snippet of the code that I am using. The reason why I'm using
next unless (($feat->primary_tag eq "CDS") and $feat->has_tag("translation"));
is that I am only interested in getting protein sequences and their corresponding gene sequences.
Thank you for all help!
#!/usr/local/bin/perl
use strict;
use warnings;
use autodie;
use Data::Dump;
use File::Basename;
use Bio::SeqIO;
use Data::Dumper;
[$gbffpath and other non-relevant stuff hidden]
opendir(DIR, "$gbffpath");
my @files= grep {/\.gbff$/ } readdir(DIR);
closedir(DIR);
my $k=1;
my ($i,$j);
foreach my $f (@files) {
$i=0;$j=0;
my $in = Bio::SeqIO -> new( -file => "<$gbffpath\\$f", -format => 'genbank');
while (my $seq = $in->next_seq) {
for my $feat ($seq -> get_SeqFeatures) {
next unless (($feat->primary_tag eq "CDS") and $feat->has_tag("translation"));
if ($feat->primary_tag eq "CDS") {#use to print DNA
print $geneout "\>";
# construct fasta header based on information present for each CDS:
print $geneout $feat->get_tag_values('locus_tag'),"\|" if ( $feat->has_tag('locus_tag') );
print $geneout $feat->get_tag_values('db_xref') ,"\|" if ( $feat->has_tag('db_xref') );
print $geneout $feat->get_tag_values('protein_id'),"\|" if ( $feat->has_tag('protein_id') );
print $geneout $feat->get_tag_values('gene'),"\|" if ( $feat->has_tag('gene') );
print $geneout $feat->get_tag_values('product'),"\|" if ( $feat->has_tag('product') );
print $protout $feat->get_tag_values('note'), if ($feat->has_tag('note') );
# print sequence:
print $geneout "\n",$feat->seq->seq,"\n";
$i++;
}
}
}
}
if (($i ne $j) or ($i==0) or ($j==0)){
print "\t\t$f\t$i CDS\t$j prots\t\n" ;
print "\t\t\>\>\>$f\t$i CDS\t$j prots\t\n";
}
else{
print "Extracted $i\/$j CDS\/proteins from file $f ($k/".($#files+1).")\n";
}
$k++;
}
It looks like the entire segment (
1..4480937
) is being printed. Hmmm, interesting!This is not interesting! it should work as I thought! Why can't computers do what I think? Why dear lord whyy?
(on topic, any suggestions? I'm very new to using this kind of things in Perl :( )
The part where you go
$feat->seq->seq
looks kinda fishy to me, given that the feature has noseq
component in the file format.Indeed! I confess that I am using an example that someone else gave me and I raised that question too. Nevertheless, I looked in the documentation of
Bio::SeqIO
in CPAN and they indicate that$feat->seq
is the way to get a sequence. Do you have any suggestion?I think I can guess why. They might just be getting the
MIN(ranges)
andMAX(ranges)
and going MIN -> MAX. In your case, unfortunately, that goes 1->END. Go for thespliced_seq()
and join them.Source: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation
Ah, nice it makes sense! I'll give it a try in a bit and will come back to let you know what I found! Thank you so much!
Example:
join(100..120, 130..140)
might become100..140
with their algojoin(END-10..END, 1..10)
on the other hand, becomes1..END
Again, this is just a guess, but try the
spliced_seq()
. It should definitely work.