BioPerl retrieves joined sequence incorrectly
2
1
Entering edit mode
10.1 years ago
fr ▴ 220

Hello!

I am parsing a Genebank flat file from NCBI (the .gbff.gz found in here). To do this I'm using Bio::SeqIO and I am finding a problem in the very first CDS which looks like this

source          1..4480937
                     /organism="Alteromonas macleodii str. 'Deep ecotype'"
                     /mol_type="genomic DNA"
                     /strain="Deep ecotype"
                     /db_xref="taxon:314275"
gene            join(4480899..4480937,1..1626)
                     /locus_tag="MADE_000001020675"
CDS             join(4480899..4480937,1..1626)
                     /locus_tag="MADE_000001020675"
                     /inference="EXISTENCE: similar to AA
                     sequence:RefSeq:YP_006797084.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="chromosomal replication initiator protein DnaA"
                     /protein_id="AGV54147.1"
                     /db_xref="GI:543176245"
                     /translation="MSLWN...TLSS"

My concern is with this line:

CDS             join(4480899..4480937,1..1626)

However, when I look at the nucleotide sequence that is outputed by Bio::SeqIO, it is outputting ~4.5 million nucleotides whereas if you look at the highlighted area it clearly shows that the coding sequence has ~1600ish nucleotides.What am I doing wrong?

Below you can find a snippet of the code that I am using. The reason why I'm using

next unless (($feat->primary_tag eq "CDS") and $feat->has_tag("translation"));

is that I am only interested in getting protein sequences and their corresponding gene sequences.

Thank you for all help!

#!/usr/local/bin/perl
use strict;
use warnings;
use autodie;
use Data::Dump;
use File::Basename;
use Bio::SeqIO;
use Data::Dumper;

[$gbffpath and other non-relevant stuff hidden]

opendir(DIR, "$gbffpath");
my @files= grep {/\.gbff$/ } readdir(DIR);
closedir(DIR);

my $k=1;
my ($i,$j);
foreach my $f (@files) {
    $i=0;$j=0;
    my $in = Bio::SeqIO -> new( -file => "<$gbffpath\\$f", -format => 'genbank');

    while (my $seq = $in->next_seq) {
        for my $feat ($seq -> get_SeqFeatures) {
            next unless (($feat->primary_tag eq "CDS") and $feat->has_tag("translation"));
            if ($feat->primary_tag eq "CDS") {#use to print DNA

              print $geneout "\>";
              # construct fasta header based on information present for each CDS:
              print $geneout $feat->get_tag_values('locus_tag'),"\|" if ( $feat->has_tag('locus_tag') );    
              print $geneout $feat->get_tag_values('db_xref') ,"\|" if ( $feat->has_tag('db_xref') );    
              print $geneout $feat->get_tag_values('protein_id'),"\|" if ( $feat->has_tag('protein_id') );    
              print $geneout $feat->get_tag_values('gene'),"\|" if ( $feat->has_tag('gene') ); 
              print $geneout $feat->get_tag_values('product'),"\|" if ( $feat->has_tag('product') );
              print $protout $feat->get_tag_values('note'), if ($feat->has_tag('note') );

              # print sequence:
              print $geneout "\n",$feat->seq->seq,"\n";
              $i++;
              }
            }
        }
    }

    if (($i ne $j) or ($i==0) or ($j==0)){
      print "\t\t$f\t$i CDS\t$j prots\t\n" ;
      print "\t\t\>\>\>$f\t$i CDS\t$j prots\t\n";
    }
    else{
      print "Extracted $i\/$j CDS\/proteins from file $f ($k/".($#files+1).")\n";
    }
    $k++;
}
genome perl bioperl gene cds • 3.5k views
ADD COMMENT
0
Entering edit mode

It looks like the entire segment (1..4480937) is being printed. Hmmm, interesting!

ADD REPLY
0
Entering edit mode

This is not interesting! it should work as I thought! Why can't computers do what I think? Why dear lord whyy?

(on topic, any suggestions? I'm very new to using this kind of things in Perl :( )

ADD REPLY
0
Entering edit mode

The part where you go $feat->seq->seq looks kinda fishy to me, given that the feature has no seq component in the file format.

ADD REPLY
0
Entering edit mode

Indeed! I confess that I am using an example that someone else gave me and I raised that question too. Nevertheless, I looked in the documentation of Bio::SeqIO in CPAN and they indicate that $feat->seq is the way to get a sequence. Do you have any suggestion?

ADD REPLY
1
Entering edit mode

I think I can guess why. They might just be getting the MIN(ranges) and MAX(ranges) and going MIN -> MAX. In your case, unfortunately, that goes 1->END. Go for the spliced_seq() and join them.

Source: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

ADD REPLY
0
Entering edit mode

Ah, nice it makes sense! I'll give it a try in a bit and will come back to let you know what I found! Thank you so much!

ADD REPLY
1
Entering edit mode

Example: join(100..120, 130..140) might become 100..140 with their algo

join(END-10..END, 1..10) on the other hand, becomes 1..END

Again, this is just a guess, but try the spliced_seq(). It should definitely work.

ADD REPLY
3
Entering edit mode
10.1 years ago
Ram 44k

Given how the seq() function is supposed to start from the beginning of the first range and go to the end of the last, you have an edge case here where your sequence fragments are at the ends. The method sees your 1..1626 and 4480899..4480937 and goes 1..4480937

I think you'd be better off using spliced_seq() and joining the result from that call.

Source: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

ADD COMMENT
0
Entering edit mode

Using $feat->spliced_seq->seq instead of $feat->spliced_seq->seq solved the problem! Thanks!

ADD REPLY
0
Entering edit mode

Great. Glad I could help. Istvan's solution is better, though. It comes from experience.

ADD REPLY
2
Entering edit mode
10.1 years ago

Personally I suspect that the problem is that it lists the large values first (since it is a circular genome) instead of a natural, sorted order that most genomes would have.

You can use the readseq tool to perform the same task

readseq -format=fasta -feat=gene G.gb

In general I would urge a lot of caution when whipping up small perl scripts to these tasks, GeneBank files can be deceivingly complicated with the joins and complements.

ADD COMMENT

Login before adding your answer.

Traffic: 2278 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6