Question

BioPerl retrieves joined sequence incorrectly

1

Entering edit mode

10.1 years ago

fr ▴ 220

Hello!

I am parsing a Genebank flat file from NCBI (the .gbff.gz found in here). To do this I'm using Bio::SeqIO and I am finding a problem in the very first CDS which looks like this

source          1..4480937
                     /organism="Alteromonas macleodii str. 'Deep ecotype'"
                     /mol_type="genomic DNA"
                     /strain="Deep ecotype"
                     /db_xref="taxon:314275"
gene            join(4480899..4480937,1..1626)
                     /locus_tag="MADE_000001020675"
CDS             join(4480899..4480937,1..1626)
                     /locus_tag="MADE_000001020675"
                     /inference="EXISTENCE: similar to AA
                     sequence:RefSeq:YP_006797084.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="chromosomal replication initiator protein DnaA"
                     /protein_id="AGV54147.1"
                     /db_xref="GI:543176245"
                     /translation="MSLWN...TLSS"

My concern is with this line:

CDS             join(4480899..4480937,1..1626)

However, when I look at the nucleotide sequence that is outputed by Bio::SeqIO, it is outputting ~4.5 million nucleotides whereas if you look at the highlighted area it clearly shows that the coding sequence has ~1600ish nucleotides.What am I doing wrong?

Below you can find a snippet of the code that I am using. The reason why I'm using

next unless (($feat->primary_tag eq "CDS") and $feat->has_tag("translation"));

is that I am only interested in getting protein sequences and their corresponding gene sequences.

Thank you for all help!

#!/usr/local/bin/perl
use strict;
use warnings;
use autodie;
use Data::Dump;
use File::Basename;
use Bio::SeqIO;
use Data::Dumper;

[$gbffpath and other non-relevant stuff hidden]

opendir(DIR, "$gbffpath");
my @files= grep {/\.gbff$/ } readdir(DIR);
closedir(DIR);

my $k=1;
my ($i,$j);
foreach my $f (@files) {
    $i=0;$j=0;
    my $in = Bio::SeqIO -> new( -file => "<$gbffpath\\$f", -format => 'genbank');

    while (my $seq = $in->next_seq) {
        for my $feat ($seq -> get_SeqFeatures) {
            next unless (($feat->primary_tag eq "CDS") and $feat->has_tag("translation"));
            if ($feat->primary_tag eq "CDS") {#use to print DNA

              print $geneout "\>";
              # construct fasta header based on information present for each CDS:
              print $geneout $feat->get_tag_values('locus_tag'),"\|" if ( $feat->has_tag('locus_tag') );    
              print $geneout $feat->get_tag_values('db_xref') ,"\|" if ( $feat->has_tag('db_xref') );    
              print $geneout $feat->get_tag_values('protein_id'),"\|" if ( $feat->has_tag('protein_id') );    
              print $geneout $feat->get_tag_values('gene'),"\|" if ( $feat->has_tag('gene') ); 
              print $geneout $feat->get_tag_values('product'),"\|" if ( $feat->has_tag('product') );
              print $protout $feat->get_tag_values('note'), if ($feat->has_tag('note') );

              # print sequence:
              print $geneout "\n",$feat->seq->seq,"\n";
              $i++;
              }
            }
        }
    }

    if (($i ne $j) or ($i==0) or ($j==0)){
      print "\t\t$f\t$i CDS\t$j prots\t\n" ;
      print "\t\t\>\>\>$f\t$i CDS\t$j prots\t\n";
    }
    else{
      print "Extracted $i\/$j CDS\/proteins from file $f ($k/".($#files+1).")\n";
    }
    $k++;
}

genome perl bioperl gene cds • 3.5k views

ADD COMMENT • link updated 3.7 years ago by Ram 44k • written 10.1 years ago by fr ▴ 220

0

Entering edit mode

It looks like the entire segment (1..4480937) is being printed. Hmmm, interesting!

ADD REPLY • link 3.7 years ago by Ram 44k

0

Entering edit mode

This is not interesting! it should work as I thought! Why can't computers do what I think? Why dear lord whyy?

(on topic, any suggestions? I'm very new to using this kind of things in Perl :( )

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 10.1 years ago by fr ▴ 220

0

Entering edit mode

The part where you go $feat->seq->seq looks kinda fishy to me, given that the feature has no seq component in the file format.

ADD REPLY • link 3.7 years ago by Ram 44k

0

Entering edit mode

Indeed! I confess that I am using an example that someone else gave me and I raised that question too. Nevertheless, I looked in the documentation of Bio::SeqIO in CPAN and they indicate that $feat->seq is the way to get a sequence. Do you have any suggestion?

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 10.1 years ago by fr ▴ 220

1

Entering edit mode

I think I can guess why. They might just be getting the MIN(ranges) and MAX(ranges) and going MIN -> MAX. In your case, unfortunately, that goes 1->END. Go for the spliced_seq() and join them.

Source: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

ADD REPLY • link 3.7 years ago by Ram 44k

0

Entering edit mode

Ah, nice it makes sense! I'll give it a try in a bit and will come back to let you know what I found! Thank you so much!

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 10.1 years ago by fr ▴ 220

1

Entering edit mode

Example: join(100..120, 130..140) might become 100..140 with their algo

join(END-10..END, 1..10) on the other hand, becomes 1..END

Again, this is just a guess, but try the spliced_seq(). It should definitely work.

ADD REPLY • link 3.7 years ago by Ram 44k

2

Entering edit mode

10.1 years ago

Istvan Albert 102k

Personally I suspect that the problem is that it lists the large values first (since it is a circular genome) instead of a natural, sorted order that most genomes would have.

You can use the readseq tool to perform the same task

readseq -format=fasta -feat=gene G.gb

In general I would urge a lot of caution when whipping up small perl scripts to these tasks, GeneBank files can be deceivingly complicated with the joins and complements.

ADD COMMENT • link updated 3.7 years ago by Ram 44k • written 10.1 years ago by Istvan Albert 102k

Ram · Accepted Answer · 2014-10-20

3

Entering edit mode

10.1 years ago

Ram 44k

Given how the seq() function is supposed to start from the beginning of the first range and go to the end of the last, you have an edge case here where your sequence fragments are at the ends. The method sees your 1..1626 and 4480899..4480937 and goes 1..4480937

I think you'd be better off using spliced_seq() and joining the result from that call.

Source: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

ADD COMMENT • link 3.7 years ago by Ram 44k

0

Entering edit mode

Using $feat->spliced_seq->seq instead of $feat->spliced_seq->seq solved the problem! Thanks!

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 10.1 years ago by fr ▴ 220

0

Entering edit mode

Great. Glad I could help. Istvan's solution is better, though. It comes from experience.

ADD REPLY • link 3.7 years ago by Ram 44k