How Can I Save Bioperl Sequence Nested Features In Genbank Or Embl Format?
3
4
Entering edit mode
14.7 years ago
Ryan Thompson ★ 3.6k

In BioPerl, a sequence object can have any number of features, and each of these can have subfeatures nested within them. For example, a feature may be a complete coding sequence of a gene, and its subfeatures might be individual exons that are concatenated to form the full coding sequence.

However, when I use BioPerl to write a sequence object to a file in genbank or embl format, only the top-level features are written to the file, not the sub-features nested within the top-level features. How can I store my subfeatures in sequence files? Should I just convert all my subfeatures into top-level features, and then reconstruct the tree structure next time I read in the sequence?

Example code:

#!/usr/bin/perl

use Bio::SeqIO;
use Bio::SeqFeature::Generic;

# Read in a sequence
my $sequence = Bio::SeqIO->newFh(
    -file => $ARGV[0],
)->getline();

# Add the main feature
my $main_feature = Bio::SeqFeature::Generic->new(
    -start => 1,
    -end => 100,
    -primary_tag => 'main_feature',
);
$sequence->add_SeqFeature($main_feature);

# Add some subfeatures
for my $i (1..10) {
    my $subfeature = Bio::SeqFeature::Generic->new(
        -start => ($i * 10 - 9),
        -end => ($i * 10),
        -primary_tag => "Sub-feature $i",
    );
    $main_feature->add_SeqFeature($subfeature);
}

# Write the sequence to the output file. Subfeatures are not saved.
Bio::SeqIO->new(
    -fh => *STDOUT{IO},
    -format => "genbank",
)->write_seq($sequence) or die "FAIL";

Run the example code on any sequence file. It will print a genbank sequence record to STDOUT. This record will contain "main_feature" but not any of the ten "Sub-feature $i". If you use something like Data::Dump to dump the perl structure corresponding to $sequence, you will see that it does contain the sub features. It simply does not output them.

Edit: I just realised that there's another aspect to my question. I'm trying to write two scripts: one to analyze the sequences and add the subfeatures, and a second script to read the sequences and report on the subfeatures. So an additional requirement is that I can easily read the resulting sequences back into BioPerl including the subfeatures.

bioperl sequence data • 6.9k views
ADD COMMENT
0
Entering edit mode

Creating a RichSeq object from scratch can be a little tricky. You're probably writing you subfeatures in a limited scope. But, without a piece of code (or the modules you're using) a correct answer will be difficult.

ADD REPLY
0
Entering edit mode

Ok, I'll post some example code.

ADD REPLY
0
Entering edit mode

What are all the features you are trying to print ? Can you please paste the full code to pastebin and provide a link ? I wont be able to test the code in its current format.

ADD REPLY
0
Entering edit mode

I posted a full snippet of example code. You just need an arbitrary sequence file to run it on.

ADD REPLY
0
Entering edit mode

Another question: are you trying to convert between file formats or do you really need to annotate raw sequences. For the first option, SeqIO do the whole job (subfeatures included). For the second case, you will need my complete non-trivial answer.

ADD REPLY
0
Entering edit mode

I'm trying to find a sequence format to use as intermediate data storage between an annotator script and a reporter script.

ADD REPLY
4
Entering edit mode
14.7 years ago
Neilfws 49k

I think the problem may be that whilst Bioperl has a concept of features + subfeatures, a GenBank file does not: anything with a location is simply a feature, regardless of whether it lies within another feature. Sub-features are created by Bioperl when "rich" sequences are read and only for specific tags (such as gene/CDS/exon).

In your code, if you change the loop which generates the "subfeatures" so as the last line is:

$sequence->add_SeqFeature($subfeature);

i.e. subfeatures are added to the sequence object, not the main feature object, then the subfeatures will appear in the output (as features).

Where a feature in a GenBank file contains several lines - those are annotations, not features.

ADD COMMENT
0
Entering edit mode

Ok, so is there a sequence format the does support subfeatures? Should I forget SeqIO and just use Data::Dump?

ADD REPLY
0
Entering edit mode

Also, this is the workaround I'm using right now.

ADD REPLY
0
Entering edit mode

I guess GFF is the closest to a representation of sub-features, with its "parent" attribute. There's a gff_string method in Bio::SeqFeature, so you'd just add "parent" + the ID of the main feature as a tag of the sub-feature.

ADD REPLY
0
Entering edit mode

asciitree, chadoxml e gff do support nested references/features. There is no parser for asciitree, chadoxml is well supported in bioperl. Biopython can read/write gff (up to gff3 I guess) and bioperl can write gff. Chadoxml seems to have a bug in somes *nix distributions. Tried asciitree and worked perfectly with your code.

ADD REPLY
0
Entering edit mode

I'll return to the flatten/unflatten problem. I've generated fake genbank records some time ago from simulated sequences.

ADD REPLY
0
Entering edit mode

feel free to accept this as best answer then :-)

ADD REPLY
0
Entering edit mode

Yeah. I did. I ended up implementing a custom "unflatten" subroutine in the second script.

ADD REPLY
2
Entering edit mode
14.7 years ago
Chris Fields ★ 2.2k

Neil is correct, GenBank/EMBL do not layer features (e.g. everything is top-level), and BioPerl makes no assumptions based on that: WYSIWYG. There are a few scripts bp_genbank2gff3.pl) that help along these lines, in that it helps convert to a format that allows hierarchal features, gene > mRNA > CDS/exon/UTR and so on. The caveat is there is enough variability between records that finding a consistent way to do this is still difficult.

ADD COMMENT
1
Entering edit mode
14.7 years ago

AFAIK you have to define the subfeatures explicitly in the script to generate the genbank or embl format. Have you referred to the Feature-Annotation HOWTO. If you can show the code, it will be better.

ADD COMMENT
0
Entering edit mode

What I mean is, I have a script that reads in some sequences, annotates them with features and subfeatures, and then writes the sequences to an output file. But when I do that, the subfeatures are lost.

ADD REPLY

Login before adding your answer.

Traffic: 1132 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6