You have two problems to solve here: fetching the sequences, and parsing them to extract the CDS features.
To deal with the second problem: install BioPerl, if you have not already done so. Then take a look at the SeqIO how-to. If you installed the accessory scripts, there's a handy utility named bp_extract_feature_seq, which you can run like this:
bp_extract_feature_seq -i NC_005213.gb --format genbank --feature=CDS -o NC_005213.fa
It will write a FASTA file containing the coding sequences of all CDS features.
You'll want to automate the process of fetching sequences by looping through the replicon accessions. Here's some sample code from the BioPerl tutorial which will fetch a sequence from RefSeq and write it in GenBank format:
#!/usr/bin/perl
use strict;
use Bio::Perl;
my $seq_object = get_sequence('refseq', "NC_005213");
write_sequence(">NC_005213.gb", 'genbank', $seq_object);
It should not be too hard to write a loop into that, using the NC_* accessions from your file.
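A minimal sketch of such a loop, reusing the same two Bio::Perl functions as above. The input file name accessions.txt and its layout (one accession per line) are assumptions for illustration; adjust them to match your own file. The eval is there only so that one failed download does not stop the whole run.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;

# Hypothetical input file: one accession per line (e.g. NC_005213).
my $list = @ARGV ? $ARGV[0] : 'accessions.txt';
open(my $fh, '<', $list) or die "Cannot open $list: $!";

while (my $acc = <$fh>) {
    $acc =~ s/^\s+|\s+$//g;          # trim whitespace and the newline
    next unless $acc =~ /^NC_/;      # skip blank lines and anything that is not NC_*

    # Fetch the record; trap errors so the loop carries on with the next accession.
    my $seq_object = eval { get_sequence('refseq', $acc) };
    unless ($seq_object) {
        warn "Could not fetch $acc, skipping\n";
        next;
    }

    # One GenBank file per accession, named after the accession.
    write_sequence(">$acc.gb", 'genbank', $seq_object);
}
close $fh;
```

Run it as `perl fetch_replicons.pl accessions.txt` (the script name is arbitrary). It needs network access, and whether the 'refseq' database lookup still works depends on your BioPerl version and the remote service.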
+1 Thanks! Just two more things:
1. Is there a function that does what bp_extract_feature_seq does? I would like to do all this from a Perl script, and it would be nicer to call a function instead of using system("bp_extract_feature_seq",..).
2. Can I extract only the CDS for a given locus tag (not all the CDSs in the replicon)?

I would take a look at the code in bp_extract_feature_seq; it's just a Perl script. I only mentioned it for convenience, but it should be easy to build something around SeqIO if you study the how-to.
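As a rough sketch of what building something around SeqIO could look like for a single locus tag: the script below reads a GenBank file, walks its features, and prints the matching CDS as FASTA. It is not taken from bp_extract_feature_seq, and it assumes the CDS features in your records carry a locus_tag qualifier; the command-line arguments are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

# Usage (illustrative): perl cds_by_locus_tag.pl NC_005213.gb SOME_LOCUS_TAG
my ($file, $wanted) = @ARGV;
die "Usage: $0 file.gb locus_tag\n" unless defined $wanted;

my $in  = Bio::SeqIO->new(-file => $file,     -format => 'genbank');
my $out = Bio::SeqIO->new(-fh   => \*STDOUT,  -format => 'fasta');

while (my $seq = $in->next_seq) {
    for my $feat ($seq->get_SeqFeatures) {
        next unless $feat->primary_tag eq 'CDS';
        next unless $feat->has_tag('locus_tag');   # assumption: CDSs are tagged this way

        my ($tag) = $feat->get_tag_values('locus_tag');
        next unless $tag eq $wanted;

        # spliced_seq handles joined locations and the reverse strand.
        my $cds = $feat->spliced_seq;
        $cds->display_id($tag);
        $out->write_seq($cds);
    }
}
```

Dropping the locus_tag test would write every CDS in the file, which is roughly what the bp_extract_feature_seq command above produces.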