I have a list of 100 Accession numbers of bacterial genomes. I want to retrieve the 16s rRNA sequence alone from their genbank record, in field "rRNA", product '16s ribosomal RNA' using PERL or MATLAB. Can you guide me to a book or tutorial for it?
Thank you!
For learning Perl in general you may look at books mentioned here or tutorial. Bioinformatics related books are old but can be useful as this one has a chapter on parsing Genbank data. In addition to this you may look at BioPerl documentation.
For your case I created small script where 16S rRNA sequences are extracted from a bacteria with accession 'NC_013961'. For it to work you will need to install module Bio::DB::GenBank with cpan.
use strict;
use warnings;
use Bio::DB::GenBank;
my $gb = Bio::DB::GenBank->new();
# get data by accession
my $seqio = $gb->get_Stream_by_acc('NC_013961');
my $seq = $seqio->next_seq;
foreach my $feat ( $seq->get_SeqFeatures() ) {
# skip tags that are not "rRNA"
next unless (($feat->primary_tag eq "rRNA") and $feat->has_tag("product"));
# print sequence only if rRNA product tag value includes '16S'
my @rRNA = $feat->get_tag_values('product');
if ( grep( /^16S/, @rRNA ) ) {
print $feat->seq->seq, "\n";
}
}