Fetch Many Files With Accession Number, Output File Format Is Coding Sequences In Fasta Per
3
0
Entering edit mode
13.1 years ago

Hi all,

How do I fetch many many files according to accession numbers from Genbank. The output I would like is one file per accession numbers. The format of output file is Coding Sequences in Fasta. Any suggestion are highly appreciated.

Thanks, Shihai

• 2.9k views
ADD COMMENT
1
Entering edit mode
13.1 years ago
fransua ▴ 390

perhaps you can use biomart at ensembl: http://www.ensembl.org/biomart/martview/ or directly from biomart: http://www.biomart.org/

ADD COMMENT
0
Entering edit mode
13.1 years ago
Fabian Bull ★ 1.3k

If you are able to code have a look at: using Bioperl to retrieve multiple sequences

Just change the query variable accordingly.

ADD COMMENT
0
Entering edit mode
13.1 years ago
Woa ★ 2.9k

You can download the nucleic acid sequences in the GENBANK format using the following piece of code:Run through a for loop for many acc. nos. Then one by one parse the files to get the CDS.

use Bio::DB::GenBank;
use Bio::SeqIO;
use strict;

my $acc='NM_001003933';#example entry
my $format='genbank';
my $file_name=$acc.'.gbk';

my $seqout = new Bio::SeqIO( -file => ">$file_name", -format=>$format);
    my $getseq = new Bio::DB::GenBank;
    my $seq = $getseq->get_Seq_by_acc($acc);
    $seqout->write_seq($seq);

This piece of code can parse the GENBANK files to get the CDS in fasta format.However where CDS is not availble like in case of pseudogenes or rRNA etc this can crash. The scalar(@cds_features) will return zero then. Use it if you have non coding sequences

    use strict;
    use warnings;
    use Bio::SeqIO;
    my $gb_file="NM_001003933.gbk";
    my @cds_features = grep { $_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $gb_file)->next_seq->get_SeqFeatures;

    my @tags=qw/protein_id product translation/;
    my ($feat_object)=@cds_features;
    my ($id,$product,$translation)=map{$feat_object->get_tag_values($_);}@tags;

print ">",$id," ",$product,"\n",$translation,"\n";
ADD COMMENT
0
Entering edit mode

Thanks, the only problem is this give me proteins and what I need is DNA. Since at Genbank web interface I can get a file looks like:

lcl|CP003084.1cdsidAER04537.1 [gene=TIIST4400005] [protein=tyrosine] [proteinid=AER04537.1] [location=complement(250..1506)] GTGTCGGATCTGTTGGACGACCTGGAGTGGAGGGGCCTCGTCGCTGACTCCACCGACCGTGAGGCGTT lcl|CP003084.1cdsidAER04538.1 [gene=TIIST4400010] [protein=arginine repressor] [proteinid=AER04538.1] [location=complement(1590..2138)] ATGAGTGAGGGGCAGTGTGCACCGACCTCACGTACGGCGCGGCATGCCCGGATTCGCGTCTTAATTGA . . . How can I it from efetch directly in Genbank?

ADD REPLY

Login before adding your answer.

Traffic: 2687 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6