Thanks for any answers in advance. I have a set of ~1100 protein GIs from mitochondrial and plastid genomes, and am trying to use perl with eutilities to get the corresponding coding DNA sequences. My strategy has been to use elink -> efetch. Elink gives me the associated GI number for the coding DNA sequence, but have run into the problem that the GI links to the entire genome sequence, and not the particular coding region.
Posting an answer to my own question - figured it out after beating my head against the wall for a day or so.
Solution is to go through esummary after the elink. From the esummary the chromosome positions and coding strand can be parsed out. These features are then used in the efetch call to retrieve the DNA sequence. Here's my Perl script (sorry, not a BioPerl type of guy).
#!/usr/bin/env perl -w
use strict;
use LWP::Simple;
my $protein_id = "256427307";
my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $url = $base . "elink.fcgi?dbfrom=protein&db=gene&id=$protein_id";
my $output = get($url);
#print "$output";
my $linked_gi;
while ($output =~ /<LinkSetDb>(.*?)<\/LinkSetDb>/sg) {
my $linkset = $1;
if ($linkset =~ /<Id>(\d+)<\/Id>/sg) {
$linked_gi = $1;
}
#print "linked GI = $linked_gi\n";
}
my $url3 = $base . "esummary.fcgi?db=gene&id=$linked_gi";
my $docsum = get($url3);
#print "$docsum\n";
my $start;my$stop;my$strand=1;my$chr_ver;
while ($docsum =~ /<GenomicInfo>(.*?)<\/GenomicInfo>/sg) {
my $data= $1; print "$data";
if ($data =~ /<ChrAccVer>(.*)<\/ChrAccVer>/sg) {
$chr_ver =$1;
}
if ($data =~ /<ChrStart>(\d+?)<\/ChrStart>/sg) {
$start =$1;
}
if ($data =~ /<ChrStop>(\d+?)<\/ChrStop>/sg) {
$stop =$1;
}
$strand = 2 if $start > $stop;
}
print "$chr_ver:$start _ $stop\n";
my $url4 = $base."efetch.fcgi?db=nucleotide&id=$chr_ver&rettype=fasta&retmode=text&seq_start=$start&seq_stop=$stop&strand=$strand";
my $seq = get($url4);
print "$seq\n";
exit;
Many thanks for posting the solution, however does it extract the exact coding sequence for a particular protein?
For example, if we look at this protein. This is the exact coding sequence for that protein. But when I run your script, changing "protein_ID" to 635010873 (the GI number for the protein), it returns this, the full mRNA sequence for this region, but not the coding sequence for the protein of interest?
Basically, my aim is to obtain the longest canonical transcript for each gene. I can obtain a list of proteins and their lengths, and then I was thinking of extracting the longest protein's coding sequence to give me the longest canonical transcript. I've also asked similar questions here and here; if you could shed any light on the topic!
Many thanks for posting the solution, however does it extract the exact coding sequence for a particular protein?
For example, if we look at this protein. This is the exact coding sequence for that protein. But when I run your script, changing "protein_ID" to 635010873 (the GI number for the protein), it returns this, the full mRNA sequence for this region, but not the coding sequence for the protein of interest?
Basically, my aim is to obtain the longest canonical transcript for each gene. I can obtain a list of proteins and their lengths, and then I was thinking of extracting the longest protein's coding sequence to give me the longest canonical transcript. I've also asked similar questions here and here; if you could shed any light on the topic!