I have a list of protein blast hits in RefSeq ID format (e.g., WP_012146310.1) , and now I want to retrieve genome coordinates of genes encoding these proteins.
I am using Entrez Direct to cope with it. Having the RefSeq ID, I got gi (501096268) and genome accession number (NC_009832.1), but I cannot get the gene coordinates, although I know that the coordinates are presented in the nuccore database...
I will appreciate your suggestions on this issue. Thanks!
I can't think of a straight up 'single line of code' answer for you, hopefully someone else can.
What I would do is to download the genome genbank file, parse it using Biopython, and look for features (with info such as db_xref="GI:501096268") and grab the genomic coordinates from there. It might take a bit of reading about biopython (I'm not sure if you know python so there's that...).
If you need to do this for several genomes, just set up a script to download the genbank file for the genome first and then run the above.
I tried to grab genomic sequences from protein ID's many months ago which entailed querying NCBI for identical protein reports and grabbing the coordinates from there but it was very cumbersome work.
Sorry, that's the best I can think of. Best of luck!
To add onto the answer by @andrey.v.shubin I made a small script to try to query a large text file programmatically, if you have a file with a large list of protein IDs you can try and speed up NCBI for 50 protein IDs at a time
#!/usr/bin/env perl
use strict;
use warnings;
sub readNLines {
my $lines_to_read = shift or die '....';
my @lines_read;
while(<>) {
chomp;
push( @lines_read, $_ );
last if @lines_read == $lines_to_read;
}
return @lines_read;
}
sub readRecord {
return join(',',readNLines(50))
}
while (my $lines = readRecord()) {
my $output = `efetch -db protein -id $lines -format ipg|grep -v Start`;
print $output;
}
With a file protein_ids.txt, one protein ID per line, then run
Makes the old code I wrote look like crap. Thanks for posting an update!
Wondering what would be a good idea to do this on a mass scale, like all proteins in a genome?