from GI or protein ID to genome coordinates
4
0
Entering edit mode
8.1 years ago

Hi!

I have a list of protein blast hits in RefSeq ID format (e.g., WP_012146310.1) , and now I want to retrieve genome coordinates of genes encoding these proteins.

I am using Entrez Direct to cope with it. Having the RefSeq ID, I got gi (501096268) and genome accession number (NC_009832.1), but I cannot get the gene coordinates, although I know that the coordinates are presented in the nuccore database...

I will appreciate your suggestions on this issue. Thanks!

GI Entrez Direct genome coordinates RefSeq • 3.7k views
ADD COMMENT
5
Entering edit mode
8.1 years ago

I have inquired NCBI support service about this question. This is their answer:

The information is stored in the identical protein report. https://www.ncbi.nlm.nih.gov/protein/WP_012146310.1?report=ipg

The following is an example of multiple wp accession input, combined with grep to retrieve only those refseq genome lines:

$ efetch -db protein -id WP_012146310.1,WP_011083771.1 -format ipg |grep -P "Start|WP_" 
Source Nucleotide Accession Start Stop Strand Protein Protein Name Organism Strain Superkingdom
RefSeq NC_009832.1 3981847 3982188 + WP_012146310.1 hypothetical protein [Serratia proteamaculans] Serratia proteamaculans 568 568 Bacteria
RefSeq NC_004463.1 1037016 1037285 - WP_011083771.1 hypothetical protein [Bradyrhizobium diazoefficiens] Bradyrhizobium diazoefficiens USDA 110 USDA 110 Bacteria
ADD COMMENT
0
Entering edit mode

Makes the old code I wrote look like crap. Thanks for posting an update!

ADD REPLY
0
Entering edit mode

Wondering what would be a good idea to do this on a mass scale, like all proteins in a genome?

ADD REPLY
1
Entering edit mode
8.1 years ago
Jenez ▴ 540

I can't think of a straight up 'single line of code' answer for you, hopefully someone else can.

What I would do is to download the genome genbank file, parse it using Biopython, and look for features (with info such as db_xref="GI:501096268") and grab the genomic coordinates from there. It might take a bit of reading about biopython (I'm not sure if you know python so there's that...). If you need to do this for several genomes, just set up a script to download the genbank file for the genome first and then run the above.

I tried to grab genomic sequences from protein ID's many months ago which entailed querying NCBI for identical protein reports and grabbing the coordinates from there but it was very cumbersome work.

Sorry, that's the best I can think of. Best of luck!

ADD COMMENT
1
Entering edit mode
7.6 years ago
cmdcolin ★ 4.0k

To add onto the answer by @andrey.v.shubin I made a small script to try to query a large text file programmatically, if you have a file with a large list of protein IDs you can try and speed up NCBI for 50 protein IDs at a time

#!/usr/bin/env perl
use strict;
use warnings;

sub readNLines {
    my $lines_to_read = shift or die '....';
    my @lines_read;
    while(<>) {
        chomp;
        push( @lines_read, $_ );
        last if @lines_read == $lines_to_read;
    }
    return @lines_read;
}

sub readRecord {
    return join(',',readNLines(50))
}

while (my $lines = readRecord()) {
    my $output = `efetch -db protein -id $lines -format ipg|grep -v Start`;
    print $output;
}

With a file protein_ids.txt, one protein ID per line, then run

perl process.pl < protein_ids.txt > coords.txt
ADD COMMENT
0
Entering edit mode
8.1 years ago

I have limited experience using Entrez Direct, but I believe you want to include 'gene location [PROP]' in your query to obtain gene coordinates.

ADD COMMENT

Login before adding your answer.

Traffic: 1802 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6