Question

Retrieve Chromosome Number And Position From Gene Id In Danio Rerio

13.0 years ago

Eric Normandeau 11k

Hi,

I'm working on a project in which I want to know where the proteins for which I have nucleotide sequences in one fish species are located (chromosome and position) on the Danio rerio (zebrafish) genome. I BLAST my sequences against the Danio rerio transcriptome, extracted from the 'nr' database, and then get gene IDs in the following format:

 gi|47087391|ref|NP_998590.1|
 gi|56090491|ref|NP_001007792.1|
 gi|169154248|emb|CAQ15172.1|
 gi|189523697|ref|XP_001341635.2|
 gi|189526610|ref|XP_687146.3|

From these, I would like to know the chromosome number and position on the chromosome of these genes on the Danio rerio genome (Zv9). Given that I have close to a thousand of these IDs, I want this process to be automated.

I can browse the zebrafish genome on different genome browsers, but how can I automate my search?

Many thanks

genome search • 7.7k views

If it's the nr database, then those are not "gene IDs". The gi is a unique identifier for the protein database; the second part is a protein accession.

ADD REPLY • link 13.0 years ago by Neilfws 49k

Ok, noted. Given your answer, I looked for another option and found what I needed. I'll post it as an answer.

ADD REPLY • link 13.0 years ago by Eric Normandeau 11k

score 2 · Answer 1 · 2011-12-16

This is actually quite a tricky problem, for several reasons.

Your first identifier is a protein GI, a unique identifier used by NCBI Entrez. There's no simple way to go from a GI to a chromosomal location using NCBI data or services.
The second identifier is a protein accession, but these accessions belong to several different databases. For example, NP_ is RefSeq, XP_ is RefSeq predicted, and CAQ15172.1 is EMBL. This makes it difficult to query services such as BioMart unless you run separate queries for each type of accession.
Your main problem, though, is that you are using proteins to get to nucleotide data.
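
To make the point about mixed accession types concrete, here is a minimal Python sketch that splits identifiers of the form shown in the question and groups the accessions by prefix, so that each group could be queried separately. The function names and the "other" bucket are my own choices for illustration; the prefix meanings are only the ones mentioned above.

```python
from collections import defaultdict

# Prefix-to-source mapping taken from the answer text; any other accession
# (for example the EMBL accession CAQ15172.1) falls into "other".
PREFIX_SOURCES = {"NP_": "RefSeq", "XP_": "RefSeq predicted"}


def parse_nr_id(identifier):
    """Split 'gi|47087391|ref|NP_998590.1|' into (gi, db_tag, accession)."""
    fields = identifier.strip().strip("|").split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError(f"unexpected identifier format: {identifier!r}")
    return fields[1], fields[2], fields[3]


def group_by_source(identifiers):
    """Group accessions by the database implied by their prefix."""
    groups = defaultdict(list)
    for identifier in identifiers:
        _gi, _db_tag, accession = parse_nr_id(identifier)
        source = PREFIX_SOURCES.get(accession[:3], "other")
        groups[source].append(accession)
    return dict(groups)


if __name__ == "__main__":
    ids = [
        "gi|47087391|ref|NP_998590.1|",
        "gi|169154248|emb|CAQ15172.1|",
        "gi|189523697|ref|XP_001341635.2|",
    ]
    for source, accessions in group_by_source(ids).items():
        print(source, accessions)
```

This only sorts the identifiers; it does not by itself resolve any of them to a chromosomal location.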

If I were using BLAST for this purpose I would:

Download the nucleotide sequences of D. rerio chromosomes
Format them as a BLAST database
BLAST search using tblastn (if my queries were protein sequences) or blastn (if my queries were transcript sequences)

And then my BLAST report would contain chromosome and location.
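
As a rough sketch of the last step, the following Python reads a tabular BLAST report and keeps the best-scoring hit per query, giving chromosome, start, end and strand. It assumes the search was run with BLAST+ and `-outfmt 6` (the default 12 tab-separated columns) and that the subject IDs are the chromosome names; neither detail is stated above, and `best_hits` is a name I made up for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output ("-outfmt 6").
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text):
    """Return {query: hit} keeping only the highest-bitscore hit per query."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < len(COLUMNS) or row[0].startswith("#"):
            continue  # skip blank, truncated, or comment lines
        hit = dict(zip(COLUMNS, row))
        score = float(hit["bitscore"])
        query = hit["qseqid"]
        if query in best and score <= best[query]["bitscore"]:
            continue
        sstart, send = int(hit["sstart"]), int(hit["send"])
        best[query] = {
            "chromosome": hit["sseqid"],
            "start": min(sstart, send),
            "end": max(sstart, send),
            # Reverse-strand hits are reported with sstart > send.
            "strand": "+" if sstart <= send else "-",
            "bitscore": score,
        }
    return best


if __name__ == "__main__":
    # Made-up report line, for illustration only.
    report = "q1\tchr9\t90.0\t100\t10\t0\t1\t100\t5000\t5299\t1e-50\t200\n"
    print(best_hits(report))
```

Taking only the top hit is a simplification; with spliced genes a single query will usually hit several exons, so in practice you may want to merge nearby hits on the same chromosome.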

score 1 · Answer 2 · 2011-12-16

Given the insights and recommendations from @neilfws, I used Ensembl to retrieve all the protein sequences from Danio rerio in FASTA format. This solution works better for me than blasting against the nucleotide sequences, which I had already done, since I really only want to blast against coding regions. Moreover, the sequence names now contain the information I need, i.e., the chromosome to which they belong, as opposed to what I was getting from my Danio subset of the 'nr' database.

I can now blast my sequences using blastx and know on what chromosome they hit.
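
For anyone wanting to pull the location out of the sequence names automatically, here is a small Python sketch. It assumes the FASTA headers carry a field laid out as `coord_system:assembly:name:start:end:strand` (for example `chromosome:Zv9:9:...`), which is how I understand Ensembl peptide headers to look; check your own file before relying on it. The header in the usage example is made up.

```python
import re

# Assumed layout: coord_system:assembly:name:start:end:strand
LOCATION = re.compile(
    r"\s(\w+):([^:\s]+):([^:\s]+):(\d+):(\d+):(-?1)(?=\s|$)"
)


def parse_location(header):
    """Extract the genomic location from one FASTA header, or None."""
    match = LOCATION.search(header)
    if match is None:
        return None
    coord_system, assembly, name, start, end, strand = match.groups()
    return {
        "coord_system": coord_system,
        "assembly": assembly,
        "name": name,  # chromosome number or scaffold name
        "start": int(start),
        "end": int(end),
        "strand": int(strand),
    }


if __name__ == "__main__":
    # Hypothetical header with invented IDs and coordinates.
    header = (">ENSDARP00000000001 pep:known chromosome:Zv9:9:1000:2000:-1 "
              "gene:ENSDARG00000000001 transcript:ENSDART00000000001")
    print(parse_location(header))
```

Combined with the blastx hits, this gives both the chromosome and the gene's coordinates on it, not just the chromosome number.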

Thanks for the suggestions!

score 0 · Answer 3 · 2011-12-14

13.0 years ago

Damian Kao 16k

You can download the genebank format file for those sequences and extract the feature information. For example, for one of the genes you listed: http://www.ncbi.nlm.nih.gov/protein/CAQ15172.1

If you scroll down in the genebank file, you can see that there is feature information and that the source of this protein is on chromosome 9.

You can use BioPerl's efetch module to download the genebank files, and then use the SeqIO module to parse them and extract the feature information.
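
As a rough illustration of the parsing step (in Python rather than BioPerl, and using only the standard library), the sketch below scans already-downloaded flat-file text for the `/chromosome` qualifier of each record. It assumes records end with a `//` line and carry a `VERSION` line; the function name is hypothetical. Note that this recovers the chromosome only: feature coordinates in a protein record refer to the protein sequence, not to a position on the chromosome.

```python
import re

CHROMOSOME = re.compile(r'/chromosome="([^"]+)"')
VERSION = re.compile(r"^VERSION\s+(\S+)", re.MULTILINE)


def chromosomes_from_flatfile(text):
    """Map each record's VERSION accession to its /chromosome value (or None)."""
    result = {}
    for record in text.split("\n//"):  # "//" terminates each record
        version = VERSION.search(record)
        if version is None:
            continue  # trailing whitespace or a fragment without a header
        chromosome = CHROMOSOME.search(record)
        result[version.group(1)] = chromosome.group(1) if chromosome else None
    return result


if __name__ == "__main__":
    # Heavily abridged, hand-written record for illustration only.
    sample = (
        "LOCUS       CAQ15172\n"
        "VERSION     CAQ15172.1\n"
        "FEATURES             Location/Qualifiers\n"
        "     source          1..100\n"
        '                     /organism="Danio rerio"\n'
        '                     /chromosome="9"\n'
        "//\n"
    )
    print(chromosomes_from_flatfile(sample))
```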

The problem with this approach: the NCBI link in the answer is a GenPept file, not GenBank, which means the coordinates are for the protein sequence, not the chromosome. Also note it's GenBank, not genebank.

ADD REPLY • link 13.0 years ago by Neilfws 49k