Question

Converting BLAST Alignments (NCBI database) to Gene ID

9.5 years ago

jeremy.cox.2 ▴ 130

Hello All,

This is probably a "newbie" question.

I am trying to take some standard BLAST output and map the alignments to Gene IDs so that I can do enrichment/network analysis.

Now I am doing something out of the ordinary: I am looking at multiple microorganisms at once. I think this might be a major difficulty in converting: some databases may not include homologs or hypothetical proteins. However, I am very new to this problem, having no previous knowledge of Gene ID systems.

Here is my output from blasting against an NCBI database. (Obviously, I have thousands of lines; this is just a random example.)

queryNAME  gi|367018053|ref|NC_016508.1|   90.20   51      5       0       1       51      788427  788477  1e-09   67.6
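For readers following along, here is a minimal Python sketch of pulling the GI and accession out of a line like the one above. It assumes the line is BLAST's default 12-column tabular output (`-outfmt 6`), which matches the column count shown but is not stated in the post; the names `BlastHit`, `parse_hit`, and `split_subject_id` are made up for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class BlastHit(NamedTuple):
    # Assumed column order: the default fields of BLAST tabular output.
    query: str
    subject: str
    pident: float
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


def parse_hit(line: str) -> BlastHit:
    """Parse one whitespace-separated tabular BLAST line into a BlastHit."""
    f = line.split()
    if len(f) != 12:
        raise ValueError(f"expected 12 columns, got {len(f)}")
    return BlastHit(f[0], f[1], float(f[2]), *map(int, f[3:10]),
                    float(f[10]), float(f[11]))


def split_subject_id(subject: str) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Split an ID such as 'gi|367018053|ref|NC_016508.1|' into
    (GI, accession without version, accession with version)."""
    parts = subject.strip("|").split("|")
    # The ID is a sequence of tag|value pairs, e.g. gi|<number>|ref|<accession>.
    tags = dict(zip(parts[0::2], parts[1::2]))
    acc_ver = tags.get("ref")
    accession = acc_ver.split(".")[0] if acc_ver else None
    return tags.get("gi"), accession, acc_ver


line = "queryNAME  gi|367018053|ref|NC_016508.1|   90.20   51  5  0  1  51  788427  788477  1e-09   67.6"
hit = parse_hit(line)
gi, accession, acc_ver = split_subject_id(hit.subject)
print(gi, accession, hit.sstart, hit.send)
```

Keeping the subject coordinates (`sstart`, `send`) alongside the accession is deliberate: they are needed later if the hit has to be placed on a long genomic record.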

So I can easily find this in the NCBI database:

http://www.ncbi.nlm.nih.gov/nuccore/367018053

and then

http://www.ncbi.nlm.nih.gov/gene?cmd=Retrieve&dopt=full_report&list_uids=11505342

I can easily parse all this from NCBI using EDirect: efetch -db nuccore -id "NC_016508" -mode xml

So, I now have three names:

GI    367018053
ACCESSION NC_016508
Gene symbol   TDEL0H00120
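One caveat worth sketching: RefSeq `NC_` accessions generally denote whole genomic molecules such as chromosomes, and the subject coordinates in the hit (788427 to 788477) suggest that is the case here. If so, the accession alone may not identify a single gene, and the hit would have to be matched to a gene by position. The Python below is a rough illustration of that overlap lookup. The gene names and coordinates in `example_genes` are invented placeholders, not real annotations for NC_016508; in practice they would come from the record's feature table.

```python
from typing import List, Tuple

# (gene name, start, end), 1-based inclusive coordinates on the subject sequence.
Gene = Tuple[str, int, int]


def genes_overlapping(hit_start: int, hit_end: int, genes: List[Gene]) -> List[str]:
    """Return names of genes whose interval overlaps the hit interval.

    BLAST reports sstart > send for minus-strand hits, so the two
    coordinates are sorted before comparing.
    """
    lo, hi = sorted((hit_start, hit_end))
    return [name for name, start, end in genes if start <= hi and end >= lo]


# Hypothetical annotations, for illustration only.
example_genes = [
    ("GENE_A", 787000, 788450),
    ("GENE_B", 788470, 790000),
    ("GENE_C", 795000, 796000),
]

print(genes_overlapping(788427, 788477, example_genes))
```

A linear scan like this is fine for a few thousand genes per record; for larger annotation sets an interval tree or a sorted list with bisection would scale better.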

There are many posts about the plentiful Gene ID converters available (for example, "Gene Id Conversion Tool"). However, I seem to have a "Catch-22": I don't know what database these IDs belong to, which is ultimately necessary for converting to another system. (I mean, I generally know what these are, but apparently I need to be very specific in selecting from a big list of possibilities.) On the other hand, maybe I am being unsuccessful because this is a hypothetical gene, so there is nothing to convert it to in other lists.

Can anyone offer some guidance on (1) how to convert these successfully and (2) more generally, whether there are special issues to consider when not using a single organism?

BLAST NCBI GENE-ID • 5.3k views

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 9.5 years ago by jeremy.cox.2 ▴ 130

I've used tblastx against RefSeq databases for similar work. Are you using one of these?

ADD REPLY • link 9.5 years ago by burkhart.joshua ▴ 30

Ram · Answer 1 · 2015-11-02

9.5 years ago

Jean-Karim Heriche 27k

All these are GenBank identifiers. They are explained here.

ADD COMMENT • link 9.5 years ago by Jean-Karim Heriche 27k

Yes. So, for example, I would expect these IDs to convert using the UniProt converter:

http://www.uniprot.org/uploadlists/

However, identifying these as "GI number*" or "EMBL/GenBank/DDBJ" returns no results.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by jeremy.cox.2 ▴ 130

It looks like you don't get IDs that correspond/map to proteins. As you point out, your example is a hypothetical gene so it may not be represented by a protein in UniProt.

If you're trying to identify UniProt proteins, why not blastx your nucleotide sequences directly against a UniProt database?
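If the blastx route is taken, the subject IDs would be UniProt identifiers rather than GenBank ones. As a small sketch, assuming the database was built from a UniProt FASTA file so that subject IDs keep the `db|accession|entry_name` form (for example `sp|P69905|HBA_HUMAN`), the accession could be extracted like this. The function name `parse_uniprot_id` is invented for illustration.

```python
from typing import Dict


def parse_uniprot_id(subject: str) -> Dict[str, str]:
    """Split a UniProt FASTA-style ID 'db|accession|entry_name' into parts.

    'sp' marks reviewed Swiss-Prot entries and 'tr' unreviewed TrEMBL entries.
    """
    parts = subject.split("|")
    if len(parts) != 3 or parts[0] not in ("sp", "tr"):
        raise ValueError(f"not a UniProt-style ID: {subject!r}")
    db, accession, entry_name = parts
    return {"db": db, "accession": accession, "entry_name": entry_name}


print(parse_uniprot_id("sp|P69905|HBA_HUMAN")["accession"])
```

The accession is the stable key to carry into downstream ID mapping; the entry name can change between releases.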

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Jean-Karim Heriche 27k