Hello All,
This is probably a "newbie" question.
I am trying to take some standard BLAST output and map the alignments to Gene IDs, so that I can do enrichment/network analysis.
I am doing something out of the ordinary: I am looking at multiple microorganisms at once. I suspect this makes the conversion much harder, because some databases may not include homologs or hypothetical proteins. However, I am very new to this problem and have no previous knowledge of Gene ID systems.
Here is my output from BLASTing against an NCBI database. (I have thousands of lines; this is just a random example.)
queryNAME gi|367018053|ref|NC_016508.1| 90.20 51 5 0 1 51 788427 788477 1e-09 67.6
I can easily find this subject sequence in the NCBI nuccore database:
http://www.ncbi.nlm.nih.gov/nuccore/367018053
and, from there, a Gene record:
http://www.ncbi.nlm.nih.gov/gene?cmd=Retrieve&dopt=full_report&list_uids=11505342
I can easily parse all this from NCBI using EDirect: efetch -db nuccore -id "NC_016508" -mode xml
So, I now have three names:
GI 367018053
ACCESSION NC_016508
Gene symbol TDEL0H00120
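For reference, the GI and accession can be pulled out of each BLAST line with something like the sketch below. It assumes the default 12-column tabular output (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and a subject ID of the form gi|...|ref|...|; the names BlastHit and parse_blast_line are just my own, not from any library.

```python
import re
from typing import NamedTuple


class BlastHit(NamedTuple):
    query: str
    gi: str
    accession: str   # accession without the version suffix, e.g. NC_016508
    version: str     # version suffix, e.g. "1"
    pct_identity: float
    s_start: int     # hit start on the subject sequence
    s_end: int       # hit end on the subject sequence
    evalue: float
    bit_score: float


# Assumed subject ID layout: gi|<GI>|ref|<ACCESSION>.<VERSION>|
SUBJECT_RE = re.compile(r"gi\|(\d+)\|ref\|([A-Z]{2}_\d+)\.(\d+)\|")


def parse_blast_line(line: str) -> BlastHit:
    """Parse one line of 12-column tabular BLAST output."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    match = SUBJECT_RE.fullmatch(fields[1])
    if match is None:
        raise ValueError(f"unrecognised subject ID: {fields[1]}")
    gi, accession, version = match.groups()
    return BlastHit(
        query=fields[0],
        gi=gi,
        accession=accession,
        version=version,
        pct_identity=float(fields[2]),
        s_start=int(fields[8]),
        s_end=int(fields[9]),
        evalue=float(fields[10]),
        bit_score=float(fields[11]),
    )


EXAMPLE = "queryNAME gi|367018053|ref|NC_016508.1| 90.20 51 5 0 1 51 788427 788477 1e-09 67.6"
hit = parse_blast_line(EXAMPLE)
print(hit.gi, hit.accession, hit.s_start, hit.s_end)
```

This only gets as far as the GI and accession of the subject sequence; it does not by itself say which gene the hit falls in.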
There are many posts about the many available Gene ID converters (for example, "Gene Id Conversion Tool"). However, I seem to be in a catch-22: I don't know which database these IDs belong to, which is exactly what I need to know in order to convert them to another system. (I know roughly what they are, but apparently I need to be very specific when selecting from a long list of possibilities.) On the other hand, maybe I am unsuccessful because this is a hypothetical gene, so there is nothing to convert it to in other databases.
Can anyone offer some guidance on (1) how to convert these successfully and (2) more generally, whether there are special issues to consider when working with more than one organism?
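One thing I am unsure about: NC_ accessions are RefSeq records for whole genomic molecules (chromosomes, plasmids and so on), and the hit above sits at subject positions 788427 to 788477, so presumably the gene has to be found by position on that record rather than from the accession alone. A minimal sketch of what I have in mind is below. The gene table is entirely made up (placeholder names and coordinates, not real annotation), and Gene and genes_overlapping are my own hypothetical names; the real table would have to come from the record's feature annotation.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    locus_tag: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


# Made-up placeholder annotation for one subject sequence (NOT real data).
GENES = [
    Gene("GENE_A", 1000, 2500),
    Gene("GENE_B", 788000, 789000),
    Gene("GENE_C", 900000, 901200),
]


def genes_overlapping(genes: List[Gene], hit_start: int, hit_end: int) -> List[str]:
    """Return locus tags of genes whose interval overlaps the hit.

    BLAST reports start > end for minus-strand hits, so the
    coordinates are sorted before comparing.
    """
    lo, hi = sorted((hit_start, hit_end))
    return [g.locus_tag for g in genes if g.start <= hi and g.end >= lo]


print(genes_overlapping(GENES, 788427, 788477))
```

A linear scan like this is fine for a small table; with thousands of hits per genome, an interval tree or a sorted list with bisection would probably be the better choice.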
I've used tblastx against RefSeq databases for similar work. Are you using one of these?