Hi,
I know there have been multiple suggestions for getting gene IDs. I used my own method, and this post is to check whether I am doing it right. I have the oligonucleotide probe IDs for a set of transcription factors and genes that are part of the Apis mellifera gene regulatory network. What I did was:
1) I retrieved the sequences for the probes.
2) I used BLAST on NCBI to match these probe sequences against the honey bee gene dataset.
3) I extracted the Entrez IDs of all the hits into an Excel sheet.
4) I converted the Entrez IDs for all the hits into UniProt IDs on UniProt.
5) I used these UniProt IDs to do a functional annotation analysis on DAVID.
Please advise whether the methodology I followed is correct and, if not, suggest any changes I should make to it.
Edit: I understand that the NCBI IDs are also present in the file, so I could make do without the sequences. However, when I converted these Entrez IDs to UniProt IDs, only about 126 of a list of 10,000 got converted, which is far too few for my use, and that is why I am forced to try this method. I also tried converting the IDs on DAVID, but it does not recognize them. This is the sequence of the probes.
I have reservations about your steps 2 and 3. When running BLAST, how did you eventually choose the Entrez ID that matched each sequence? In many cases there must surely have been multiple matches, so how did you choose the correct one? You should also avoid using Excel as an intermediary step as much as you can. Excel assumes that the user is not an expert and makes many assumptions about what it should do with your data. Unless you know Excel inside out, try your best to avoid it.
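To make the "which hit did you choose" question concrete, here is a minimal sketch. It assumes the BLAST results were saved in the default 12-column tabular format (qseqid, sseqid, ..., evalue, bitscore), which is not stated in the thread. It keeps the highest-bitscore hit per probe and flags probes whose top score is shared by more than one subject, so those can be reviewed by hand. The probe and accession IDs are made up for illustration.

```python
import csv

# Hypothetical BLAST tabular output; columns follow the default 12-column layout:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
EXAMPLE = (
    "probe_1\tXM_000001.1\t100.0\t60\t0\t0\t1\t60\t1\t60\t1e-25\t111\n"
    "probe_1\tXM_000002.1\t95.0\t60\t3\t0\t1\t60\t1\t60\t1e-20\t95\n"
    "probe_2\tXM_000003.1\t100.0\t60\t0\t0\t1\t60\t1\t60\t1e-25\t111\n"
    "probe_2\tXM_000004.1\t100.0\t60\t0\t0\t1\t60\t5\t64\t1e-25\t111\n"
)


def best_hit_per_query(lines):
    """Return ({query: (subject, bitscore)}, {queries with a tied top score})."""
    best = {}
    tied = set()
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            # New best hit for this probe; any earlier tie no longer applies.
            best[query] = (subject, bitscore)
            tied.discard(query)
        elif bitscore == best[query][1] and subject != best[query][0]:
            # A different subject reaches the same top score: ambiguous.
            tied.add(query)
    return best, tied


best, tied = best_hit_per_query(EXAMPLE.splitlines())
for probe, (subject, score) in best.items():
    flag = " (tied, check manually)" if probe in tied else ""
    print(f"{probe}\t{subject}\t{score}{flag}")
```

Writing the result to a tab-separated text file instead of a spreadsheet also sidesteps the Excel concern.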
If you have probe IDs, then I assume that these IDs relate to an Affymetrix, Illumina, or Agilent microarray? There must, therefore, be an annotation file (possibly called a 'CDF') which will provide mappings for probes to gene names.
Hi Kevin,
Thanks for the reply. I ran a MegaBLAST restricted to a single species (Apis mellifera), so most probes have just one hit. Where there is more than one hit, they are all variants of the same gene. This is the list of probe IDs. As you can see, the nucleotide IDs are provided as well. However, when I converted the nucleotide IDs into UniProt IDs for DAVID, only 126 out of a total of 10,000 genes got converted, which is too few for my purpose. I will be wary of using Excel and will probably use a text file to store my IDs. Thank you for that suggestion. Is there another way out of the first predicament?
In the file to which you linked, there is already a fairly good mapping from probe ID to RefSeq ID (the NM and XM IDs). I would just use those for DAVID.
I tried that, but DAVID gives the error "You are either not sure which identifier type your list contains, or less than 80% of your list has mapped to your chosen identifier type." I set the identifier option to Refseq_mRNA and also tried the option ENTREZ_ID.
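Since the error is about the identifier type, one quick sanity check before uploading is to tally what the list actually contains. The sketch below is only a format check under stated assumptions: it treats NM_, NR_, XM_ and XR_ accessions (with or without a version) as RefSeq RNA IDs and all-digit strings as Entrez Gene IDs. It says nothing about whether DAVID knows a given ID. The example list is hypothetical.

```python
import re
from collections import Counter

# RefSeq RNA accessions: NM_, NR_, XM_ or XR_ prefix, digits, optional ".version".
REFSEQ_RNA = re.compile(r"^[NX][MR]_\d+(\.\d+)?$")
# Entrez Gene IDs are plain integers.
ENTREZ_GENE = re.compile(r"^\d+$")


def classify(identifier):
    """Label one identifier by its format only."""
    identifier = identifier.strip()
    if REFSEQ_RNA.match(identifier):
        return "refseq_rna"
    if ENTREZ_GENE.match(identifier):
        return "entrez_gene"
    return "other"


def tally(identifiers):
    """Count identifier formats, ignoring blank entries."""
    return Counter(classify(i) for i in identifiers if i.strip())


# Hypothetical mixed list for illustration.
ids = ["NM_12345.1", "XM_67890", "12345", "probe_7", ""]
counts = tally(ids)
total = sum(counts.values())
for kind, n in counts.most_common():
    print(f"{kind}\t{n}\t{n / total:.0%}")
```

If one format falls well short of 80% of the list, splitting the list by type before submission is worth trying.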
You may have to remove the suffix after the dot, which is the version number of the sequence record, that is:
NM_12345.1 should be NM_12345
If you save the list as a text file on Linux or macOS, you can remove this easily with sed.
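For anyone without sed to hand, the same suffix stripping can be sketched in Python. This is an illustration rather than the exact command meant above: it removes a trailing ".digits" from each accession and drops duplicates that appear once versions are gone. The example IDs are hypothetical apart from NM_12345.1.

```python
import re

# A trailing dot followed by digits, e.g. the ".1" in "NM_12345.1".
VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(accession):
    """Remove surrounding whitespace and any trailing version suffix."""
    return VERSION_SUFFIX.sub("", accession.strip())


def strip_versions(accessions):
    """Strip versions from all non-blank entries, keeping first-seen order."""
    stripped = (strip_version(a) for a in accessions if a.strip())
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(stripped))


ids = ["NM_12345.1", "NM_12345.2", "XM_67890.3\n", "XM_11111", ""]
cleaned = strip_versions(ids)
print("\n".join(cleaned))
```

The cleaned list can then be written to a plain text file, one ID per line, and pasted into DAVID.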
Thank you, Kevin. I ran a DAVID analysis with the identifier type set to Refseq_mRNA. Even though the matches are few (out of a list of 360, only 123 gave some kind of functional annotation result on DAVID), I guess that will have to do. If you know of other functional annotation tools that would be better than DAVID, please do let me know. If you put up a brief answer, I could select it as the accepted answer and close this question.