My code below reads a file containing a subset of the DrugBank data, and then calls UniProt.ws() to map the UniProt IDs of the drug targets to Entrez Gene IDs. The code runs and generates the output file, but the output is incorrect, and I am confused by the following issues:
My input file contains 12,370 values; however, the mapped Entrez Gene ID dataset contains 12,530 values. Given the simple R script below, I'm not sure why these additional values are being introduced. Inspecting the output file, I see that for some of the listed UniProt values, the value looks like an Entrez ID (i.e., a number with no character prefix), and the corresponding value assigned in the Entrez column is "NA". Inspecting the UniProt values in the input file, there are no such non-UniProt values present, so I'm not sure where these problematic values are originating.
Also, and concerningly, the data in the MappedData output file does not match the UniProt IDs from the input file. For example, the first UniProt ID listed in the input file is P00734, whereas the first UniProt value in the MappedData output file is O95169.
If anyone can provide insight into what is wrong with my script below such that the UniProt IDs are not being correctly mapped to Entrez Gene format, I will greatly appreciate your guidance.
library(UniProt.ws)
DrugBank_Data <- read.csv("DrugBankData.csv")
TargetID_UniProt <- DrugBank_Data[,2]
# Stereotyped call that is always used to create a UniProt.ws object
up <- UniProt.ws(taxId=9606)
MappedData <- select(up, TargetID_UniProt, "ENTREZ_GENE")
write.csv(MappedData, "MappedData.csv")
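A few base-R checks may help narrow down where the extra rows and the changed order come from. The sketch below is only illustrative: it uses a small made-up table in place of the real select() output, the Entrez values are invented, and the column names UNIPROTKB and ENTREZ_GENE are an assumption about what select() returns. It counts duplicated input IDs, lists IDs that map to more than one Entrez ID, and then realigns the mapping to the original input order.

```r
# Toy stand-ins for the real data; the Entrez values are made up.
input_ids <- c("P00734", "O95169", "P00734", "P12345")

# Hypothetical shape of a select() result: different order from the input,
# one row per (UniProt, Entrez) pair, NA where no mapping was found.
mapped <- data.frame(
  UNIPROTKB   = c("O95169", "P00734", "P00734", "P12345"),
  ENTREZ_GENE = c("1001", "2001", "2002", NA),
  stringsAsFactors = FALSE
)

# 1. How many input IDs are repeats of an earlier ID?
n_dup_input <- sum(duplicated(input_ids))

# 2. Which IDs map to more than one Entrez ID (one-to-many rows)?
hits_per_key <- table(mapped$UNIPROTKB)
multi_mapped <- names(hits_per_key)[hits_per_key > 1]

# 3. Collapse one-to-many mappings into a single string per UniProt ID,
#    dropping NA so unmapped IDs become an empty string for now.
collapsed <- tapply(
  mapped$ENTREZ_GENE, mapped$UNIPROTKB,
  function(x) paste(x[!is.na(x)], collapse = ";")
)

# 4. Rebuild a table with exactly one row per input ID, in input order.
aligned <- data.frame(
  UNIPROTKB   = input_ids,
  ENTREZ_GENE = as.vector(collapsed[input_ids]),
  stringsAsFactors = FALSE
)
aligned$ENTREZ_GENE[aligned$ENTREZ_GENE == ""] <- NA

print(n_dup_input)
print(multi_mapped)
print(aligned)
```

On the real data, input_ids would be TargetID_UniProt and mapped would be MappedData. If the row count still differs after collapsing, the numeric-looking keys in the output would need a separate look, since this sketch does not explain them.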
Many thanks for your detailed response, Kevin. I have also been exploring another route, and will consider your suggestions alongside my current efforts. I plan to post what works.
Okay, great! - Good luck