Formatting problem when converting from UniProt to Entrez Gene ID format
7.3 years ago

My code below reads in a file of a subset of the DrugBank data, and then calls UniProt.ws() to map the UniProt IDs of the drug targets to Entrez Gene ID format. This code runs and generates the output file, but the output is incorrect, and I am confused by the following issues:

My input file contains 12,370 values; however, the mapped Entrez Gene ID dataset contains 12,530 values. Given the simple R script below, I'm not sure why these additional values are being introduced. Inspecting the output file, I see that for some of the listed UniProt values, the value looks like an Entrez ID (i.e., a number with no character prefix), and the corresponding value assigned in the Entrez column is "NA". Inspecting the UniProt values in the input file, there are no such non-UniProt values present, so I'm not sure where these problematic values are originating.

Also, and concerningly, the data in the MappedData output file does not match the UniProt IDs from the input file. For example, the first UniProt ID listed in the input file is P00734, whereas the first UniProt value in the MappedData output file is O95169.

If anyone can provide insight into what is wrong with my script below such that the UniProt IDs are not being correctly mapped to Entrez Gene format, I will greatly appreciate your guidance.

library(UniProt.ws)

DrugBank_Data <- read.csv("DrugBankData.csv")

TargetID_UniProt <- DrugBank_Data[,2]

# Stereotyped call that is always used to create a UniProt.ws object
up <- UniProt.ws(taxId=9606)

MappedData <- select(up, TargetID_UniProt, "ENTREZ_GENE")

write.csv(MappedData, "MappedData.csv")
R conversion UniProt Entrez • 2.2k views
7.3 years ago

I have never used the UniProt.ws package, so I don't know whether the select() function has any parameters that may help. There is a previous thread here: Extracting domain list for proteins Using UniProt.ws in R
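One thing that may be worth checking before the select() call, purely as a guess on my part: in R versions before 4.0, read.csv() turns text columns into factors by default, and a factor that gets coerced to its integer codes looks like a column of bare numbers rather than UniProt accessions. Below is a minimal sketch in base R. The three-row data frame and the DrugID column are made up to stand in for your DrugBank file, and I have not confirmed that this is what happens inside UniProt.ws.

```r
# Hypothetical stand-in for the DrugBank table; the rows are only illustrative
DrugBank_Data <- data.frame(
  DrugID = c("DB1", "DB2", "DB3"),
  UniProt = factor(c("P00734", "O95169", "P00734"))
)

keys_factor <- DrugBank_Data[, 2]

# Integer codes of the factor: bare numbers, not UniProt accessions
codes <- as.integer(keys_factor)

# Coerce to character and drop duplicates before handing the keys to a mapping call
keys <- unique(as.character(keys_factor))

print(codes)
print(keys)
```

If class(TargetID_UniProt) reports "factor" in your session, wrapping the column in as.character() before mapping would rule this out.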

Otherwise, may I recommend the use of biomaRt? I have used this in the past to convert between ENTREZ and RefSeq Official Gene Symbols. For the UniProt IDs in your data, I think that the code would be something like:

require(biomaRt)

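# Assumption: the Ensembl human mart is used here because the UniProt "unimart" is no longer served.
# Attribute and filter names change between Ensembl releases; check listAttributes(ensembl) and listFilters(ensembl).
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")

# "entrezgene_id" can be added to attributes if Entrez Gene IDs are wanted
annots <- getBM(mart=ensembl, attributes=c("uniprotswissprot", "ensembl_gene_id"), filters="uniprotswissprot", values=TargetID_UniProt, uniqueRows=TRUE)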

The uniqueRows parameter is important, and there are also other attributes that you can have returned, such as gene_biotype and external_gene_name.
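Whichever package does the mapping, the result will not necessarily have one row per input ID or come back in the input order, since one UniProt ID can map to several genes and some may not map at all. A rough sketch in base R of how that could be checked and how the result could be lined up with the input again. The IDs, the gene_id values, and the column names are all made up for illustration and are not real annotations.

```r
# Hypothetical input keys and a made-up mapping result (not real annotations)
input_ids <- c("P00734", "O95169", "Q00001")
annots <- data.frame(
  uniprot = c("O95169", "P00734", "P00734"),
  gene_id = c("G1", "G2", "G3"),
  stringsAsFactors = FALSE
)

# Keys that come back more than once (one-to-many mappings add rows)
multi <- unique(annots$uniprot[duplicated(annots$uniprot)])

# Keys with no mapping at all
unmapped <- setdiff(input_ids, annots$uniprot)

# Left join that keeps every input key, then restore the input order
input_df <- data.frame(uniprot = input_ids, idx = seq_along(input_ids),
                       stringsAsFactors = FALSE)
merged <- merge(input_df, annots, by = "uniprot", all.x = TRUE)
merged <- merged[order(merged$idx), c("uniprot", "gene_id")]

print(multi)
print(unmapped)
print(merged)
```

Comparing length(unique(input_ids)) with nrow(merged) should then show how many extra rows come from multiple mappings rather than from anything wrong in the script.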

Finally, there is a useful tutorial for using biomaRt, including with uniprot, here: http://www.ensembl.org/info/data/biomart/biomart_r_package.html

Hope that this helps

Kevin


Many thanks for your detailed response, Kevin; I have also been exploring another route, and will consider your suggestions along with my current efforts; I'll plan to post what works.


Okay, great! - Good luck
