Question

Converting gene symbols to protein (uniprot) ids gives multiple matches per gene symbol. Why?

0

3.9 years ago

peter.berry5 ▴ 60

I used the bitr function from the clusterProfiler package to convert gene symbols from a DE experiment to UniProt protein ids. For some unique gene symbols, there are multiple UniProt ids.

Surely each gene id should map to a single protein and each protein has a unique id. So is my code correct and does it matter that there are multiple UniProt ids for a single gene?

My code is

library(clusterProfiler)
library(org.Hs.eg.db)
library(dplyr)
Genes <- c("AACS", "ACAA2", "ACADM", "ACLY", "ACOT8")
Protein_IDs <- bitr(Genes, fromType="SYMBOL", toType="UNIPROT", OrgDb="org.Hs.eg.db") # returns 15 rows
test <- distinct(Protein_IDs, UNIPROT, .keep_all = TRUE) # returns 15 rows

SYMBOL UNIPROT
AACS    Q86V21
AACS    A0A024RBV2
ACAA2   B3KNP8
ACAA2   P42765
ACADM   A0A0S2Z366
ACADM   P11310
ACADM   B7Z9I1
ACADM   Q5HYG7
ACADM   Q5T4U5
ACADM   B4DJE7
ACLY    A0A024R1T9
ACLY    P53396
ACLY    Q4LE36
ACLY    A0A024R1Y2
ACOT8   O14734

Uniprot r • 6.8k views

ADD COMMENT • link updated 3.9 years ago by cpad0112 21k • written 3.9 years ago by peter.berry5 ▴ 60
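
A minimal base R sketch of what is going on in the table above, assuming only the 15 rows shown in the question. It illustrates two points: de-duplicating on UNIPROT cannot shrink the table because every accession is already distinct, and the one-to-many relationship sits on the SYMBOL side. The collapsed table at the end is just one possible way to get one row per gene, not something bitr does itself.

```r
# Mapping table copied from the bitr() output in the question
Protein_IDs <- data.frame(
  SYMBOL  = c("AACS", "AACS", "ACAA2", "ACAA2",
              rep("ACADM", 6), rep("ACLY", 4), "ACOT8"),
  UNIPROT = c("Q86V21", "A0A024RBV2", "B3KNP8", "P42765",
              "A0A0S2Z366", "P11310", "B7Z9I1", "Q5HYG7", "Q5T4U5", "B4DJE7",
              "A0A024R1T9", "P53396", "Q4LE36", "A0A024R1Y2", "O14734"),
  stringsAsFactors = FALSE
)

# Index of the first duplicated accession, or 0 if there is none;
# with no duplicates, distinct(..., UNIPROT) has nothing to remove
n_dup <- anyDuplicated(Protein_IDs$UNIPROT)

# Number of accessions returned per gene symbol
ids_per_symbol <- table(Protein_IDs$SYMBOL)

# One row per symbol, with all accessions collapsed into a single string
collapsed <- aggregate(UNIPROT ~ SYMBOL, data = Protein_IDs,
                       FUN = paste, collapse = ";")
```

Collapsing keeps every accession; whether to keep all of them or only the reviewed one depends on what the downstream analysis needs.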

1

This seems to use a really lenient mapping with unreviewed entries etc. You may have better luck using biomaRt.

ADD REPLY • link 3.9 years ago by Ram 45k

0

The HUGO entry for ACADM indeed lists only one UniProt accession.

You can download an official list of human gene symbols and their corresponding UniProt IDs from the HUGO site using a custom download. Select the fields you want in the output.

ADD REPLY • link 3.9 years ago by GenoMax 151k

0

Apologies. I actually wanted to report a solution I found using the "mygene" package. I hit the delete button in error.

Edit:

see below for an update and further query

ADD REPLY • link 3.9 years ago by peter.berry5 ▴ 60

0

Please edit your answer and add some code so people facing similar problems will have a starting point for their solutions.

ADD REPLY • link 3.9 years ago by Ram 45k

0

3.9 years ago

Elisabeth Gasteiger ★ 2.4k

You could also use the UniProt ID mapping service at https://www.uniprot.org/uploadlists. It is possible to map to UniProtKB/Swiss-Prot only, i.e. to exclude unreviewed entries from the results. This can also be done programmatically, as described at https://www.uniprot.org/help/api_idmapping

ADD COMMENT • link 3.9 years ago by Elisabeth Gasteiger ★ 2.4k

score 1 · Accepted Answer · 2021-07-30

1

3.9 years ago

peter.berry5 ▴ 60

I used the following package and code to get the UniProt IDs and names of the proteins that my DE genes code for.

library(mygene)
Genes <- c("AACS", "ACAA2", "ACADM", "ACLY", "ACOT8")
Protein_IDs <- queryMany(Genes, scopes = "symbol",
                         fields = c("name", "uniprot", "ensemblgene"),
                         species = "human", as_dataframe = "True")

and then used

df1 <- data.frame(hmap_4$query)
df2 <- data.frame(hmap_4$name)
df3 <- data.frame(hmap_4$uniprot.Swiss.Prot)

to extract the info I wanted.

However, today the last line gave the following error

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows: 0, 1, 2

When I looked at the structure of Protein_IDs (which has class DFrame), I noticed that while "query" and "name" are both of type character with 615 observations, "uniprot.Swiss.Prot" is of type list with 615 entries.

Can anybody advise how I extract this list? Thanks

ADD COMMENT • link 3.9 years ago by peter.berry5 ▴ 60
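
The error above is what data.frame() reports when it is handed a list whose elements have different lengths: each element is treated as a separate column, and lengths of 0, 1 and 2 cannot be lined up as rows. A minimal base R sketch of one way to flatten such a list column follows. The gene names and accessions are hypothetical placeholders, not real mygene output, and collapsing with ";" is an assumption about the desired format.

```r
# Hypothetical stand-in for a list column such as uniprot.Swiss.Prot:
# one entry with a single ID, one empty entry, one entry with two IDs
query <- c("GENE1", "GENE2", "GENE3")
swiss <- list("ACC1", character(0), c("ACC2", "ACC3"))

# data.frame() on the raw list fails because the elements differ in length
failed <- inherits(try(data.frame(swiss), silent = TRUE), "try-error")

# Collapse each element to exactly one string; empty entries become NA
flat <- vapply(swiss, function(x) {
  if (length(x) == 0) NA_character_ else paste(x, collapse = ";")
}, character(1))

# The flattened vector now has one value per query and fits in a data frame
df3 <- data.frame(query = query, uniprot = flat, stringsAsFactors = FALSE)
```

vapply() is used rather than sapply() so that the result is guaranteed to be a character vector with one value per gene, even when some entries are empty.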

0

What is the source of hmap_4? Please post the output of str(hmap_4).

ADD REPLY • link 3.9 years ago by cpad0112 21k

0

Apologies, hmap_4 is a typo carried over from a different analysis. The lines referring to df should read

df1 <- data.frame(Protein_IDs$query)
df2 <- data.frame(Protein_IDs$name)
df3 <- data.frame(Protein_IDs$uniprot.Swiss.Prot)

str(Protein_IDs)

gives

Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  ..@ rownames       : NULL
  ..@ nrows          : int 443
  ..@ listData       :List of 7
  .. ..$ query             : chr [1:443] "AACS" "ABCB6" "ABCF1" "ABT1" ...
  .. ..$ _id               : chr [1:443] "65985" "10058" "23" "29777" ...
  .. ..$ X_score           : num [1:443] 87.7 90.3 87.9 88.4 87.1 ...
  .. ..$ name              : chr [1:443] "acetoacetyl-CoA synthetase" "ATP binding cassette subfamily B member 6 (Langereis blood group)" "ATP binding cassette subfamily F member 1" "activator of basal transcription 1" ...
  .. ..$ notfound          : logi [1:443] NA NA NA NA NA NA ...
  .. ..$ uniprot.Swiss.Prot:List of 443
  .. .. ..$ : chr "Q86V21"
  .. .. ..$ : chr "Q9NP58"

etc.

ADD REPLY • link 3.9 years ago by peter.berry5 ▴ 60

0

Protein_IDs is already in a specialized data frame format (DFrame). Use as.data.frame(Protein_IDs) to convert the entire object to a data frame and then extract the column of interest, or Protein_IDs["uniprot.Swiss.Prot"] to get only the Swiss-Prot entries.

ADD REPLY • link 3.9 years ago by cpad0112 21k

0

I tried

Protein_IDs["uniprot.Swiss.Prot"]

but didn't quite get the solution I wanted. However, your explanation of DFrame, which I hadn't encountered before, led me to the following code

    Symbol <- c("AACS", "ACAA2", "ACADM", "ACLY", "ACOT8")
    FC <- c("1", "1.5", "-.2", "6", "-10")
    Gene <- data.frame(Symbol, FC)
    names(Gene)[1] <- "HGNC_Symbol"
    Protein.IDs <- queryMany(Gene$HGNC_Symbol, scopes = "symbol",
                             fields = c("name", "uniprot", "ensemblgene"),
                             species = "human", as_data.frame = "True")
    df3 <- Protein.IDs["uniprot.Swiss.Prot"]
    df4 <- as.data.frame(matrix(unlist(Protein.IDs),
                                nrow=length(unlist(Protein.IDs[1]))))
    df4 <- dplyr::select(df4, c(V1, V4, V5))
    names(df4)[1] <- "HGNC_Symbol"
    names(df4)[2] <- "Protein"
    names(df4)[3] <- "uniprot.ID"
    df5 <- dplyr::full_join(df4, Gene, by.x = "HGNC_Symbol",
                            by.y = "HGNC_Symbol")

which now gives the following error

Joining, by = "HGNC_Symbol"
Error: Can't join on `x$HGNC_Symbol` x `y$HGNC_Symbol` because of incompatible types.
i `x$HGNC_Symbol` is of type <list>>.
i `y$HGNC_Symbol` is of type <character>>.

Creating df5, a data frame with gene symbol, FC, protein name and UniProt ID, is the ultimate goal.

ADD REPLY • link 3.9 years ago by peter.berry5 ▴ 60
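
The join error above says the key column in df4 is a list while the key in Gene is character. Note also that by.x and by.y are arguments of base merge(); dplyr::full_join() takes by, which is why the "Joining, by = ..." message shows the key being guessed. A minimal base R sketch of one possible fix follows, using hypothetical placeholder values rather than real query results, and assuming every key entry holds exactly one symbol.

```r
# Hypothetical stand-ins for df4 (list-typed key column) and Gene
df4 <- data.frame(Protein = c("protein A", "protein B"),
                  stringsAsFactors = FALSE)
df4$HGNC_Symbol <- list("GENE1", "GENE2")   # list column, as in the error
Gene <- data.frame(HGNC_Symbol = c("GENE1", "GENE2", "GENE3"),
                   FC = c(1, 1.5, -0.2),    # numeric here, an assumption
                   stringsAsFactors = FALSE)

# Flattening is only safe when each list element has length 1
stopifnot(all(lengths(df4$HGNC_Symbol) == 1))
df4$HGNC_Symbol <- unlist(df4$HGNC_Symbol)

# Full outer join in base R; all = TRUE keeps unmatched rows from both sides
df5 <- merge(df4, Gene, by.x = "HGNC_Symbol", by.y = "HGNC_Symbol", all = TRUE)
```

With the key stored as character on both sides, the same join should also work with dplyr::full_join(df4, Gene, by = "HGNC_Symbol").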

1

> library(mygene)
> library(dplyr)
> HGNC_Symbol <- c("AACS", "ACAA2", "ACADM", "ACLY", "ACOT8")
> FC <- c("1", "1.5", "-0.2", "6", "-10") 
> df1 <- data.frame(HGNC_Symbol, FC)
> Protein.IDs <- queryMany(df1$HGNC_Symbol, scopes = "symbol", 
+                          fields = c("name", "uniprot", "ensemblgene"), 
+                          species = "human", as_data.frame = "True")
Finished
> df2=as.data.frame(Protein.IDs[c("query","name","uniprot.Swiss.Prot")])
> df1 %>% 
+     inner_join(df2, by=c("HGNC_Symbol"="query"))
  HGNC_Symbol   FC                                name uniprot.Swiss.Prot
1        AACS    1          acetoacetyl-CoA synthetase             Q86V21
2       ACAA2  1.5        acetyl-CoA acyltransferase 2             P42765
3       ACADM -0.2 acyl-CoA dehydrogenase medium chain             P11310
4        ACLY    6                   ATP citrate lyase             P53396
5       ACOT8  -10             acyl-CoA thioesterase 8             O14734

There seems to be a discrepancy between the output I am getting and what you posted. Please check your R and R package versions.

ADD REPLY • link 3.9 years ago by cpad0112 21k