I found a good solution but it is really cumbersome.
I use UniProt.ws to derive unique gene names (readable names) and also ambiguous names. I then collapse the columns of unique and ambiguous names into one column, so I have all possible combinations per gene/protein. Later, when I want to integrate with another data set, I separate each possible name into one row so any potential match will work.
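# Placeholder path: read your proteomics table from here into a data frame
# with a UniprotAccession column before running the rest (data$UniprotAccession is used below)
data <- "data/to/protoemics/"
# Find the gene names corresponding to each UniProt ID
library(UniProt.ws)
library(dplyr) # for %>% and mutate()
library(tidyr) # for separate_rows()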
new_names <- mapUniProt("UniProtKB_AC-ID", "Gene_Name", query = data$UniprotAccession) # unique gene names (UniProt)
new_names2 <- mapUniProt("UniProtKB_AC-ID", "UniProtKB", query = data$UniprotAccession) # UniProtKB contains a lot of info and multiple protein names
new_names3 <- merge(new_names, new_names2[c("From", "Length", "Gene.Names")], by = "From")
proteomics_named <- merge(new_names3, data, by.x = "From", by.y = "UniprotAccession")
# Combine the two columns with a space (Gene.Names contains only the ambiguous gene names and needs to be combined with the unique gene names in To)
proteomics_named <- proteomics_named %>%
  mutate(Gene.Names = paste(Gene.Names, To, sep = " "))
# Duplicated gene names may now occur within a cell; this function cleans them up
remove_duplicates <- function(text) {
  words <- strsplit(text, " ")[[1]]   # Split the string into words
  unique_words <- unique(words)       # Remove duplicate words
  paste(unique_words, collapse = " ") # Collapse into a single string
}
# Apply the function to the Gene.Names column to remove duplicated names
proteomics_named <- proteomics_named %>%
  mutate(Gene.Names = vapply(Gene.Names, remove_duplicates, character(1), USE.NAMES = FALSE))
# Now put each gene name on its own row (several names per row are present) for easier merging
proteomics_named <- proteomics_named %>%
  separate_rows(Gene.Names, sep = " ")
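To show why the one-name-per-row layout helps with the later merge, here is a small sketch in base R on made-up data. The accessions, intensities, the `other_toy` table and its `symbol` and `score` columns are all hypothetical; only the split, de-duplicate, expand and merge pattern mirrors the steps above.

```r
# Toy proteomics table: one row per protein, space-separated candidate names.
# All values are invented for illustration.
proteomics_toy <- data.frame(
  From       = c("P00001", "P00002"),
  Gene.Names = c("Nup42 Nupl2 Nup42", "Abc1"),
  intensity  = c(10.5, 3.2),
  stringsAsFactors = FALSE
)

# Second data set keyed by a single gene symbol (hypothetical).
other_toy <- data.frame(
  symbol = c("Nupl2", "Xyz9"),
  score  = c(0.8, 0.1),
  stringsAsFactors = FALSE
)

# Split each name string into its unique names ...
name_list <- lapply(strsplit(proteomics_toy$Gene.Names, " ", fixed = TRUE), unique)

# ... then repeat each protein row once per name (one name per row).
long <- proteomics_toy[rep(seq_len(nrow(proteomics_toy)), lengths(name_list)), ]
long$Gene.Names <- unlist(name_list)
rownames(long) <- NULL

# Inner join: any of a protein's names can now match the other data set.
matched <- merge(long, other_toy, by.x = "Gene.Names", by.y = "symbol")
print(matched)
```

Here the first protein matches through its alternative name Nupl2 even though the other table does not know it as Nup42. Note that a protein can end up matched more than once if several of its names appear in the other data set, so it may be worth checking for duplicated accessions after the merge.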
This table from the Jackson Laboratory, which maps many mouse identifiers to external databases, should help. The column headers show which databases are covered.
https://www.informatics.jax.org/downloads/reports/MRK_Sequence.rpt
Looks good, but I cannot find either Nupl2 or Nup42 in the list!
I can see this in the list
UniProt also has an ID mapping tool that you may find useful: https://www.uniprot.org/id-mapping/
You are right! Excel just did not find it...
Hello,
This conversion tool has been useful to me, maybe it can help you:
https://biit.cs.ut.ee/gprofiler/convert
For this kind of analysis I always translate everything to Ensembl gene IDs first (biomaRt is a good help here) and then do the matching on those. All these different identifiers and names are a pain, and imo the gene ID is the only real universal constant (at least within the same gene annotation version).
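A rough sketch of what that translation step could look like with biomaRt, in case it helps. The wrapper name `uniprot_to_ensembl` is made up, and the mouse dataset and the attribute and filter names are my assumptions; please verify them with `biomaRt::listDatasets()`, `biomaRt::listAttributes()` and `biomaRt::listFilters()` for your organism and Ensembl release. It also only looks at Swiss-Prot (reviewed) accessions, so unreviewed entries would need a different attribute.

```r
# Sketch only: needs the Bioconductor package biomaRt and an internet connection.
# Dataset, attribute and filter names are assumptions for mouse; check them
# against listDatasets(), listAttributes() and listFilters() before relying on this.
uniprot_to_ensembl <- function(accessions,
                               dataset = "mmusculus_gene_ensembl") {
  # Connect to the Ensembl genes mart for the chosen organism
  mart <- biomaRt::useEnsembl(biomart = "genes", dataset = dataset)

  # Ask for the Ensembl gene ID and gene symbol of each UniProt accession
  biomaRt::getBM(
    attributes = c("uniprotswissprot", "ensembl_gene_id", "external_gene_name"),
    filters    = "uniprotswissprot",
    values     = unique(accessions),
    mart       = mart
  )
}

# Usage (not run here because it queries Ensembl):
# id_map   <- uniprot_to_ensembl(data$UniprotAccession)
# data_ens <- merge(data, id_map, by.x = "UniprotAccession", by.y = "uniprotswissprot")
# Then join data_ens with the other data set on ensembl_gene_id.
```

Once both data sets carry an `ensembl_gene_id` column, the matching is a plain `merge()` on that column, without any juggling of synonyms.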