Question

Integration of transcriptomics and proteomics: difficult matching names

0

8 months ago

ntsopoul ▴ 60

Hi to all,

I want to integrate proteomics data and transcriptomic data but I have problems finding common identifiers.

For proteomics, I have the UniProt ID and Gene_name (from UniProt); for mRNA-seq, I have gene_id (mm9.refGene). I looked into the mRNA-seq .gtf file, and the only alternative identifier there is a transcript ID.

To illustrate the problem:

Nup42 (UniProt Gene_name) does not match Nupl2 (mRNA-seq gene_id), although both refer to the same gene.

Is there a way to convert the Gene_id from mRNA-seq to Uniprot Gene_name?

What are the best common identifiers for mRNA and protein data?

rna-seq tmt nomenclature proteomics • 901 views

ADD COMMENT • link 8 months ago by ntsopoul ▴ 60

1

This table from the Jackson Laboratory, which cross-references mouse identifiers with many external databases, will help. Check the column headers to find the various databases.

https://www.informatics.jax.org/downloads/reports/MRK_Sequence.rpt

ADD REPLY • link 8 months ago by GenoMax 148k

0

Looks good, but I cannot find either Nupl2 or Nup42 in the list!

ADD REPLY • link 8 months ago by ntsopoul ▴ 60

1

I can see this entry in the list:

MGI:2387631  Nup42   O   Gene    nucleoporin 42  10.72   5   24369961    24389011    +   AA867018|AB067574|AI426644|AK077066|AK078478|AK183925|AK194837|AK207606|AK216216|AV116043|BB396657|BC033270|BQ442924    NM_001346582|NM_153092|XM_030254373 ENSMUST00000049887|ENSMUST00000115101|ENSMUST00000124150|ENSMUST00000147392 Q8CIC2  A0A0R4J1K6|E9QL43   ENSMUSP00000062766|ENSMUSP00000110753   NP_001333511|NP_694732|XP_030110233     protein coding gene

UniProt also has an ID mapping tool that you may find useful: https://www.uniprot.org/id-mapping/
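
Since the mRNA-seq .gtf carries a transcript ID, one option is to search the report by RefSeq transcript ID in R instead of in a spreadsheet. The following is a minimal sketch on a toy table, not the real file: the values are copied from the Nup42 row above, while the column names (`symbol`, `refseq`, `uniprot`) and the choice of `NM_153092` as the query are assumptions made for illustration. The actual report would be loaded with something like `read.delim()`, and its headers should be checked before use.

```r
# Toy stand-in for MRK_Sequence.rpt; values copied from the Nup42 row above.
# Column names are illustrative, not the report's actual headers.
mgi <- data.frame(
  symbol  = "Nup42",
  refseq  = "NM_001346582|NM_153092|XM_030254373",
  uniprot = "Q8CIC2",
  stringsAsFactors = FALSE
)

# Split the pipe-delimited field so that each RefSeq ID gets its own row
# and can be matched exactly.
ids <- strsplit(mgi$refseq, "|", fixed = TRUE)
refseq_long <- data.frame(
  refseq  = unlist(ids),
  symbol  = rep(mgi$symbol, lengths(ids)),
  uniprot = rep(mgi$uniprot, lengths(ids)),
  stringsAsFactors = FALSE
)

# Look up one transcript ID (assumed here to be the one in the .gtf).
hit <- refseq_long[refseq_long$refseq == "NM_153092", ]
print(hit)
```

Exact matching on a split column avoids the partial-string search that a spreadsheet would do over a long pipe-delimited cell.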

ADD REPLY • link 8 months ago by GenoMax 148k

0

You are right! Excel just did not find it...

ADD REPLY • link 8 months ago by ntsopoul ▴ 60

1

Hello,

This conversion tool has been useful to me; maybe it can help you too:

https://biit.cs.ut.ee/gprofiler/convert

ADD REPLY • link 8 months ago by sansan96 ▴ 140

0

For this kind of analysis I always translate everything to Ensembl gene IDs first (biomaRt is a great help here) and then match on those. The different identifiers and names are a pest, and imo the gene ID is the only real universal constant (at least within the same gene annotation version).
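
A minimal sketch of that matching step, assuming both data sets have already been translated to Ensembl gene IDs. The two mapping tables are hand-made stand-ins for what a tool such as biomaRt would return; the Ensembl IDs, `GeneB`, and `P_PLACEHOLDER` are placeholders, not real accessions, and the column names are hypothetical.

```r
# Hypothetical ID-to-Ensembl mapping tables (in practice, e.g. from biomaRt).
# The Ensembl IDs below are placeholders, not real accessions.
rna_map <- data.frame(
  gene_id         = c("Nupl2", "GeneB"),
  ensembl_gene_id = c("ENSMUSG_PLACEHOLDER_1", "ENSMUSG_PLACEHOLDER_2"),
  stringsAsFactors = FALSE
)
prot_map <- data.frame(
  uniprot         = c("Q8CIC2", "P_PLACEHOLDER"),
  ensembl_gene_id = c("ENSMUSG_PLACEHOLDER_1", "ENSMUSG_PLACEHOLDER_3"),
  stringsAsFactors = FALSE
)

# Match on the shared Ensembl gene ID rather than on gene symbols,
# so a renamed symbol (Nupl2 vs. Nup42) no longer matters.
matched <- merge(rna_map, prot_map, by = "ensembl_gene_id")
print(matched)
```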

ADD REPLY • link 8 months ago by ATpoint 86k

score 0 · Answer 1 · 2024-04-16

I found a good solution but it is really cumbersome. I use UniProt.ws to derive unique gene names (readable names) and also ambiguous names. I then collapse the columns of unique and ambiguous names into one column, so I have all possible combinations per gene/protein. Later, when I want to integrate with another data set, I separate each possible name into one row so any potential match will work.

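# Load the proteomics table (placeholder path); it must contain a UniprotAccession column.
# read.delim() is an assumption: use whichever reader fits your file format.
data <- read.delim("data/to/protoemics/")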

# Find gene names corresponding to UniProt IDs
library(UniProt.ws)
library(dplyr)  # provides %>% and mutate()
library(tidyr)  # provides separate_rows()
new_names <- mapUniProt("UniProtKB_AC-ID", "Gene_Name", query =data$UniprotAccession) #unique gene names (Uniprot)
new_names2 <- mapUniProt("UniProtKB_AC-ID", "UniProtKB", query =data$UniprotAccession) #UniProtKB contains a lot of info and multiple protein names
new_names3 <- merge(new_names, new_names2[c("From", "Length","Gene.Names")], by.x = "From", by.y="From")
proteomics_named <- merge(new_names3, data, by.x="From", by.y="UniprotAccession")

# Combining two columns with a space (Gene.Names contains only ambiguous gene names and needs to be combined with the unique gene names columns)
proteomics_named <- proteomics_named %>%
  mutate(Gene.Names = paste(Gene.Names, To, sep = " "))

# Duplicated gene names may now occur within a cell; this function removes them
remove_duplicates <- function(text) {
  words <- strsplit(text, " ")[[1]]  # Split the string into words
  unique_words <- unique(words)      # Remove duplicate words
  paste(unique_words, collapse = " ")  # Collapse into a single string
}

# Apply the function to the Gene.Names column to remove duplicated names
# (vapply guarantees one unnamed string per row)
proteomics_named <- proteomics_named %>%
  mutate(Gene.Names = vapply(Gene.Names, remove_duplicates, character(1), USE.NAMES = FALSE))

# Separate the gene names so that each row holds exactly one name, which makes merging easier
proteomics_named <- proteomics_named %>%
  separate_rows(Gene.Names, sep = " ")
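
The later integration step could then look roughly like the sketch below. It uses toy data and base R `merge()` rather than the real tables. It assumes that the name column for Q8CIC2 contains both Nup42 and Nupl2, and the RNA-seq columns (`gene_id`, `log2fc`) and their values are made up for illustration.

```r
# Toy long-format proteomics table, shaped like the output of separate_rows():
# one row per (accession, candidate gene name). Values are illustrative.
prot_long <- data.frame(
  From       = c("Q8CIC2", "Q8CIC2"),
  Gene.Names = c("Nup42", "Nupl2"),
  stringsAsFactors = FALSE
)

# Toy RNA-seq table keyed by the refGene symbol (hypothetical columns/values).
rna <- data.frame(
  gene_id = c("Nupl2", "GeneB"),
  log2fc  = c(0.8, -1.2),
  stringsAsFactors = FALSE
)

# Any synonym that appears in both tables produces a match.
merged <- merge(prot_long, rna, by.x = "Gene.Names", by.y = "gene_id")

# Keep one row per protein in case several synonyms matched the same gene.
merged <- merged[!duplicated(merged$From), ]
print(merged)
```

The de-duplication at the end matters because the long format can otherwise count the same protein more than once after the join.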