Conversion of Gene Name to Ensembl ID
5 months ago

Hi,

I have a problem with gene names in my dataset. Some genes have different names. For example, the gene ENSG00000142513 is called ACPT in my dataset, but common mapping libraries call it ACP4.

I have looked at several discussions on BioStar and found posts with similar issues. However, even after trying the suggested solutions, I haven't been able to fix my problem.

I am seeking advice or tools to correctly map these gene names to their Ensembl IDs or standardize the gene names across different datasets.

Thanks

Ensembl RNA-Seq • 1.1k views

Have a look at geneSynonym


"I have looked at several discussions on BioStar and found posts with similar issues. However, even after trying the suggested solutions, I haven't been able to fix my problem."

Show us what you've tried and the problems you ran into. Any suggestion in this post will likely repeat one of those methods, and unless you describe your problems in detail, we cannot suggest a viable way for you to map one form of gene identifier to another.

5 months ago
LauferVA 4.5k

Posting this as a separate answer because specifying a software solution may solve a procedural problem without addressing gaps in future readers' understanding regarding databases of "gene names".

Problem 1: Ambiguity of the term "Gene Name" and the Importance of Context

There is no "right" or "wrong" gene name unless a target database is specified. Here, the term "gene name" is ambiguous for that reason: until we know exactly what the target database is, the "correct" answer may vary. So, to help you, we need you to be as specific about the target database as you are about the source identifier (which you do a good job of specifying: the Ensembl gene ID ENSG00000142513).

Ask yourself this - how does a reader here at Biostars know which of these you want:

  • Official Gene Symbol according to HGNC
  • NCBI Gene / RefSeq
  • UniProt KB gene name
  • GeneCards gene name
  • The gene name according to some specialty database (KEGG, GTEx, PANTHER, Gencode, Reactome, OMIM, the list is long ...)

It may seem like this is nit-picking, but the issue is not trivial or pedantic: the best answer could change if we know you are a genetic counselor and OMIM is important to the clinical report you are writing ... but the reader depends on you to know that.

Problem 2: Even if you specify a target database, the database version matters

Even if you unambiguously identify both the source identifier and the desired target identifier, you may still correctly (!!!) generate ACPT instead of ACP4, or vice versa, depending on the version of the database used.

Let's return to the clinical example. Clinical testing software is frequently old and out of date because the regulatory approval process for clinical testing is cumbersome. So, suppose your version of R is old, and the software package has deliberately not been updated in years, to avoid re-validation. In this case, even if the current term for the gene is ACP4, the software may correctly (!!!) return ACPT because, according to the database version being used (whether knowingly or unknowingly), that is the correct term.
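
To make the version effect concrete, here is a minimal offline sketch in R. The two tables are hypothetical stand-ins for the same mapping taken from an older and a newer annotation release; only the ACPT to ACP4 rename comes from this thread, and `lookup_symbol` is an illustrative helper, not part of any package.

```r
# Hypothetical snapshots of one ID-to-symbol table from two annotation releases.
# Only the ACPT -> ACP4 rename is taken from this thread.
old_release <- data.frame(ensembl_gene_id = "ENSG00000142513", symbol = "ACPT",
                          stringsAsFactors = FALSE)
new_release <- data.frame(ensembl_gene_id = "ENSG00000142513", symbol = "ACP4",
                          stringsAsFactors = FALSE)

# Illustrative helper: look up symbols for a vector of IDs in a given snapshot.
lookup_symbol <- function(ids, release) {
  release$symbol[match(ids, release$ensembl_gene_id)]
}

id <- "ENSG00000142513"
symbol_old <- lookup_symbol(id, old_release)  # answer under the older snapshot
symbol_new <- lookup_symbol(id, new_release)  # answer under the newer snapshot
```

Both lookups are "correct" with respect to their own snapshot, which is why recording the database version alongside the mapping matters.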

5 months ago
BioinfGuru ★ 2.1k

With the biomaRt R package you can map between Ensembl gene ID, HGNC gene symbol, external gene name, and Entrez ID. Just make sure you select the same Ensembl version as the genome annotation used for mapping.
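
A minimal sketch of what that could look like, assuming the biomaRt package is installed. `map_ensembl_ids` is a hypothetical wrapper name, and `release` is a number you have to supply yourself (it is not given in this thread). The calls are wrapped in a function so nothing is downloaded until you call it.

```r
# Sketch only: wraps biomaRt calls in a function, so defining it needs no
# network access; the queries run only when the function is called.
map_ensembl_ids <- function(ids, release) {
  # `release` should be the Ensembl release number your annotation came from
  # (an assumption the caller must supply).
  mart <- biomaRt::useEnsembl(biomart = "genes",
                              dataset = "hsapiens_gene_ensembl",
                              version = release)
  biomaRt::getBM(filters = "ensembl_gene_id",
                 attributes = c("ensembl_gene_id", "hgnc_symbol",
                                "external_gene_name", "entrezgene_id"),
                 values = ids,
                 mart = mart)
}

# Usage (needs internet access and the biomaRt package):
# map_ensembl_ids("ENSG00000142513", release = <your release number>)
```

`biomaRt::listEnsemblArchives()` lists the releases that can be requested this way. Running the same query against two different releases is a quick way to check whether a symbol mismatch is just a version difference.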


Thank you for the suggestion. I have tried using the biomaRt R-package as you recommended, but unfortunately, it did not resolve the issue for my specific example. I would greatly appreciate any additional advice or alternative tools that might help standardize the gene names or accurately map them to their Ensembl IDs. Thank you for your assistance.


ENSG00000142513

ACPT is a discontinued HGNC symbol, replaced by ACP4. It is the same gene. So in this instance, you can overwrite ACPT with ACP4. But I would investigate how an old symbol entered your dataset in case there is something else going on.
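
If several old symbols turn up, the overwrite can be scripted rather than done by hand. This is a rough sketch in base R: `previous_to_current` is a hypothetical lookup holding only the one rename mentioned here (in practice it would presumably be built from HGNC's list of previous symbols), and `dataset_symbols` is made-up example input.

```r
# Hypothetical lookup: discontinued symbol -> current symbol.
# Only the ACPT -> ACP4 entry comes from this thread.
previous_to_current <- c(ACPT = "ACP4")

# Illustrative helper: replace any symbol found in the lookup, keep the rest.
update_symbols <- function(symbols, lookup) {
  hit <- symbols %in% names(lookup)
  symbols[hit] <- unname(lookup[symbols[hit]])
  symbols
}

dataset_symbols <- c("ACPT", "TP53", "BRCA1")  # made-up example input
updated <- update_symbols(dataset_symbols, previous_to_current)

# Keep a record of what was changed, to help trace how old symbols got in.
changed <- dataset_symbols[dataset_symbols != updated]
```

Keeping the `changed` vector around gives you a starting point for the investigation suggested above.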


This is a good solution. Posted a supplemental answer below.

5 months ago
dsull ★ 7.0k

Some other things you might find useful:

5 months ago
# this code does the opposite (Ensembl ID -> gene symbol), maybe it helps..

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

if (!requireNamespace("biomaRt", quietly = TRUE))
  BiocManager::install("biomaRt")
library("biomaRt")
library("readr")  # read_delim(), write_delim()
library("dplyr")  # left_join()

# to have more information: browseVignettes("biomaRt")
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))

# control1..control4, cancer1..cancer14
samples <- c(paste0("control", 1:4), paste0("cancer", 1:14))

for (sample in samples) {
  pancreas <- read_delim(paste0("pancreas.", sample, ".FPKM.txt"), col_names = FALSE, delim = "\t")
  names(pancreas) <- c("ensembl_ID", "FPKM")

  # in case the IDs are GENCODE-style (with a ".N" version suffix), this mostly works;
  # plain Ensembl IDs are left alone
  pancreas$ensembl_ID <- sub("[.][0-9]*", "", pancreas$ensembl_ID)

  # look up the HGNC symbol for every Ensembl gene ID in this sample
  gene_IDs <- getBM(filters = "ensembl_gene_id", attributes = c("ensembl_gene_id", "hgnc_symbol"),
                    values = pancreas$ensembl_ID, mart = mart)
  names(gene_IDs) <- c("ensembl_ID", "gene_ID")
  pancreas <- left_join(pancreas, gene_IDs, by = "ensembl_ID")
  write_delim(pancreas, paste0("pancreas.", sample, ".FPKM_with_geneID.tsv"), delim = "\t")
}

This code will not run on anyone else's machine. Here's a snippet derived from your code that will:

library("biomaRt")
library("readr")  # write_delim()

mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))

ensembl_ids <- c( ... ) # vector of Ensembl IDs
ensembl_ids <- sub("[.][0-9]*", "", ensembl_ids) # strip ".N" version suffix, if any

gene_IDs <- getBM(filters = "ensembl_gene_id", attributes = c("ensembl_gene_id", "hgnc_symbol"),
                  values = ensembl_ids, mart = mart)
names(gene_IDs) <- c("ensembl_ID", "gene_ID")
write_delim(gene_IDs, "ensembl-hgnc.mapped.tsv", delim = "\t")
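
The version-stripping step used in both snippets can be checked offline before sending anything to biomaRt. A small sketch follows; the `.12` version suffix and the third ID are made up for illustration, and the `_PAR_Y` form is my assumption about what some GENCODE releases emit.

```r
# Example inputs: a versioned ID, an unversioned ID, and a made-up ID with a
# trailing tag after the version (assumed GENCODE-style, hypothetical).
ids <- c("ENSG00000142513.12", "ENSG00000142513", "ENSG00000000001.1_PAR_Y")

# Same pattern as above: remove the first "." and the digits that follow it.
stripped <- sub("[.][0-9]*", "", ids)
```

The first two inputs collapse to the same plain Ensembl ID, while the tagged one keeps its `_PAR_Y` suffix and would not match the `ensembl_gene_id` filter, which may be one reason the original comment says this "mostly works".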