Hello,
I have a Myeloid-Acute Myeloid Leukemia (AML) RNA-seq data file, data_mrna_seq_rpkm.csv. This file has Hugo symbols for all 22,844 genes but not their Entrez IDs. I tried two methods in R, 1) the org.Hs.eg.db mapIds method and 2) the biomaRt method, but could get Entrez IDs for only 16,569 genes from their Hugo symbols; the remaining 6,275 Hugo symbols returned 'NA'. How can I replace the 'NA' values with their respective Entrez IDs? Please guide me on getting the Entrez IDs of all 22,844 genes (Hugo symbols).
1. Using the library(org.Hs.eg.db) mapIds method
library(org.Hs.eg.db)
# Read your CSV data
data <- read.csv("data_mrna_seq_rpkm.csv", stringsAsFactors=FALSE)
# Get the mapping
entrez_ids <- mapIds(org.Hs.eg.db,
                     keys = data$Hugo_Symbol,
                     column = "ENTREZID",
                     keytype = "SYMBOL",
                     multiVals = "first")
# Add the Entrez IDs to your data frame
data$Entrez_Gene_Id <- entrez_ids
write.csv(data, "updated_data_mrna_seq_rpkm.csv", row.names = FALSE)
2. Using biomaRt
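# Install and load necessary libraries
# (biomaRt is a Bioconductor package, so it is installed via BiocManager, not install.packages)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("biomaRt")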
library(biomaRt)
library(readr)
# Read the CSV file
df <- read_csv("updated_data_mrna_seq_rpkm.csv")
# Connect to the Ensembl BioMart database
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# Get the Hugo Symbols with NA Entrez_Gene_Id
hugo_symbols_with_na <- df$Hugo_Symbol[is.na(df$Entrez_Gene_Id)]
# Optional: list the available attributes (e.g. to confirm the name 'entrezgene_id')
listAttributes(mart)
# Fetch their Entrez IDs from BioMart
genes <- getBM(attributes = c('hgnc_symbol', 'entrezgene_id'),
               filters = 'hgnc_symbol',
               values = hugo_symbols_with_na,
               mart = mart)
# Replace NA values in the original dataframe with fetched Entrez IDs
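# seq_len() avoids a bogus 1:0 loop when getBM() returns no rows;
# the column returned by getBM() is 'entrezgene_id', matching the requested attribute
for (i in seq_len(nrow(genes))) {
  mask <- df$Hugo_Symbol == genes$hgnc_symbol[i] & is.na(df$Entrez_Gene_Id)
  df$Entrez_Gene_Id[mask] <- genes$entrezgene_id[i]
}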
# Save the updated dataframe back to CSV
write_csv(df, "updated_data_mrna_seq_rpkm_updated.csv")
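For reference, the replacement loop can also be sketched in vectorized form with match(). This is only a minimal sketch: the toy `df` and `genes` below are made-up stand-ins with the same column names as the real objects, and the IDs are not real Entrez IDs. As in the loop, when BioMart returns several rows for one symbol, the first one is used.

```r
# Toy stand-ins for `df` and the getBM() result `genes` (all values are made up)
df <- data.frame(Hugo_Symbol = c("GENE_A", "GENE_B", "GENE_C"),
                 Entrez_Gene_Id = c(101L, NA, NA),
                 stringsAsFactors = FALSE)
genes <- data.frame(hgnc_symbol = c("GENE_B", "GENE_B"),
                    entrezgene_id = c(202L, 203L),
                    stringsAsFactors = FALSE)

# Rows that still lack an Entrez ID
missing <- is.na(df$Entrez_Gene_Id)

# match() gives the first matching row of `genes` for each symbol, or NA if none
idx <- match(df$Hugo_Symbol[missing], genes$hgnc_symbol)

# Indexing with NA yields NA, so symbols that were not found simply stay NA
df$Entrez_Gene_Id[missing] <- genes$entrezgene_id[idx]
```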
Can you provide some examples of Hugo symbols you are unable to convert?
Yes, sure. Some of the gene symbols are BZRAP1, C19orf60, TCEB3, and so on.
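Symbols like these look like older or alternative names rather than current official symbols, which would explain why a lookup with keytype = "SYMBOL" returns NA for them. One thing that may be worth trying, as a rough sketch only, is a second pass over the NA rows using the "ALIAS" keytype of org.Hs.eg.db. The helper name `fill_from_alias` and its arguments are hypothetical and written for this illustration; the mapIds() call runs only if org.Hs.eg.db is installed.

```r
# Hypothetical helper: fill NA Entrez IDs using a second lookup function.
# `alias_lookup` takes a character vector of symbols and returns a
# same-length vector of Entrez IDs (NA where nothing is found).
fill_from_alias <- function(symbols, entrez, alias_lookup) {
  missing <- is.na(entrez)
  if (any(missing)) {
    entrez[missing] <- unname(alias_lookup(symbols[missing]))
  }
  entrez
}

# Lookup through the ALIAS keytype, run only if org.Hs.eg.db is available
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  alias_lookup <- function(x) {
    AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
                          keys = x,
                          column = "ENTREZID",
                          keytype = "ALIAS",
                          multiVals = "first")
  }
  # The three example symbols from this thread
  print(alias_lookup(c("BZRAP1", "C19orf60", "TCEB3")))
}
```

With the data frame from the question, this would be used as `data$Entrez_Gene_Id <- fill_from_alias(data$Hugo_Symbol, data$Entrez_Gene_Id, alias_lookup)`. Two caveats: an alias can point to more than one gene, so multiVals = "first" may pick the wrong one and the filled rows are worth spot-checking; and some symbols may have been withdrawn with no Entrez ID at all, so a complete mapping of all 22,844 symbols is not guaranteed.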
Using EntrezDirect (LINK):