3.0 years ago
Zahra
Hi all, I have raw counts for my samples in a data frame. The row names are Ensembl IDs and I want to convert them to gene symbols, so I ran the code below.
library(TCGAbiolinks)

# Query, download and prepare the TCGA-COAD HTSeq count data
query <- GDCquery(project = "TCGA-COAD",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts",
                  sample.type = c("Primary Tumor", "Solid Tissue Normal"),
                  experimental.strategy = "RNA-Seq")
GDCdownload(query)
query.counts.colon <- GDCprepare(query)

# Count matrix: rows are Ensembl gene IDs, columns are samples
ColonMatrix <- as.data.frame(SummarizedExperiment::assay(query.counts.colon))
ens <- row.names(ColonMatrix)
> length(ens)
[1] 56602
# Convert Ensembl IDs to gene symbols with org.Hs.eg.db
library(org.Hs.eg.db)
ens_to_symbol <- mapIds(
  org.Hs.eg.db,
  keys = ens,
  column = 'SYMBOL',
  keytype = 'ENSEMBL')
# Same conversion with biomaRt
library(biomaRt)
mart <- useDataset('hsapiens_gene_ensembl', useMart('ensembl'))
ens_to_symbol_biomart <- getBM(
  filters = 'ensembl_gene_id',
  attributes = c('ensembl_gene_id', 'hgnc_symbol'),
  values = ens,
  mart = mart)
# Left join so that every input ID is kept, even if biomaRt returned nothing for it
ens_to_symbol_biomart <- merge(
  x = as.data.frame(ens),
  y = ens_to_symbol_biomart,
  by.x = 'ens',
  by.y = 'ensembl_gene_id',
  all.x = TRUE)
> head(ens_to_symbol_biomart)
              ens hgnc_symbol
1 ENSG00000000003      TSPAN6
2 ENSG00000000005        TNMD
3 ENSG00000000419        DPM1
4 ENSG00000000457       SCYL3
5 ENSG00000000460    C1orf112
6 ENSG00000000938         FGR
But when I check for duplicated gene symbols, I get this:
> table(duplicated(ens_to_symbol_biomart$hgnc_symbol))
FALSE  TRUE
38446 18156
I don't know the reason for these duplicates. Should I remove the duplicated rows? Thanks for any help.
Past threads that may be useful:
Why am I getting different ensembl gene ids for a given gene symbol?
How to deal with the case that one gene symbol matches multiple ensembl ids?
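You might have duplicates in your Ensembl gene list. Or there might be many Ensembl IDs with blank gene symbols. I don't think there are 56k protein-coding genes in the human genome, so many of these IDs are probably non-coding genes or pseudogenes that have no HGNC symbol, and every blank after the first one is counted as a duplicate.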
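To tell the two explanations apart, something like the sketch below might help. It is only an illustration on a made-up toy table (`map`, with invented IDs and symbols), not output from the real data; on the actual data you would use `ens_to_symbol_biomart` in place of `map`. Note that 38446 + 18156 = 56602 matches `length(ens)`, which suggests the merged table has one row per Ensembl ID and points towards blank or missing symbols, but that is an inference from the posted numbers and has not been checked.

```r
# Toy stand-in for the merged biomaRt table (IDs and symbols are invented).
map <- data.frame(
  ens = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D", "ENSG_E", "ENSG_F", "ENSG_F"),
  hgnc_symbol = c("GENE1", "", NA, "", NA, "GENE2", "GENE3"),
  stringsAsFactors = FALSE
)

# 1. Is any Ensembl ID repeated (one ID mapped to several symbols)?
n_dup_ids <- sum(duplicated(map$ens))

# 2. How many rows have no symbol: NA (no match in the merge) or "" (blank from biomaRt)?
no_symbol <- is.na(map$hgnc_symbol) | map$hgnc_symbol == ""
n_blank <- sum(no_symbol)

# 3. Duplicates among real symbols only, ignoring blanks and NAs
n_dup_symbols <- sum(duplicated(map$hgnc_symbol[!no_symbol]))

# 4. One label per Ensembl ID: keep the first mapping, fall back to the
#    Ensembl ID when there is no symbol, then force uniqueness.
map1 <- map[!duplicated(map$ens), ]
blank1 <- is.na(map1$hgnc_symbol) | map1$hgnc_symbol == ""
labels <- make.unique(ifelse(blank1, map1$ens, map1$hgnc_symbol))

cat("repeated Ensembl IDs:", n_dup_ids, "\n")
cat("rows without a symbol:", n_blank, "\n")
cat("duplicated real symbols:", n_dup_symbols, "\n")
```

If most of the duplicates turn out to be blanks or NAs, dropping rows just because `duplicated()` flags them would discard many distinct genes. Keeping the Ensembl ID as the label for those rows (step 4) is one option; whether to keep, drop, or sum the counts for genes that genuinely share a symbol is a choice that depends on the downstream analysis.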