Question

Single Cell Gene Count Matrix with Ensembl IDs as Rownames. Need to convert to Gene Names.

2

2.0 years ago

achamess ▴ 90

I know this question has been asked in various forms before, and it seems straightforward, but I can't figure out how to get it to work. I've tried various things and spent a lot of time on it.

I have a gene count matrix with cells as columns and mouse Ensembl IDs as row names.

                [,1]
ENSMUSG00000104352    0
ENSMUSG00000104046    0
ENSMUSG00000102907    0
ENSMUSG00000025905    0
ENSMUSG00000103936    0
ENSMUSG00000093015    0

I tried something like this

rownames(counts) <- mapIds(org.Mm.eg.db, keys = rownames(counts), column = "SYMBOL", keytype = "ENSEMBL", multiVals = "first")

But the issue I run into is that I get many NAs, because not every Ensembl ID maps to a gene name. Also, some gene names have multiple Ensembl IDs mapping to them.

So if I run the code above, I get this output:

       [,1]
<NA>       0
Gm26206    0
Xkr4       0
Gm18956    0
<NA>       0
<NA>       0
<NA>       0
<NA>       0
<NA>       0
Gm7341     0

I saw this response, which keeps the Ensembl ID when the gene name is NA, but it didn't work for me because some gene names are duplicated and the matrix can't have duplicate row names.

R: converting Ensembl row names to Symbol ID outputs missing values in 'row.names' are not allowed

Can someone point me in the right direction on how to deal with the NAs and duplicates?

The goal is to replace the row names with gene names, so that I don't have to keep looking up Ensembl IDs during my downstream Seurat work.
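Before renaming anything, it can help to measure how many IDs fall into each of the two problem categories described above. The following is a minimal base R sketch, not part of the original post. The `symbols` vector is a hypothetical toy stand-in for the named vector that `mapIds()` returns, and the IDs and gene names in it are invented for illustration.

```r
# Toy stand-in for the output of mapIds(): names are IDs, values are symbols.
# All IDs and symbols here are hypothetical.
symbols <- c(ID1 = "GeneA", ID2 = NA, ID3 = "GeneB", ID4 = "GeneB", ID5 = NA)

# Number of IDs that did not map to any symbol
n_missing <- sum(is.na(symbols))

# Symbols shared by more than one ID; table() drops NA by default
sym_counts <- table(symbols)
dup_symbols <- names(sym_counts)[sym_counts > 1]

n_missing    # 2
dup_symbols  # "GeneB"
```

With real data, `symbols` would be the result of the `mapIds()` call, kept in its own variable rather than assigned straight to the row names.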

ensembl single_cell genomics • 3.3k views

ADD COMMENT • link updated 2.0 years ago by rpolicastro 13k • written 2.0 years ago by achamess ▴ 90

0

"because not every Ensembl ID is unique."

You have duplicate IDs in your matrix?

ADD REPLY • link 2.0 years ago by GenoMax 147k

0

Sorry. I'll change the phrasing. Every Ensembl ID is unique but multiple Ensembl IDs map to the same gene name.

ADD REPLY • link 2.0 years ago by achamess ▴ 90

0

You may want to take a look at "Multiple ensembl gene ID for the same gene name (Symbol), how to deal with this while differential analysis?" and the comments/links within.

ADD REPLY • link 2.0 years ago by GenoMax 147k

score 2 · Answer 1 · 2022-11-16

2

2.0 years ago

rpolicastro 13k

Assuming you have a data.frame df with gene_name and gene_id columns, you could use the gene name when it exists and is not duplicated, and otherwise fall back to the gene ID.

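# Gene names that occur more than once (table() ignores NA by default)
dup_genes <- names(table(df$gene_name)[table(df$gene_name) > 1])

# Use the gene ID when the name is missing or duplicated, else the gene name
df$feature <- ifelse(is.na(df$gene_name) | df$gene_name %in% dup_genes, df$gene_id, df$gene_name)
rownames(df) <- df$gene_id

# Look up each row of counts by its gene ID and swap in the chosen label
rownames(counts) <- df[rownames(counts), ]$feature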
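To see what this does on a small case, here is a self-contained sketch that runs the same steps on toy data. The IDs, gene names, and cell names are hypothetical. In practice gene_name would come from something like the `mapIds()` call in the question.

```r
# Hypothetical annotation table: one missing name, one duplicated name
df <- data.frame(
  gene_id   = c("ID1", "ID2", "ID3", "ID4"),
  gene_name = c("GeneA", NA, "GeneB", "GeneB"),
  stringsAsFactors = FALSE
)

# Toy count matrix with the gene IDs as row names
counts <- matrix(0, nrow = 4, ncol = 2,
                 dimnames = list(df$gene_id, c("cell1", "cell2")))

# Same logic as above: fall back to the ID for NA or duplicated names
dup_genes <- names(table(df$gene_name)[table(df$gene_name) > 1])
df$feature <- ifelse(is.na(df$gene_name) | df$gene_name %in% dup_genes,
                     df$gene_id, df$gene_name)
rownames(df) <- df$gene_id
rownames(counts) <- df[rownames(counts), "feature"]

rownames(counts)  # "GeneA" "ID2" "ID3" "ID4"
```

Note that every copy of a duplicated name falls back to its ID, so the resulting row names are unique and contain no NA.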

ADD COMMENT • link 2.0 years ago by rpolicastro 13k

0

Thank you for putting me out of my misery :D Good to see the approach. It worked. Here is my complete code. Made a few changes.

counts_df <- as.data.frame(counts)

counts_df$gene_name <- mapIds(org.Mm.eg.db,keys=rownames(counts),column="SYMBOL",keytype="ENSEMBL",multiVals="first")

counts_df$gene_id <- rownames(counts_df)

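# Keep a character vector of the duplicated names (not the data.frame rows), so that %in% below compares gene names
dup_genes <- unique(counts_df$gene_name[duplicated(counts_df$gene_name)])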

counts_df$feature <- ifelse((is.na(counts_df$gene_name) | counts_df$gene_name %in% dup_genes), counts_df$gene_id, counts_df$gene_name)

rownames(counts) <- counts_df[rownames(counts), ]$feature
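If you would rather keep the gene name for duplicated genes instead of reverting them to Ensembl IDs, one alternative (a different technique from the answer above) is base R's `make.unique()`, which appends `.1`, `.2`, and so on to repeated values. A minimal sketch on hypothetical toy vectors:

```r
# Hypothetical IDs and mapped names: one missing, one duplicated
gene_id   <- c("ID1", "ID2", "ID3", "ID4")
gene_name <- c("GeneA", NA, "GeneB", "GeneB")

# Fall back to the ID only where no gene name exists
feature <- ifelse(is.na(gene_name), gene_id, gene_name)

# Disambiguate repeated names by adding numeric suffixes
feature <- make.unique(feature)

feature  # "GeneA" "ID2" "GeneB" "GeneB.1"
```

The trade-off is that the suffixed names no longer match official symbols exactly, and which copy gets the plain name depends on row order.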

ADD REPLY • link 2.0 years ago by achamess ▴ 90