Question

Mapping symbols to Ensembl IDs using mapIds(): NA results, multiVals, and what to do with Human Alternative sequence Genes

4.5 years ago

jack.henry ▴ 50

I am trying to run fgsea on TCGA data using gene sets from the Molecular Signatures Database. I downloaded the gene-symbol .gmt file and use mapIds() from AnnotationDbi to convert the symbols to Ensembl IDs, which is what the TCGA data uses.

1: My first problem is that mapIds() occasionally returns NA for some genes, and I am not sure why, because when I search for them on ensembl.org they do have an Ensembl ID. Is this something to do with transcript IDs? Is there a way to fix this? I can use gmtFile[gmtFile == "AC093012.1"] <- NA to temporarily get around the problem, but I know this is not best practice and would love to hear if somebody has a solution.

2: My other problem is that when I test whether the Ensembl IDs are in the TCGA data, I sometimes find that they are not there. I have noticed that this usually happens when the gene is a Human Alternative sequence Gene, or at least has a Human Alternative sequence Gene entry as well as the regular Human Gene entry. Again, I can delete the gene from the gene set since it is not in the dataset, but is this okay to do?

3: My final question is that mapIds() often reports a 1:many mapping between keys and columns. I guess this is because a symbol can have multiple Ensembl IDs. I have been using multiVals = "first" to keep just the first Ensembl ID for each gene, but is this okay, or should I extend the gene set to include all of the Ensembl IDs for that symbol?
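
For question 3, one alternative to multiVals = "first" is to request every match with multiVals = "list" and expand the gene set with all returned IDs. The following is only a minimal sketch of that idea, not a recommendation: expand_geneset() and toy_map are hypothetical names introduced here, and the toy mapping stands in for a real mapIds() result so that the example runs without the annotation package.

```r
# Expand a gene set of symbols into every Ensembl ID they map to.
# `symbol_to_ensembl` is assumed to be a named list shaped like the result of
# mapIds(..., multiVals = "list"): one character vector of IDs per symbol.
expand_geneset <- function(symbols, symbol_to_ensembl) {
  ids <- unlist(symbol_to_ensembl[symbols], use.names = FALSE)
  unique(ids[!is.na(ids)])  # drop unmapped symbols and duplicate IDs
}

# Toy stand-in for a mapIds() result (hypothetical object for illustration)
toy_map <- list(
  MUC2    = c("ENSG00000198788", "ENSG00000278466", "ENSG00000284971"),
  C4B_2   = "ENSG00000233312",
  UNKNOWN = NA_character_
)

expanded <- expand_geneset(c("MUC2", "C4B_2", "UNKNOWN"), toy_map)
print(expanded)
```

Whether expanding is preferable to keeping the first ID probably depends on which of those IDs are actually present in the expression data, so it may make sense to filter the expanded set against the dataset afterwards.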

Examples of genes for which mapIds() returns NA (with the Ensembl ID found on ensembl.org):

AC093012.1: ENSG00000257896
HBBP1: ENSG00000229988
MIR1-2: ENSG00000284453
MIR19B1: ENSG00000284375
MIR19B2: ENSG00000284107
MIR29B1: ENSG00000284203
MIR29B2: ENSG00000284203
MIR665: ENSG00000283159
SHLD2P3: ENSG00000189014
MEIS3P1: ENSG00000179277
C7ORF50: ENSG00000146540
C1ORF109: ENSG00000116922
C1ORF115: ENSG00000162817
CXORF38: ENSG00000185753
CSF2RBP1: ENSG00000232254
C1ORF174: ENSG00000198912
VENTXP7: ENSG00000236380
RBMS1P1: ENSG00000225422
FAM182B: ENSG00000175170
RBMY2AP: ENSG00000226092
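
Several symbols in the list above (C7ORF50, C1ORF109, CXORF38, ...) are fully upper case, whereas official HGNC symbols for open reading frame genes use a lower-case "orf" (for example C7orf50). As far as I know, SYMBOL keys are matched case-sensitively, so this may explain some of the NA results. This is an assumption and has not been checked against this gene set. A minimal sketch of normalising the case before mapping, where fix_orf_case() is a hypothetical helper introduced here:

```r
# Restore the lower-case "orf" in symbols such as C7ORF50 -> C7orf50.
# Symbols that do not match the pattern are returned unchanged.
fix_orf_case <- function(symbols) {
  sub("^(C[0-9XY]+)ORF([0-9]+)", "\\1orf\\2", symbols)
}

symbols <- c("C7ORF50", "C1ORF109", "CXORF38", "MIR665", "HBBP1")
fixed <- fix_orf_case(symbols)
print(fixed)
```

The corrected symbols could then be passed as keys to the same mapIds() call. This would not help with entries such as the MIR genes or pseudogenes, which presumably fail for a different reason.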

Examples of genes that I can't find in the TCGA dataset:

HLA-DRB4: ENSG00000227357 / ENSG00000227826 / ENSG00000231021
HLA-DRB3: ENSG00000230463 / ENSG00000231679 / ENSG00000196101
C4B_2: ENSG00000233312
MUC2: ENSG00000198788 / ENSG00000278466 / ENSG00000284971
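
A minimal sketch of the membership check described in question 2. It assumes the TCGA row names may carry Ensembl version suffixes (e.g. ".15"), which is an assumption about the data rather than something stated here. strip_version(), tcga_ids and geneset_ids are hypothetical names, and the version numbers in the toy data are made up.

```r
# Remove a trailing ".<version>" from Ensembl gene IDs, if present.
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# Toy stand-in for rownames() of a TCGA expression matrix (versions are made up)
tcga_ids <- c("ENSG00000198788.15", "ENSG00000146540.14", "ENSG00000116922.13")

# Ensembl IDs of one gene set after mapping from symbols
geneset_ids <- c("ENSG00000198788", "ENSG00000278466", "ENSG00000146540")

present <- intersect(geneset_ids, strip_version(tcga_ids))  # usable in the analysis
absent  <- setdiff(geneset_ids, strip_version(tcga_ids))    # not in the dataset

print(present)
print(absent)
```

Listing the absent IDs per gene set makes it easier to see whether they are all alternative-sequence entries or whether something else is being lost.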

The mapIds() call I am using:

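library(org.Hs.eg.db)
library(AnnotationDbi)

# currentgeneset: character vector of gene symbols from one gene set in the .gmt file
mapIds(
  x = org.Hs.eg.db,
  keys = currentgeneset,
  column = "ENSEMBL",   # ID type to return
  keytype = "SYMBOL",   # ID type of the keys
  fuzzy = TRUE,
  multiVals = "first"   # keep only the first Ensembl ID per symbol
)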

I know these questions have been asked a lot on here but I can't seem to find the answers I'm after.

Thanks in advance for any help!!

RNA-Seq TCGA R ensembl gsea
