I'm trying to convert Ensembl IDs to Gene symbols within a summarized experiment object (more or less an expression matrix) using BioMart.
library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ENSEMBL_MART_ENSEMBL"))
genes <- rownames(gse_cellgenefiltered_cohort1)
G_list <- getBM(filters = "ensembl_gene_id", attributes = c("ensembl_gene_id", "hgnc_symbol"), values = genes, mart = mart)
For some reason, there is a discrepancy between the number of Ensembl IDs I supply BioMart with and the number of Ensembl IDs it returns.
length(rownames(gse_cellgenefiltered_cohort1))
[1] 23395
length(G_list$ensembl_gene_id)
[1] 23316
Another thing I noticed is that BioMart returns some Ensembl IDs more than once.
length(unique(G_list$ensembl_gene_id))
[1] 23314
I don't think there are any duplicated Ensembl IDs in the expression matrix.
length(unique(rownames(gse_cellgenefiltered_cohort1)))
[1] 23395
Would anyone know why this might be happening?
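A quick way to narrow this down is to list exactly which IDs were dropped and which came back more than once. The sketch below uses made-up IDs and symbols in place of the real objects (only the names genes and G_list are taken from the code above), so treat it as an illustration rather than a diagnosis. In general, getBM() only returns rows for IDs it finds in the current dataset, and it returns one row per distinct combination of the requested attributes.

```r
# Toy stand-ins for the real objects; all IDs and symbols are invented.
genes <- c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D")
G_list <- data.frame(
  ensembl_gene_id = c("ENSG_A", "ENSG_B", "ENSG_B", "ENSG_C"),
  hgnc_symbol     = c("SYM1", "SYM2", "SYM2B", ""),
  stringsAsFactors = FALSE
)

# IDs that were supplied but do not appear in the result at all
missing_ids <- setdiff(genes, G_list$ensembl_gene_id)

# IDs that appear on more than one result row
dup_ids <- unique(G_list$ensembl_gene_id[duplicated(G_list$ensembl_gene_id)])

# Show the rows for the duplicated IDs to see which attribute differs
print(G_list[G_list$ensembl_gene_id %in% dup_ids, ])
```

Looking up a few of the missing IDs by hand on the Ensembl website may show whether they are absent from the release that the mart is querying.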
Hello, I am trying to convert RefSeq IDs to gene symbols using the biomaRt R package. I followed the script below to align the input entries with the output. Surprisingly, I provided 330655 RefSeq IDs (Ensembl.ids$v1), but biomaRt returns 344267 RefSeq entries (merged$v1). I am not sure what I am missing here. Please see the script here and help me figure out how this duplication of RefSeq and gene_name output can be fixed.
Please do not post screenshots (use text and the 101 button to format it as code). Screenshots do not allow people to copy text for testing. No one is going to type things from a screenshot manually.
Thanks. I have updated the post. Can I add input data if anyone wants to replicate the issue I am having?
What you are missing is that the IDs do not map 1:1. RefSeq, Entrez Gene and Ensembl are separate corpora of human genome annotations, and sometimes one ID in Ensembl maps to multiple IDs in RefSeq (and vice versa). In that case, BioMart repeats that ID and outputs one row per match.
You can test that by running
dim(output.mappings[!duplicated(output.mappings),])
and e.g.
dim(output.mappings[!duplicated(output.mappings$refseq_mrna),])
The more attributes you request from getBM(), the more duplication you will generally see (if you request e.g. GO terms, you might get hundreds of rows per ensembl.id). How you deal with that downstream is up to you.
If you wish to use perfectly harmonized mappings between RefSeq and Ensembl, you need to restrict yourself to the MANE corpus.
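To illustrate the duplication check and one possible way of handling it downstream, here is a minimal sketch on a toy table. The IDs and gene names are invented, and the column names refseq_mrna and external_gene_name are assumptions about what was requested from getBM(). Collapsing symbols into one string is only one option; keeping the first match or dropping ambiguous IDs are equally valid choices.

```r
# Toy mapping table standing in for the getBM() result; values are invented.
output.mappings <- data.frame(
  refseq_mrna        = c("NM_1", "NM_1", "NM_2", "NM_3", "NM_3"),
  external_gene_name = c("GENEA", "GENEA2", "GENEB", "GENEC", "GENEC"),
  stringsAsFactors = FALSE
)

# Drop rows that are identical in every column
dedup <- unique(output.mappings)

# RefSeq IDs that still map to more than one gene name
n_per_id <- table(dedup$refseq_mrna)
multi <- names(n_per_id)[n_per_id > 1]

# One possible policy: collapse to a single row per RefSeq ID,
# joining multiple gene names with ";"
collapsed <- aggregate(external_gene_name ~ refseq_mrna, data = dedup,
                       FUN = function(x) paste(x, collapse = ";"))
print(collapsed)
```

After collapsing, merging back onto the input IDs yields at most one row per input ID, so the row counts line up again.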
Would you mind providing the corrected script? I tried, but it does not give the harmonized mapping between RefSeq and Ensembl.
I already linked the MANE website above. Navigate to Accessing MANE data and download this file. It is an equivalence table and contains the information you are looking for.
Read it into R with read.delim() or fread() or whatever function you prefer and use that as output.mappings. Filter rows and rename columns as you see fit.
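As a rough sketch of those steps, the example below reads a tiny inline stand-in for the downloaded table instead of the real file. The column names (symbol, RefSeq_nuc, Ensembl_Gene), the versioned ID format and all values are assumptions for illustration, so check them against the header of the file you actually downloaded.

```r
# Inline stand-in for the downloaded table; for the real file, replace
# text = mane_txt with the path to the file. All values are invented.
mane_txt <- paste(
  "symbol\tRefSeq_nuc\tEnsembl_Gene",
  "GENEA\tNM_0001.3\tENSG01.5",
  "GENEB\tNM_0002.1\tENSG02.2",
  sep = "\n"
)
mane <- read.delim(text = mane_txt, stringsAsFactors = FALSE)

# Strip version suffixes (".3", ".5") so the IDs match unversioned input
mane$refseq_mrna     <- sub("\\.[0-9]+$", "", mane$RefSeq_nuc)
mane$ensembl_gene_id <- sub("\\.[0-9]+$", "", mane$Ensembl_Gene)

# Keep and order only the columns needed downstream
output.mappings <- mane[, c("refseq_mrna", "ensembl_gene_id", "symbol")]

# Hypothetical input IDs; the last one has no match in the table
ids <- data.frame(refseq_mrna = c("NM_0001", "NM_0002", "NM_0009"),
                  stringsAsFactors = FALSE)

# Left join: every input ID is kept, unmatched ones get NA
merged <- merge(ids, output.mappings, by = "refseq_mrna", all.x = TRUE)
print(merged)
```

Because such an equivalence table has one row per transcript pair, the merged result keeps the same number of rows as the input; IDs without a match show up as NA rather than disappearing.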