Question

BioMart dropping and duplicating Ensembl IDs while retrieving corresponding gene symbols?

3

Entering edit mode

3.9 years ago

skjw1029 ▴ 80

I'm trying to convert Ensembl IDs to Gene symbols within a summarized experiment object (more or less an expression matrix) using BioMart.

mart <- useDataset("hsapiens_gene_ensembl", useMart("ENSEMBL_MART_ENSEMBL"))
genes <- rownames(gse_cellgenefiltered_cohort1)
G_list <- getBM(filters= "ensembl_gene_id", attributes= c("ensembl_gene_id", "hgnc_symbol"),values=genes,mart= mart)

For some reason, there is a discrepancy between the number of Ensembl IDs I supply BioMart with and the number of Ensembl IDs it returns.

length(rownames(gse_cellgenefiltered_cohort1))

[1] 23395

length(G_list$ensembl_gene_id)

[1] 23316

Another thing I noticed, is that BioMart returns duplicated Ensembl IDs for some of them.

length(unique(G_list$ensembl_gene_id))

[1] 23314

I don't think there are any duplicated Ensembl IDs in the expression matrix.

length(unique(rownames(gse_cellgenefiltered_cohort1)))

[1] 23395

Would anyone know why this might be happening?

ensembl BioMart • 5.7k views

ADD COMMENT • link updated 2.5 years ago by Matthias Zepper 5.1k • written 3.9 years ago by skjw1029 ▴ 80

0

Entering edit mode

Hello, I am trying to convert RefSeq ids to gene symbols using the biomaRt R package. I followed the below script to align the input entries with the output. Surprisingly, I have provided 330655 RefSeq ids (Ensembl.ids$v1) and but biomart is giving me 344267 (merged$v1) RefSeq entries output. I am not sure what I am missing here. Please see the script here and help me figure out how this duplication of RefSeq and gene_name output can be fixed.

enter image description here

library(biomaRt)
library(annotables)
library(org.Hs.eg.db)
library(EnsDb.Hsapiens.v86)
library(tidyverse)
library(dplyr)
ensembl.ids <- read.delim('NMTO GENE.txt', header = F)

listEnsembl()
ensembl <- useEnsembl(biomart = "genes")
datasets <- listDatasets(ensembl)
ensembl.con <- useMart("ensembl", dataset = 'hsapiens_gene_ensembl')
attr <- listAttributes(ensembl.con)
filters <- listFilters(ensembl.con)
output.mappings <- getBM(attributes = c('external_gene_name','refseq_mrna', 'entrezgene_id'),
                         filters = "refseq_mrna",
                         values = ensembl.ids$V1,
                         mart = ensembl.con,
                         uniqueRows = TRUE)
merged <- merge(x = ensembl.ids, y = output.mappings, by.x="V1",by.y="refseq_mrna")
write.csv(merged, "shGFP.Features.csv")

ADD REPLY • link 2.5 years ago by mbansal • 0

1

Entering edit mode

Please do not post screenshots (use text and 101 button to format that as code). Screenshots do not allow people to copy text for testing. No one is going to type things from a screenshot manually.

ADD REPLY • link 2.5 years ago by GenoMax 151k

0

Entering edit mode

Thanks. I have updated the post. Can I add input data if anyone wants to replicate the issue I am having?

ADD REPLY • link 2.5 years ago by mbansal • 0

1

Entering edit mode

You are missing, that the IDs are not 1:1 mappings. Refseq, Entrez Gene and Ensembl are separate corpuses of human genome annotations and sometimes one ID in Ensembl might map to multiple IDs in Refseq etc. In that case, Biomart will duplicate the particular ID and output two rows.

You can test that by running dim(output.mappings[!duplicated(output.mappings),]) and e.g. dim(output.mappings[!duplicated(output.mappings$refseq_mrna),]). The more attributes you request from getBM, the more duplication you will generally see (if you request e.g. GO terms, you might get hundreds of rows per ensembl.id). How you deal with that downstream is up to you.

If you wish to use perfectly harmonized mappings between Refseq and Ensembl, you need to restrict yourself to the MANE corpus.

ADD REPLY • link 2.5 years ago by Matthias Zepper 5.1k

0

Entering edit mode

Would you mind if you please provide the corrected script? I tried but it is not giving the harmonized mapping between Refseq and Ensembl

ADD REPLY • link 2.5 years ago by mbansal • 0

0

Entering edit mode

I already linked the MANE website above. Navigate to Accessing MANE data and download this file. It is an equivalence table and contains the information you are looking for:

>     [  1] NCBI_GeneID
>     [  2] Ensembl_Gene
>     [  3] HGNC_ID
>     [  4] symbol
>     [  5] name
>     [  6] RefSeq_nuc
>     [  7] RefSeq_prot
>     [  8] Ensembl_nuc
>     [  9] Ensembl_prot
>     [ 10] MANE_status
>     [ 11] GRCh38_chr
>     [ 12] chr_start
>     [ 13] chr_end
>     [ 14] chr_strand

Read it into R with read.delim() or fread() or whatever function you prefer and use that as output.mappings. Filter rows and rename columns as you see fit.

ADD REPLY • link 2.5 years ago by Matthias Zepper 5.1k

1

Entering edit mode

3.9 years ago

rodolfo.peacewalker ▴ 390

There are some issues when using BiomaRt to retrieve or convert your id's. In your case, as Mike replied you, you lost some information because some genes lack of hgnc symbol. To solve this problem I suggest you to include the entrezgene_id value in the attributes to retrieve as well as the gene_biotype. On the other hand, use the left_join() function from dplyr to merge your query with your converted id's in order to preserve your original gene id's. However, I don't know whether summarized experiment objects allow you to make that kind of operations.

Respect to the duplicated id's, it could be associated to the biotype of your genes. In my experience, lncRNA. sometimes present two hgnc symbol. Another way to solve this problem is my using the distinct() function from dplyr. Here is a snippet of what you can do:

G_list <- distinct(G_list, ensembl_gene_id, .keep_all = TRUE)

Best regards!

ADD COMMENT • link 3.9 years ago by rodolfo.peacewalker ▴ 390

score 4 · Accepted Answer · 2021-07-22

There's a few things that might be going on, and it's hard to tell exactly without some examples of the missing or duplicated gene IDs. Here's some ideas though.

BioMart will silently drop any element in values that aren't found in the query. There's no error or anything, you just don't get a hit. That's easy to see with a single value, harder to spot in 23,000:

## query not found in Ensembl
getBM(values = c("ENSG_NOT_REAL"),
      filter = "ensembl_gene_id",
      attributes = c("ensembl_gene_id", "hgnc_symbol"),
      mart = mart)
#> [1] ensembl_gene_id hgnc_symbol    
#> <0 rows> (or 0-length row.names)

You can try to identify what input values aren't returned in the results with something like genes[ !genes %in% G_list$ensembl_gene_id ]. If that finds something I'd search the Ensembl website manually with a few of the IDs and try to understand why they might be missing from BioMart e.g. they might be from an old Ensembl version and have been retired - there are probably many possible reasons.

For completeness I'll also point out that Ensembl BioMart will ignore duplicate entries in the the values argument e.g..

## duplicated input values
getBM(values = c("ENSG00000010404", "ENSG00000010404"),
      filter = "ensembl_gene_id",
      attributes = c("ensembl_gene_id", "hgnc_symbol"),
      mart = mart)
#>   ensembl_gene_id hgnc_symbol
#> 1 ENSG00000010404         IDS

However it looks like you've already checked this isn't the case in your data.

Regarding the duplicated entries in the results, this can occur if there is a one-to-many mapping between the two ID types you're trying to find e.g.

## one-to-many mapping
getBM(values = "ENSG00000277796",
      filter = "ensembl_gene_id",
      attributes = c("ensembl_gene_id", "hgnc_symbol"),
      mart = mart)
#>   ensembl_gene_id hgnc_symbol
#> 1 ENSG00000277796      CCL3L3
#> 2 ENSG00000277796      CCL3L1

Mapping between IDs from different organisations is never perfect and it's pretty common to see instances like this, where a single Ensembl ID maps to two HGNC IDs (or vice versa). You could try to identify the duplicated entries with

G_list[ duplicated(G_list$ensembl_gene_id) | duplicated(G_list$ensembl_gene_id, fromLast = TRUE), ]