There are a few things that might be going on, and it's hard to tell exactly without some examples of the missing or duplicated gene IDs. Here are some ideas, though.
BioMart will silently drop any element in `values` that isn't found in the database. There's no error or warning; you just don't get a hit. That's easy to see with a single value, harder to spot in 23,000:
## query not found in Ensembl
getBM(values = c("ENSG_NOT_REAL"),
filter = "ensembl_gene_id",
attributes = c("ensembl_gene_id", "hgnc_symbol"),
mart = mart)
#> [1] ensembl_gene_id hgnc_symbol
#> <0 rows> (or 0-length row.names)
You can try to identify which input values aren't returned in the results with something like `genes[ !genes %in% G_list$ensembl_gene_id ]`. If that finds something, I'd search the Ensembl website manually with a few of the IDs and try to understand why they might be missing from BioMart, e.g. they might be from an old Ensembl version and have been retired. There are probably many possible reasons.
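As a rough sketch of that check, the snippet below uses small stand-in objects rather than a live BioMart query: `genes` plays the role of the input vector and `G_list` the role of the `getBM()` result, reusing the IDs from the examples in this answer. The names `missing_ids` and `dup_inputs` are made up for illustration.

```r
## toy stand-ins: `genes` is the query vector, `G_list` the getBM() result
genes  <- c("ENSG00000010404", "ENSG00000277796", "ENSG_NOT_REAL")
G_list <- data.frame(
  ensembl_gene_id = c("ENSG00000010404", "ENSG00000277796", "ENSG00000277796"),
  hgnc_symbol     = c("IDS", "CCL3L3", "CCL3L1")
)

## input IDs that have no row at all in the result
missing_ids <- setdiff(genes, G_list$ensembl_gene_id)
missing_ids

## input IDs that were supplied more than once
dup_inputs <- unique(genes[duplicated(genes)])
dup_inputs
```

With real data, a handful of the IDs in `missing_ids` are the ones worth looking up by hand on the Ensembl website.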
For completeness, I'll also point out that Ensembl BioMart will ignore duplicate entries in the `values` argument, e.g.
## duplicated input values
getBM(values = c("ENSG00000010404", "ENSG00000010404"),
filter = "ensembl_gene_id",
attributes = c("ensembl_gene_id", "hgnc_symbol"),
mart = mart)
#> ensembl_gene_id hgnc_symbol
#> 1 ENSG00000010404 IDS
However it looks like you've already checked this isn't the case in your data.
Regarding the duplicated entries in the results, this can occur if there is a one-to-many mapping between the two ID types you're trying to find e.g.
## one-to-many mapping
getBM(values = "ENSG00000277796",
filter = "ensembl_gene_id",
attributes = c("ensembl_gene_id", "hgnc_symbol"),
mart = mart)
#> ensembl_gene_id hgnc_symbol
#> 1 ENSG00000277796 CCL3L3
#> 2 ENSG00000277796 CCL3L1
Mapping between IDs from different organisations is never perfect and it's pretty common to see instances like this, where a single Ensembl ID maps to two HGNC IDs (or vice versa). You could try to identify the duplicated entries with
G_list[ duplicated(G_list$ensembl_gene_id) | duplicated(G_list$ensembl_gene_id, fromLast = TRUE), ]
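If you need exactly one row per Ensembl ID afterwards, two common options are sketched here on a toy `G_list` built from the examples in this answer. Which one is appropriate depends on your downstream analysis; `first_hit` and `collapsed` are illustrative names only.

```r
## toy stand-in for a getBM() result with a one-to-many mapping
G_list <- data.frame(
  ensembl_gene_id = c("ENSG00000010404", "ENSG00000277796", "ENSG00000277796"),
  hgnc_symbol     = c("IDS", "CCL3L3", "CCL3L1")
)

## option 1: keep only the first symbol seen for each gene ID
first_hit <- G_list[!duplicated(G_list$ensembl_gene_id), ]

## option 2: keep every symbol, pasted into one string per gene ID
collapsed <- aggregate(hgnc_symbol ~ ensembl_gene_id, data = G_list,
                       FUN = paste, collapse = ";")
collapsed
```

Option 1 silently discards information, so option 2 (or keeping the duplicated rows) is usually the safer starting point.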
Hello, I am trying to convert RefSeq IDs to gene symbols using the biomaRt R package. I followed the script below to align the input entries with the output. Surprisingly, I provided 330655 RefSeq IDs (Ensembl.ids$v1), but BioMart is giving me 344267 RefSeq entries in the output (merged$v1). I am not sure what I am missing here. Please see the script here and help me figure out how this duplication of RefSeq and gene_name output can be fixed.
Please do not post screenshots (use text and the `101` button to format it as code). Screenshots do not allow people to copy text for testing. No one is going to type things from a screenshot manually.

Thanks. I have updated the post. Can I add input data if anyone wants to replicate the issue I am having?
What you are missing is that the IDs are not 1:1 mappings. RefSeq, Entrez Gene and Ensembl are separate corpora of human genome annotations, and sometimes one ID in Ensembl maps to multiple IDs in RefSeq (or vice versa). In that case, BioMart will repeat that ID and output one row per mapping.
You can test that by running `dim(output.mappings[!duplicated(output.mappings),])` and e.g. `dim(output.mappings[!duplicated(output.mappings$refseq_mrna),])`. The more attributes you request from `getBM`, the more duplication you will generally see (if you request e.g. GO terms, you might get hundreds of rows per `ensembl.id`). How you deal with that downstream is up to you.

If you wish to use perfectly harmonized mappings between RefSeq and Ensembl, you need to restrict yourself to the MANE corpus.
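To see which IDs are responsible for the extra rows, rather than only how many there are, something along these lines may help. The `output.mappings` table here is a toy stand-in with placeholder IDs (not real accessions), and the column names are assumed to match what was requested from `getBM`.

```r
## toy stand-in for the getBM() output; all IDs are placeholders
output.mappings <- data.frame(
  refseq_mrna     = c("NM_A", "NM_A", "NM_B", "NM_C"),
  ensembl_gene_id = c("ENSG_1", "ENSG_2", "ENSG_3", "ENSG_4")
)

## number of output rows per RefSeq ID
n_rows_per_id <- table(output.mappings$refseq_mrna)

## RefSeq IDs that appear on more than one row
multi <- names(n_rows_per_id)[n_rows_per_id > 1]
output.mappings[output.mappings$refseq_mrna %in% multi, ]

## how many rows are "extra" relative to the unique RefSeq IDs
n_extra <- nrow(output.mappings) - length(unique(output.mappings$refseq_mrna))
n_extra
```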
Would you mind providing the corrected script? I tried, but it is not giving the harmonized mapping between RefSeq and Ensembl.
I already linked the MANE website above. Navigate to Accessing MANE data and download this file. It is an equivalence table and contains the information you are looking for:
Read it into R with `read.delim()` or `fread()` or whatever function you prefer and use that as `output.mappings`. Filter rows and rename columns as you see fit.
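A minimal sketch of that step is below. It is not run against the real download: the table is a two-row toy built in memory, the column names (`RefSeq_nuc`, `Ensembl_nuc`, `Ensembl_Gene`, `symbol`) are an assumption about the file's header that you should check against the file you actually downloaded, and every ID is a placeholder. It also assumes the accessions carry a version suffix (e.g. `.2`) that needs stripping to match unversioned IDs.

```r
## Toy stand-in for the downloaded table. With the real file you would call
## read.delim() on its path instead. Column names are ASSUMED, IDs are fake.
mane_txt <- "RefSeq_nuc\tEnsembl_nuc\tEnsembl_Gene\tsymbol
NM_PLACEHOLDER1.2\tENST_PLACEHOLDER1.5\tENSG_PLACEHOLDER1.10\tGENE1
NM_PLACEHOLDER2.4\tENST_PLACEHOLDER2.3\tENSG_PLACEHOLDER2.7\tGENE2"
mane <- read.delim(text = mane_txt, stringsAsFactors = FALSE)

## hypothetical helper: drop a trailing ".<version>" from an accession
strip_version <- function(x) sub("\\.[0-9]+$", "", x)

## keep and rename the columns of interest
output.mappings <- data.frame(
  refseq_mrna     = strip_version(mane$RefSeq_nuc),
  ensembl_gene_id = strip_version(mane$Ensembl_Gene),
  hgnc_symbol     = mane$symbol,
  stringsAsFactors = FALSE
)

## sanity check: each RefSeq ID should now appear only once
anyDuplicated(output.mappings$refseq_mrna)
```

You can then `merge()` your own RefSeq IDs against this table; IDs that are not part of MANE will simply have no match.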