Question

biomaRt not finding all genes

4.0 years ago

danielcgingerich ▴ 10

I need to align a dataset mapped to GRCh38.p2 (Ensembl 79) with a dataset mapped to GRCh38.p13 (Ensembl 98). The first dataset (Ensembl 79) has gene names and Entrez IDs. The second dataset (Ensembl 98) has gene names and ENSG IDs. I want to convert the Ensembl 79 Entrez IDs to ENSG IDs. When I query with biomaRt, almost half of the genes are not found. I have tried both "external_gene_name" and "entrezgene" as filters, and I have tried both the most recent mart and archived marts (Ensembl 77-80).

FYI: approximately 25,000 genes were not found, and about 10,000 of these are pseudogenes.

Code below:

library(biomaRt)
listEnsemblArchives()
# Connect to the October 2014 archive
biomart <- useMart("ensembl", host = "https://oct2014.archive.ensembl.org", dataset = "hsapiens_gene_ensembl")
filters <- listFilters(biomart)
attributes <- listAttributes(biomart)

# Map the Entrez IDs to Ensembl gene IDs
m1.biomart <- getBM(filters = "entrezgene", attributes = c("ensembl_gene_id", "entrezgene", "external_gene_name", "hgnc_symbol"),
                    values = m1.entrez.ids$entrez_id, mart = biomart)

length(unique(m1.entrez.ids$entrez_id))
[1] 50281

length(unique(m1.biomart$entrezgene))
[1] 25987

length(unique(m1.biomart$ensembl_gene_id))
[1] 28701
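
The counts above show more distinct Ensembl gene IDs than distinct Entrez IDs in the result, so at least some Entrez IDs map to more than one ENSG ID. A minimal sketch of how that could be checked is below. It uses a made-up toy data frame (`toy_mapping`, with placeholder IDs) in place of `m1.biomart`, so the names and values are assumptions for illustration only.

```r
# Toy stand-in for the getBM() result; IDs are placeholders, not real genes.
toy_mapping <- data.frame(
  ensembl_gene_id = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
  entrezgene      = c(1L, 1L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Keep each (Entrez, Ensembl) pair once, then count Ensembl IDs per Entrez ID.
pairs <- unique(toy_mapping[, c("entrezgene", "ensembl_gene_id")])
ensg_per_entrez <- table(pairs$entrezgene)

# Entrez IDs that map to more than one Ensembl gene ID.
multi_mapped <- names(ensg_per_entrez)[as.vector(ensg_per_entrez) > 1]
print(multi_mapped)
```

With the real result, `toy_mapping` would be replaced by `m1.biomart`.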

genome assembly id mapping biomaRt alignment • 2.6k views


It would be helpful if you can give some examples of Entrez IDs that are missing in the BioMart response.
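
One way to pull out such examples is to compare the queried IDs with the IDs that came back. The sketch below uses made-up vectors (`query_ids`, `returned_ids`) standing in for `m1.entrez.ids$entrez_id` and `m1.biomart$entrezgene`; both are converted to character first, on the assumption that the two columns might not share a type.

```r
# Toy stand-ins for the queried and returned Entrez IDs (values are made up).
query_ids    <- c(10, 20, 30, 40, 50)
returned_ids <- c(10, 30, 30, 50)

# IDs that were queried but are absent from the BioMart response.
missing_ids <- setdiff(as.character(query_ids), as.character(returned_ids))

# Show a few examples and the total number missing.
print(head(missing_ids, 10))
print(length(missing_ids))
```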

4.0 years ago by Mike Smith ★ 2.1k

I have a problem similar to danielcgingerich's, but in the other direction: finding Entrez IDs from Ensembl IDs. My code is not giving me Entrez IDs for genes that do have them (when searched via https://www.ncbi.nlm.nih.gov/gene/). I'm not sure if this warrants its own post, so I am asking here.

In the example below, no Entrez ID is returned for ENSMUSG00000000031, but searching NCBI shows that it refers to H19.

library(biomaRt)
ensemblIDs <- c('ENSMUSG00000000001', 'ENSMUSG00000000003', 'ENSMUSG00000000028', 'ENSMUSG00000000031', 'ENSMUSG00000000037') # this is just a sample
mmusmart <- useMart(dataset = "mmusculus_gene_ensembl", biomart = "ensembl")
mapping <- getBM(
   attributes = c('ensembl_gene_id', 'entrezgene_id', 'entrezgene_accession'), 
   filters = 'ensembl_gene_id',
   values = ensemblIDs,
   mart = mmusmart)

Output of head(mapping):

  ensembl_gene_id     entrezgene_id  entrezgene_accession
  <chr>               <int>          <chr>
1 ENSMUSG00000000001  14679          Gnai3
2 ENSMUSG00000000003  54192          Pbsn
3 ENSMUSG00000000028  12544          Cdc45
4 ENSMUSG00000000031  NA
5 ENSMUSG00000000037  107815         Scml2
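
To list every Ensembl ID that came back without an Entrez ID, the result can be filtered on `NA`. This is only a sketch: `toy_mapping` is a hand-built copy of two rows from the output above, not a live query result.

```r
# Two rows copied from the output above, rebuilt by hand for illustration.
toy_mapping <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000000001", "ENSMUSG00000000031"),
  entrezgene_id   = c(14679L, NA),
  stringsAsFactors = FALSE
)

# Ensembl IDs for which the Entrez ID column is NA.
unmapped <- toy_mapping$ensembl_gene_id[is.na(toy_mapping$entrezgene_id)]
print(unmapped)
```

With the real data, `toy_mapping` would be replaced by `mapping`.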

4.0 years ago by AndRewster • 0