I need to align a dataset mapped to GRCh38.p2 (Ensembl 79) with a dataset mapped to GRCh38.p13 (Ensembl 98). The first dataset (Ensembl 79) has gene names and Entrez IDs. The second dataset (Ensembl 98) has gene names and ENSG IDs. I want to convert the Ensembl 79 Entrez IDs to ENSG IDs. When I query with biomaRt, almost half of the genes are not found. I have tried using both "external_gene_name" and "entrezgene" as filters. I have tried using both the most recent mart and archived marts (Ensembl 77-80).
FYI: approximately 25,000 genes were not found, and about 10,000 of these are pseudogenes.
Code below:
library(biomaRt)
# List the available archive hosts and the Ensembl release each one serves
listEnsemblArchives()
# Connect to the October 2014 archive
biomart <- useMart("ensembl", host = "https://oct2014.archive.ensembl.org", dataset = "hsapiens_gene_ensembl")
filters <- listFilters(biomart)
attributes <- listAttributes(biomart)
# Map the Entrez IDs to Ensembl gene IDs
m1.biomart <- getBM(filters = "entrezgene", attributes = c("ensembl_gene_id", "entrezgene", "external_gene_name", "hgnc_symbol"), values = m1.entrez.ids$entrez_id, mart = biomart)
length(unique(m1.entrez.ids$entrez_id))
[1] 50281
length(unique(m1.biomart$entrezgene))
[1] 25987
length(unique(m1.biomart$ensembl_gene_id))
[1] 28701
It would be helpful if you can give some examples of Entrez IDs that are missing in the BioMart response.
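One way to pull out such examples is to compare the IDs that were submitted with the IDs that came back. This is only a minimal sketch: it reuses the object and column names from the code above, but the two data frames here are small made-up stand-ins (the IDs and the placeholder Ensembl strings are hypothetical), so swap in the real `m1.entrez.ids` and `m1.biomart`.

```r
# Toy stand-ins for the objects in the question (hypothetical values, not real results)
m1.entrez.ids <- data.frame(entrez_id = c(1, 2, 9, 10, 100))
m1.biomart <- data.frame(
  ensembl_gene_id = c("ENSG_placeholder_1", "ENSG_placeholder_2"),
  entrezgene = c(1, 2)
)

# Compare as character so that integer vs. numeric vs. character columns still match
queried <- unique(as.character(m1.entrez.ids$entrez_id))
returned <- unique(as.character(m1.biomart$entrezgene))

# Entrez IDs that were submitted but never returned by getBM()
missing.ids <- setdiff(queried, returned)

head(missing.ids)    # a few examples to post
length(missing.ids)  # how many are missing in total
```

Posting the first handful of `missing.ids` should make it easier to check by hand whether those genes exist in the archive being queried.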
I have a problem similar to danielcgingerich's, but for finding Entrez IDs from Ensembl IDs: my code is not giving me Entrez IDs for genes that do have them (when searched via https://www.ncbi.nlm.nih.gov/gene/). I'm not sure if this warrants its own post, so I am asking it here.
In the example below, ENSMUSG00000000031 does not appear to have an Entrez ID, but searching through NCBI shows that it refers to H19.
Output of head(mapping):