How to deal with the missing values after converting genes to Entrez Gene ID using BioMart?
5.2 years ago
lihe.liu ▴ 30

Hi,

I am trying to convert a list of genes to Entrez Gene IDs. For each gene in my list I have either the external_gene_name or the entrezgene_accession, and I need to convert them to the EntrezID level.

I tried BioMart:

library(biomaRt)

# Bos taurus gene dataset
ensembl = useMart("ENSEMBL_MART_ENSEMBL", dataset = "btaurus_gene_ensembl")
attributes = c("external_gene_name", "entrezgene_accession", "entrezgene_id")
gene = getBM(attributes = attributes, mart = ensembl)

test = c("ZNF567", "MILR1")
gene[gene$external_gene_name %in% test, ] # cannot get EntrezID here

This works fine for most genes.

My problem is that a few genes always come back with no EntrezID, yet when I look them up manually at NCBI (https://www.ncbi.nlm.nih.gov/), I can find the corresponding EntrezID.

For example, "ZNF567" and "MILR1" get no EntrezID match from the R code, although both do have an EntrezID (https://www.ncbi.nlm.nih.gov/gene/532421) (https://www.ncbi.nlm.nih.gov/gene/789682).

I wonder what the problem could be, and whether there are other ways to do the conversion seamlessly.

I am on macOS 10.14.6.

Thank you so much!

Best

R biomart • 3.2k views

Most likely this is because these genes are either not present in Ensembl, or the Ensembl genes with these names do not match the Entrez genes of the same name.


Thank you so much!!!

5.2 years ago
Emily 24k

Just to expand on what Jean-Karim said, BioMart does not map identifiers all to all. It maps all identifiers to Ensembl genes, so when you map gene name->Entrez, or vice versa, you're actually mapping gene name->Ensembl gene->Entrez. This means that if the mapping between an identifier and Ensembl is missing, you won't get the mapping between the two databases.
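
To make the two-step mapping concrete, here is a minimal base R sketch. The tables and every identifier in it (GENE_A, ENSG_TOY_1, 1001 and so on) are invented for illustration and are not real Ensembl or Entrez records; it only mimics the join logic, not how BioMart is implemented internally.

```r
# Toy tables, invented for illustration (placeholder IDs, not real records).
# Each external identifier is linked to an Ensembl gene, never directly
# to another external identifier.
name_to_ensembl <- data.frame(
  external_gene_name = c("GENE_A", "GENE_B"),
  ensembl_gene_id    = c("ENSG_TOY_1", "ENSG_TOY_2"),
  stringsAsFactors   = FALSE
)
ensembl_to_entrez <- data.frame(
  ensembl_gene_id  = "ENSG_TOY_1",
  entrezgene_id    = 1001L,
  stringsAsFactors = FALSE
)

# Left join through the Ensembl gene: keep every gene name and fill in
# the Entrez ID only where the Ensembl gene carries that cross-reference.
mapped <- merge(name_to_ensembl, ensembl_to_entrez,
                by = "ensembl_gene_id", all.x = TRUE)

# Gene names whose Ensembl gene has no Entrez link end up with NA.
missing <- mapped$external_gene_name[is.na(mapped$entrezgene_id)]
print(mapped)
print(missing)
```

In this toy case GENE_B exists in both "databases" but has no Entrez link on the Ensembl side, so it comes back as NA, which is the same pattern as a gene that NCBI knows about but BioMart cannot map.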


That makes sense to me! Thank you so much!


Hi Emily,

Quick question: in terms of GO records (Bos taurus), are the Ensembl database and the org.Bt.eg.db database very different?

I was trying to get all the GO terms annotated for my species, but the two give me quite different numbers of GO terms. Is this normal?

What is a good way to get all the GO records of a species? Thank you so much!

Best


If by org.Bt.eg.db you mean the R Bioconductor package, then this package uses Entrez Gene and GenBank identifiers. According to the documentation, the latest version at this time uses data from 26 April 2019. Given that NCBI and Ensembl have different annotation processes (different gene sets, different ways of associating GO terms...), it's not surprising that you get different results depending on which resource you use. To get all GO terms associated with genes in Ensembl, you could use the Perl API. Below is a piece of code I used for that purpose some time ago:

use warnings;
use strict;
use Bio::EnsEMBL::Registry;

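# Ensembl databases (connection settings are read from ~/.ensembl_init)
my $registry = "Bio::EnsEMBL::Registry";
$registry->load_all("$ENV{'HOME'}/.ensembl_init");
# This script was written for human; for cattle, request 'Bos taurus' instead.
my $dbh = $registry->get_DBAdaptor('Homo sapiens', 'core');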

my $dbname = $dbh->dbc->dbname();
my ($version) = $dbname=~/core_(\d+)_/i;

# Get all protein coding genes
print STDERR "Getting all protein coding genes...";
my $ga = $dbh->get_GeneAdaptor();
my @genes = @{$ga->fetch_all_by_biotype('protein_coding')};
print STDERR "Done.\n";
print STDERR "Getting GO annotations...";
open OUT,">","Ensembl".$version."_annotations.txt" or die "\nERROR: Can't create file Ensembl".$version."_annotations.txt: $!";
foreach my $gene(@genes) {
  my @GO_annots = @{$gene->get_all_DBLinks("GO")};
  foreach my $GO_term(@GO_annots) {
    print OUT $gene->stable_id,"\t",$GO_term->display_id,"\n";
  }
}
close OUT;
print STDERR "Done.\n";
exit(0);