Question

Translating gene names to Entrez IDs

1

4.1 years ago

Julian ▴ 10

Hi all,

I have many gene names that I'm trying to map to Entrez IDs.

Right now I use the esearch function in Biopython to query them one by one, but this takes a long time for 30,000 gene names, and ideally I would like it to be faster. I assume it would be faster to query all 30,000 at once instead of making 30,000 separate queries.

This is my current implementation:

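from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder; NCBI asks for a contact address

for line in f:  # f is an already-open file with one gene name at the start of each line
    fields = [item.strip('"') for item in line.strip().split()]
    gene = fields[0]
    # Search NCBI for existing gene ids
    gene_id = None
    handle = Entrez.esearch(db="gene", term="Homo sapiens[orgn] AND " + gene + "[Gene Name]")
    record = Entrez.read(handle)
    handle.close()
    try:
        gene_id = record["IdList"][0]
    except IndexError:
        pass  # no hit for this gene name; gene_id stays None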

This works, but I would like a better solution. Is there a better way to approach this?

Kind regards, Julian

database annotation biopython gene python • 7.7k views

updated 17 months ago by Ram 45k • written 4.1 years ago by Julian ▴ 10
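The batching idea raised in the question can be sketched as follows. This is only an illustration, not a tested replacement for the loop above: the helper names `chunked` and `build_batch_terms` and the default batch size of 200 are assumptions made for the example. It builds one search term per batch by OR-ing the gene names together, so 30,000 names become a few hundred queries rather than 30,000.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_terms(symbols, batch_size=200, organism="Homo sapiens"):
    """Build one esearch term per batch, OR-ing the gene names together.

    batch_size=200 is an arbitrary illustrative default, not an NCBI limit.
    """
    terms = []
    for batch in chunked(symbols, batch_size):
        ored = " OR ".join(f"{symbol}[Gene Name]" for symbol in batch)
        terms.append(f"{organism}[orgn] AND ({ored})")
    return terms


symbols = ["BRCA1", "BRCA2", "BRCC3", "ATM", "TP53"]
terms = build_batch_terms(symbols, batch_size=2)
for term in terms:
    print(term)
```

Each term could then be passed to `Entrez.esearch` in place of the single-gene term. Note that the returned `IdList` does not say which name produced which ID, so a follow-up step (for example `Entrez.esummary` on the returned IDs) would probably be needed to recover the pairing.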

2

convert gene name to entrez id

4.1 years ago by patelk26 ▴ 320

0

The correct term for the identifiers you're calling "gene names" is HGNC Gene Symbols. HGNC also has "Gene Names", which are more like descriptions than one-word symbols.

4.1 years ago by Ram 45k

3

4.1 years ago

Kevin Blighe 89k

Hi,

Adding this more as a reference answer for future users.

genes <- c('BRCA1', 'BRCA2', 'BRCC3', 'ATM', 'TP53')

1, via `org.Hs.eg.db`

require(org.Hs.eg.db)
mapIds(
  org.Hs.eg.db,
  keys = genes,
  column = 'ENTREZID',
  keytype = 'SYMBOL')

BRCA1   BRCA2   BRCC3     ATM    TP53 
"672"   "675" "79184"   "472"  "7157" 


select(
  org.Hs.eg.db,
  keys = genes,
  columns = c('SYMBOL', 'ENTREZID', 'ENSEMBL'),
  keytype = 'SYMBOL')

  SYMBOL ENTREZID         ENSEMBL
1  BRCA1      672 ENSG00000012048
2  BRCA2      675 ENSG00000139618
3  BRCC3    79184 ENSG00000185515
4    ATM      472 ENSG00000149311
5   TP53     7157 ENSG00000141510

2, via `biomaRt`

require(biomaRt)
ensembl <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl')

annot <- getBM(
  attributes = c(
    'hgnc_symbol',
    'external_gene_name',
    'ensembl_gene_id',
    'entrezgene_id',
    'gene_biotype'),
  filters = 'external_gene_name',
  values = genes,
  mart = ensembl)

annot <- merge(
  x = as.data.frame(genes),
  y = annot,
  by.x = 'genes',
  by.y = 'external_gene_name',
  all.x = TRUE)

annot
  genes hgnc_symbol ensembl_gene_id entrezgene_id   gene_biotype
1   ATM         ATM ENSG00000149311           472 protein_coding
2 BRCA1       BRCA1 ENSG00000012048           672 protein_coding
3 BRCA2       BRCA2 ENSG00000139618           675 protein_coding
4 BRCC3       BRCC3 ENSG00000185515         79184 protein_coding
5  TP53        TP53 ENSG00000141510          7157 protein_coding

Kevin

updated 17 months ago by Ram 45k • written 4.1 years ago by Kevin Blighe 89k

0

Hi Kevin, this helped a lot! I am in a situation where there are some key mismatches between my TxDb and org.db objects. I made it work using the mapIds() approach. I can't thank you enough. Take care

17 months ago by Wassim Salam • 0

0

4.1 years ago

vkkodali_ncbi ★ 3.8k

Adding NCBI Datasets as another option. Specifically, go to its Data Tables section and load all of the gene names, either by pasting the list or by uploading your text file. You can then download a gene-centric or a transcript-centric table that can be opened in Excel, imported into R, etc., as well as sequence data if needed.

4.1 years ago by vkkodali_ncbi ★ 3.8k

score 5 · Accepted Answer · 2021-03-25

Using EntrezDirect directly instead of via Python:

$ esearch -db gene -query "human [orgn]" | efetch -format tabular | awk -F "\t" '{OFS="\t"}{print $6,$3}'
Symbol  GeneID
LOC120893160    120893160
LOC120893158    120893158
LOC120893156    120893156
LOC120893154    120893154
LOC120893152    120893152
LOC120893150    120893150
LOC120893148    120893148
LOC120893146    120893146
LOC120893144    120893144

To keep only live entries without LOC in the symbol (drop everything from `grep` onward to get them all):

$ esearch -db gene -query "human [orgn]" | efetch -format tabular | awk -F "\t" '{OFS="\t"}{if ($5 == "live") print $6,$3}' | grep -v "LOC" | head -10
SLC17A6-DT  120883619
TTC12-DT    120883617
TPBGL-AS1   120883615
PATL1-DT    120883613
TP53    7157
IGSF22-AS1  120883618
EMSY-DT 120883616
CCDC90B-AS1 120883614
EGFR    1956
TNF 7124
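Once a Symbol/GeneID table like the one above is saved locally, the whole list of gene names can be mapped without further queries. Below is a minimal Python sketch of that lookup. It assumes the table is tab-separated with the symbol in the first column and the ID in the second (as the awk command emits); the function name `load_symbol_table` and the inline sample data are made up for illustration, using three rows from the output above.

```python
import csv
import io


def load_symbol_table(handle):
    """Read tab-separated Symbol/GeneID rows into a dict, skipping the header."""
    table = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0] == "Symbol":
            continue  # skip blank or short lines and the header row
        symbol, gene_id = row[0], row[1]
        table.setdefault(symbol, gene_id)  # keep the first ID seen per symbol
    return table


# Inline stand-in for a saved file; in practice, pass an open file handle.
example = io.StringIO("Symbol\tGeneID\nTP53\t7157\nEGFR\t1956\nTNF\t7124\n")
symbol_to_id = load_symbol_table(example)

wanted = ["TP53", "EGFR", "NOT_A_GENE"]
mapped = {symbol: symbol_to_id.get(symbol) for symbol in wanted}
print(mapped)
```

Symbols missing from the table map to `None`, which makes unmatched names easy to list and check by hand afterwards.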
