If you have the mygene library installed in Python, you could use the following Python script:
#!/usr/bin/env python
import sys
import mygene

mg = mygene.MyGeneInfo()

# Read one HGNC symbol per line from standard input
genes = []
for line in sys.stdin:
    genes.append(line.strip())

# Query MyGene.info for each symbol and print tab-separated symbol/Ensembl ID pairs
for gene in genes:
    result = mg.query(gene, scopes="symbol", fields=["ensembl"], species="human", verbose=False)
    hgnc_name = gene
    for hit in result["hits"]:
        if "ensembl" in hit and "gene" in hit["ensembl"]:
            sys.stdout.write("%s\t%s\n" % (hgnc_name, hit["ensembl"]["gene"]))
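One caveat: as far as I know, MyGene.info can return the "ensembl" field as a list of records rather than a single record when one entry maps to several Ensembl genes. The check in the script would then silently skip that hit. The following is a minimal sketch of a helper that handles both shapes. The name ensembl_gene_ids and the placeholder IDs are my own, made up for illustration, and the hits are hand-written rather than real query results:

```python
def ensembl_gene_ids(hit):
    """Return every Ensembl gene ID in one hit.

    Assumes hit["ensembl"] is either a dict with a "gene" key
    or a list of such dicts; anything else yields an empty list.
    """
    entry = hit.get("ensembl")
    if entry is None:
        return []
    # Normalize a single record to a one-element list
    entries = entry if isinstance(entry, list) else [entry]
    return [e["gene"] for e in entries if isinstance(e, dict) and "gene" in e]


# Hand-written example hits with placeholder IDs (not real query results)
single_hit = {"ensembl": {"gene": "ENSG_PLACEHOLDER_1"}}
multi_hit = {"ensembl": [{"gene": "ENSG_PLACEHOLDER_2"}, {"gene": "ENSG_PLACEHOLDER_3"}]}
empty_hit = {"symbol": "NO_ENSEMBL_FIELD"}

for hit in (single_hit, multi_hit, empty_hit):
    print(ensembl_gene_ids(hit))
```

In the script, the innermost loop could then iterate over ensembl_gene_ids(hit) and write one line per ID.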
If you don't have mygene installed and you want to install it, you could run the following:
$ pip install mygene
As an example, here are HGNC names of genes in a file called "hgnc.txt":
DDX26B
CCDC83
MAST3
RPL11
ZDHHC20
LUC7L3
SNORD49A
CTSH
ACOT8
The above script would give the following output:
$ ./map_hgnc_to_ensg.py < hgnc.txt
DDX26B ENSG00000225235
DDX26B ENSG00000165359
CCDC83 ENSG00000150676
MAST3 ENSG00000099308
RPL11 ENSG00000142676
ZDHHC20 ENSG00000180776
ZDHHC20 ENSG00000236953
LUC7L3 ENSG00000108848
SNORD49A ENSG00000277370
CTSH ENSG00000103811
ACOT8 ENSG00000101473
You could write the output to a text file like so:
$ ./map_hgnc_to_ensg.py < hgnc.txt > hgnc_mapped_to_ensg.txt
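Note that there is not a one-to-one correspondence between HGNC symbols and Ensembl gene IDs: a single symbol can map to more than one Ensembl ID, as DDX26B and ZDHHC20 do in the output above. See the following post from Emily_Ensembl for discussion: "Why am I getting different ensembl gene ids for a given gene symbol?"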
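If you want to see which symbols came back with more than one Ensembl ID, you could group the tab-separated output by symbol. This is only a sketch: the function names group_by_symbol and ambiguous_symbols are my own, and the sample lines are copied from the output shown above. In practice you would pass an open file handle for "hgnc_mapped_to_ensg.txt" instead of the sample list.

```python
from collections import defaultdict


def group_by_symbol(lines):
    """Collect Ensembl IDs per HGNC symbol from 'symbol<TAB>id' lines."""
    mapping = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        symbol, ensembl_id = line.split("\t")
        mapping[symbol].append(ensembl_id)
    return dict(mapping)


def ambiguous_symbols(mapping):
    """Keep only the symbols that map to more than one Ensembl ID."""
    return {symbol: ids for symbol, ids in mapping.items() if len(ids) > 1}


# Sample lines taken from the script output above
sample = [
    "DDX26B\tENSG00000225235\n",
    "DDX26B\tENSG00000165359\n",
    "CCDC83\tENSG00000150676\n",
]

mapping = group_by_symbol(sample)
print(ambiguous_symbols(mapping))
# prints {'DDX26B': ['ENSG00000225235', 'ENSG00000165359']}
```

You would still need to decide by hand, or with the linked discussion in mind, which of the IDs is the one you want for each ambiguous symbol.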