Convert human gene names to Ensembl IDs
7.3 years ago
jehu ▴ 30

Hi,

I found a list of differentially expressed human genes and I would like to convert them to Ensembl IDs. Any idea how? Are there any online tools? I can't find one yet.

below is the gene list ADIPOQ(ACRP30),CYR61,NNMT,TCEAL7,PDK4,GJA1(CX43),EMP1,SAT1,AKR1C3,SCD,KLF4,PRKAR2B,MYH3,OGN,PERP,GADD45A,RGS2,ZFP36L1,MYH8,MEST,RBP4,DCLK1,EFCAB7,MGST1,PRUNE2,ANXA1,CD44,SNX7,MMD,ASPN,CFH,COL5A2,MLLT11,COL1A1,SOX4,CTGF,CHRNA1,OSBPL8,CDKN1A,LUM,EFEMP1,ANTXR1,ECM2,PHLDB2,FGL2,TUBB2A,EGR1,FST,COL21A1,SFRP2,ASPH,CYP26B1,MGP,ABI3BP,HMGB2,ABLIM1,EDIL3,ABCA8,H19,COL14A1,LHFP,BTG1,VCAN,PLIN1,SRPX,SKI,MEOX2,CALR,MARCKS,CCDC80,HACD2,ARHGAP36,ATP6AP2,TIMP2,PLAGL1,SH3BGRL,FABP4,ZFP36L2,MATN2,ELOVL5,RHOB,IGF2,NPNT,SESN3,ACTN1,CNN3,CITED2,NID1,EBF1,CHRDL1,RAPH1,ACTC1,CRISPLD1,MYH10,ELK3,C11orf96,CALD1,CAV2,IGF1,TCF7L2,MDM2,PRNP,FNDC1,TUBA1A,CYBRD1,PECAM1(CD31),DPT,SPIN1,MYADM,SORBS2,TUBB6,MAN1A1,ADAMTS1,FBLN1,PTGER4,HIF1A,COL5A1,NEAT1,WWTR1,TCEAL9,COL12A1,IGFBP4,ITGA9,NR2F2,CD302,VGLL3,LEP,ADD3,ENPP2,CLIC4,LAMB1,THBS4,KIDINS220,IL6ST(GP130),CCNG2,SRGN,DKK3,MYLK,COL6A2,PNMAL1,ITM2A,PPIB,BNIP3L,PHACTR2,SDC2,MSN,MEGF10,NOV (CCN3),CPE,IFI16,C1S,TMEM135,CILP,FOSB,PLS3,FTX,SH3BP5,ETS2,DPYSL2,CD81,CLSTN2,MCAM (CD146),HSP90B1,LAMA4,TGFBR3,TNC,PREPL,COL6A3,THRSP,MAP3K7CL,CXCL14,NOTCH2,SAMHD1,CTHRC1,SPARC,ACTR3,CDH11,SYNE3,ADIRF,FAM13A,SLC5A3,AOC3,FAM3C,YWHAQ,GPX3,TGFBR2,MTSS1,SLC38A1,GAS1,KANK1,KLF6,VIM,UCP2,SYNPO2,NFKBIZ,DUSP6,MAP1B,EPS8,HINT3,PPP1R12B,PRKAB2,ASB8,ACAP1,ASB2,MLF1,CISD1,ADGRD2,MAP2K7,SAR1B,NANOS1,MN1,FAM166B,FHL3,HRASLS,PHKG1,FBXO32,TCEA3,MLEC,KCNS3,PCYOX1,ADSSL1,SON,PGAM2,MAPK12,CCDC69,ATP2A2,ZEB1,UBXN4,SLC1A4,FBP2,DLG4,LPIN1,PHTF2,SSPN,FKBP5,MIR1-1HG,USO1,PEBP4,ATP2B2,HECTD1,PDP1,ITGB6,SLC16A3,OR7E47P,SRRM2,ZYG11B,MAP4,DHRS7C,NRAP,SAMD4A,MTUS1,NXPE3,IGFBP5,AQP4,PPP1R3A,TTN,CMYA5,

Thanks!

gene mapping ensembl • 18k views

See the most comprehensive post covering ID conversion: "Gene Id Conversion Tool". Let me know if that works for you.


Thank you guys, it worked!


Please use Add comment for comments and validate the answer that worked for you.

7.3 years ago
GenoMax 147k

You should be able to use BioMart to reverse the process described in this Ensembl help link.

7.3 years ago
library(ensembldb)
library(EnsDb.Hsapiens.v79)
# Gene symbols to look up
genes=c("CYR61")
# Match by gene name and keep only Ensembl gene IDs that start with "ENSG"
genes(EnsDb.Hsapiens.v79, filter=list(GenenameFilter(genes),GeneIdFilter("ENSG", "startsWith")), return.type="data.frame", columns=c("gene_id"))

output:

   gene_id            gene_name
1 ENSG00000142871     CYR61

Hi,

I get an error with the following code:

library(ensembldb)
library(EnsDb.Hsapiens.v79)
genes=c("CEACAM8", "CXCL1","CXCL2","CXCL5","CXCL8","MPO","ARG1","FCGR3B",
        "CCL20","CCL2","IL1RN","IL10","TGFB1","IL6","CCL3","CCL4")
genes(EnsDb.Hsapiens.v79, filter=list(GenenameFilter(genes),GeneIdFilter("ENSG", "startsWith")), 
      return.type="data.frame", columns=c("gene_id"))

Error in is(x, "list") : could not find function "GeneIdFilter"


Make sure that you have the most recent versions of R and the R packages. Clear the session, check that the necessary packages are loaded in the same order, and make sure that no functions are masked by other packages. Then try the following in R:

library(ensembldb)
library(EnsDb.Hsapiens.v86)
genes=c("CEACAM8", "CXCL1","CXCL2","CXCL5","CXCL8","MPO","ARG1","FCGR3B",
        "CCL20","CCL2","IL1RN","IL10","TGFB1","IL6","CCL3","CCL4")
genes(EnsDb.Hsapiens.v86, filter=list(GenenameFilter(genes),GeneIdFilter("ENSG", "startsWith")), 
      return.type="data.frame", columns=c("gene_id"))

working session:

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_IN.UTF-8       LC_NUMERIC=C               LC_TIME=en_IN.UTF-8        LC_COLLATE=en_IN.UTF-8    
 [5] LC_MONETARY=en_IN.UTF-8    LC_MESSAGES=en_IN.UTF-8    LC_PAPER=en_IN.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_IN.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.2.2           AnnotationFilter_1.2.0    GenomicFeatures_1.30.3   
 [5] AnnotationDbi_1.40.0      Biobase_2.38.0            GenomicRanges_1.30.3      GenomeInfoDb_1.14.0      
 [9] IRanges_2.12.0            S4Vectors_0.16.0          BiocGenerics_0.24.0      

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.8.1    progress_1.1.2                lattice_0.20-35              
 [4] htmltools_0.3.6               rtracklayer_1.38.3            yaml_2.1.18                  
 [7] interactiveDisplayBase_1.16.0 blob_1.1.1                    XML_3.98-1.10                
[10] DBI_0.8                       BiocParallel_1.12.0           bit64_0.9-7                  
[13] matrixStats_0.53.1            GenomeInfoDbData_1.0.0        stringr_1.3.0                
[16] zlibbioc_1.24.0               ProtGenerics_1.10.0           Biostrings_2.46.0            
[19] memoise_1.1.0                 biomaRt_2.34.2                httpuv_1.3.6.2               
[22] BiocInstaller_1.28.0          curl_3.2                      Rcpp_0.12.16                 
[25] xtable_1.8-2                  DelayedArray_0.4.1            XVector_0.18.0               
[28] mime_0.5.1                    bit_1.1-12                    Rsamtools_1.30.0             
[31] AnnotationHub_2.10.1          RMySQL_0.10.14                digest_0.6.15                
[34] stringi_1.1.7                 shiny_1.0.5                   grid_3.4.4                   
[37] tools_3.4.4                   bitops_1.0-6                  magrittr_1.5                 
[40] RCurl_1.95-4.10               lazyeval_0.2.1                RSQLite_2.0                  
[43] pkgconfig_2.0.1               Matrix_1.2-12                 prettyunits_1.0.2            
[46] assertthat_0.2.0              httr_1.3.1                    R6_2.2.2                     
[49] GenomicAlignments_1.14.1      compiler_3.4.4

Yes, it was an R version problem. Thank you.

7.3 years ago

If you have the mygene library installed in Python, you could use the following Python script:

#!/usr/bin/env python

import sys
import mygene

mg = mygene.MyGeneInfo()

genes = []
for line in sys.stdin:
    genes.append(line.strip())

for gene in genes:
    result = mg.query(gene, scopes="symbol", fields=["ensembl"], species="human", verbose=False)
    hgnc_name = gene
    for hit in result["hits"]:
        if "ensembl" in hit and "gene" in hit["ensembl"]:
            sys.stdout.write("%s\t%s\n" % (hgnc_name, hit["ensembl"]["gene"]))
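
One edge case may be worth guarding against. As far as I know, MyGene.info can return the "ensembl" field either as a single object or as a list of objects when one record maps to several Ensembl genes, and a check like `"gene" in hit["ensembl"]` would silently skip the list form. Below is a minimal sketch of a helper that handles both shapes. The name `ensembl_gene_ids` and the sample hits are hypothetical, and the list-shaped response is an assumption to verify against your own results:

```python
def ensembl_gene_ids(hit):
    """Return all Ensembl gene IDs found in one MyGene-style hit dict.

    Assumption: hit["ensembl"] is either a dict like {"gene": "ENSG..."}
    or a list of such dicts; anything else yields an empty list.
    """
    entry = hit.get("ensembl")
    if entry is None:
        return []
    # Normalise the single-object form to a one-element list.
    entries = entry if isinstance(entry, list) else [entry]
    return [e["gene"] for e in entries if isinstance(e, dict) and "gene" in e]


# Hand-written sample hits (not real API output), using IDs from this thread.
single = {"ensembl": {"gene": "ENSG00000142676"}}
multiple = {"ensembl": [{"gene": "ENSG00000225235"}, {"gene": "ENSG00000165359"}]}

print(ensembl_gene_ids(single))
print(ensembl_gene_ids(multiple))
print(ensembl_gene_ids({"symbol": "no_ensembl_field"}))
```

In the per-gene loop, the inner `if` could then be replaced by iterating over `ensembl_gene_ids(hit)` and writing one line per ID.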

If you don't have mygene installed and you want to install it, you could run the following:

$ pip install mygene

As an example, here are HGNC names of genes in a file called "hgnc.txt":

DDX26B
CCDC83
MAST3
RPL11
ZDHHC20
LUC7L3
SNORD49A
CTSH
ACOT8

The above script would give the following output:

$ ./map_hgnc_to_ensg.py < hgnc.txt
DDX26B  ENSG00000225235
DDX26B  ENSG00000165359
CCDC83  ENSG00000150676
MAST3   ENSG00000099308
RPL11   ENSG00000142676
ZDHHC20 ENSG00000180776
ZDHHC20 ENSG00000236953
LUC7L3  ENSG00000108848
SNORD49A        ENSG00000277370
CTSH    ENSG00000103811
ACOT8   ENSG00000101473
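
Note that DDX26B and ZDHHC20 each map to two Ensembl IDs in this output. Here is a small sketch for flagging such one-to-many mappings before the table is used downstream. The function name `group_by_symbol` is hypothetical, and the sample lines are copied from the output above and assumed to be tab-separated:

```python
from collections import defaultdict


def group_by_symbol(lines):
    """Group tab-separated 'symbol<TAB>ensembl_id' lines by symbol."""
    mapping = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        symbol, ensembl_id = line.split("\t", 1)
        # Keep each ID once per symbol, in first-seen order.
        if ensembl_id not in mapping[symbol]:
            mapping[symbol].append(ensembl_id)
    return dict(mapping)


sample = [
    "DDX26B\tENSG00000225235",
    "DDX26B\tENSG00000165359",
    "CCDC83\tENSG00000150676",
    "ZDHHC20\tENSG00000180776",
    "ZDHHC20\tENSG00000236953",
]

mapping = group_by_symbol(sample)
# Symbols with more than one Ensembl ID need a manual decision.
ambiguous = {s: ids for s, ids in mapping.items() if len(ids) > 1}
for symbol, ids in ambiguous.items():
    print(symbol, ",".join(ids))
```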

You could write the output to a text file like so:

$ ./map_hgnc_to_ensg.py < hgnc.txt > hgnc_mapped_to_ensg.txt
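
The list in the question is comma-separated, and some entries carry an alias in parentheses, e.g. ADIPOQ(ACRP30) or NOV (CCN3), so it needs splitting into one symbol per line before it can be piped into the script. A minimal sketch, assuming the text before the parenthesis is the symbol you want to look up (the function name `clean_symbols` is hypothetical):

```python
import re


def clean_symbols(raw):
    """Split a comma-separated gene list and drop parenthesised aliases."""
    symbols = []
    for item in raw.split(","):
        # Remove "(ALIAS)" plus any space before it, then trim whitespace.
        symbol = re.sub(r"\s*\(.*?\)", "", item).strip()
        # Skip empty items (e.g. after a trailing comma) and duplicates.
        if symbol and symbol not in symbols:
            symbols.append(symbol)
    return symbols


# A few entries taken from the list in the question, with its trailing comma.
raw = "ADIPOQ(ACRP30),CYR61,NOV (CCN3),MCAM (CD146),TTN,CMYA5,"
for symbol in clean_symbols(raw):
    print(symbol)
```

Redirecting the printed lines to a file gives an input in the same one-symbol-per-line shape as hgnc.txt.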

in R:

library(mygene)
mygenes=c("TP53","AGTR1")
queryMany(mygenes, scopes="symbol", fields="ensembl.gene", species="human")[,3:4]

output:

DataFrame with 2 rows and 2 columns
        query    ensembl.gene
  <character>     <character>
1        TP53 ENSG00000141510
2       AGTR1 ENSG00000144891

Something interesting is going on with the gene "DDX26B" in the input list. Neither the R package nor the Python library for mygene fetches the corresponding Ensembl ID when querymany is used. If one instead uses the "query" function in mygene (in R) for DDX26B, the mygene server returns the correct Ensembl gene ID. To cross-check the ID, I used the ensembldb package in R, and the IDs match. It seems querymany and query behave differently for "DDX26B". I also checked the vignette (the PDF bundled with the mygene R package), and there too the querymany output lacks the corresponding Ensembl gene ID.
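
When running a batch lookup, it helps to list explicitly which symbols came back without an Ensembl ID. The sketch below splits a querymany-style result list into mapped and unmapped symbols. It assumes each result is a dict with a "query" key and that misses are marked with "notfound", which is how I understand the mygene client to report them. The function name `split_results` and the sample results are hand-written for illustration and are not real server output:

```python
def split_results(results):
    """Return ({symbol: [ensembl_ids]}, [symbols_without_any_id])."""
    mapped = {}
    seen = []
    for res in results:
        symbol = res["query"]
        if symbol not in seen:
            seen.append(symbol)
        entry = res.get("ensembl")
        if res.get("notfound") or entry is None:
            continue
        # Assumption: "ensembl" is a dict or a list of dicts with a "gene" key.
        entries = entry if isinstance(entry, list) else [entry]
        for e in entries:
            if "gene" in e:
                mapped.setdefault(symbol, []).append(e["gene"])
    unmapped = [s for s in seen if s not in mapped]
    return mapped, unmapped


# Illustrative input only; the real response for DDX26B may look different.
sample = [
    {"query": "TP53", "ensembl": {"gene": "ENSG00000141510"}},
    {"query": "AGTR1", "ensembl": {"gene": "ENSG00000144891"}},
    {"query": "DDX26B", "notfound": True},
]

mapped, unmapped = split_results(sample)
print(mapped)
print("unmapped:", unmapped)
```

Symbols in the unmapped list are the ones worth retrying with a per-gene query.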


Thanks for that feedback — really useful to know! I edited my answer with a per-gene query, which seems to give a more complete answer than the querymany approach. I also posted an issue ticket on the Python client repo: https://github.com/biothings/mygene.py/issues/1


Looks like some additional arguments to the scopes option in querymany can resolve the issue: https://github.com/biothings/mygene.py/issues/1#issuecomment-320796036


Thanks for the comment; I went through the post on GitHub :) @Alex Reynolds

7.3 years ago
h.mon 35k

Check the links suggested by genomax and Michael Dondrup; they have plenty of solutions. For the sake of having an example here, I will give an R/Bioconductor solution using the package pathview:

library(pathview)
# geneNames: character vector of gene symbols, e.g. c("CYR61", "TTN")
map.ensembl <- geneannot.map(in.ids = geneNames,
                             in.type = "SYMBOL", out.type = "ENSEMBL", org = "Hs")

Check geneannot.map, eg2id and id2eg documentation.

7.1 years ago
gaoteng ▴ 70

It seems that there are already many good solutions out there, but here is a Python script I wrote that serves the same purpose: https://github.com/teng-gao/genomics_utils
