Question

Convert Ensembl gene IDs to HGNC symbols in a file with several columns

4.2 years ago

storm1907 ▴ 30

Hello, following suggestions in this forum, I installed MariaDB to convert ENSG IDs to HGNC symbols. The problem is that I have complicated files with several columns:

GENE    CASE_COUNT_HET  CASE_COUNT_CH   CASE_COUNT_HOM  CASE_TOTAL_AC   CONTROL_COUNT_HET       CONTROL_COUNT_HOM       CONTROL_TOTAL_AC        P_DOM   P_REC
ENSG00000005022 0       0       0       0       0       0       0       1       1
ENSG00000006327 1       0       0       1       0       0       0       0.000223634187089594    1
ENSG00000007376 1       0       0       1       0       0       0       0.000223634187089594    1
ENSG00000008838 1       0       0       1       0       0       0       0.000223634187089594    1
ENSG00000013503 1       0       0       1       0       0       0       0.000223634187089594    1
ENSG00000013573 1       0       0       1       82169   0       82169   1       1
ENSG00000034152 1       1       0       2       0       0       0       0.000223634187089594    0.000223634187089594
ENSG00000035115 1       0       0       1       50927   0       50927   0.99999999348136        1

I launched these commands:

mysql -h ensembldb.ensembl.org --port 5306 -u anonymous -D homo_sapiens_core_64_37 -A

> select distinct
   G.stable_id,
   S.synonym
from
  gene_stable_id as G,
  object_xref as OX,
  external_synonym as S,
  xref as X ,
  external_db as D
where
  D.external_db_id=X.external_db_id and
  X.xref_id=S.xref_id and
  OX.xref_id=X.xref_id and
  OX.ensembl_object_type="GENE" and
  G.gene_id=OX.ensembl_id and
  G.stable_id in ("file.txt");

But the query keeps returning "Empty set (0.041 sec)".

Does anybody have experience with MariaDB and conversions?

Thank you!

mariadb • 2.5k views

ADD COMMENT • link 4.2 years ago by storm1907 ▴ 30

The thread "Converting HGNC to ensembl and entrez id's using biomart" goes in the opposite direction from what you want, but you can use the same package (biomaRt) to get HGNC symbols.

mart <- useMart('ENSEMBL_MART_ENSEMBL', host = 'www.ensembl.org')
mart <- useDataset('hsapiens_gene_ensembl', mart)
mapping <- getBM(
    attributes = c('hgnc_symbol', 'ensembl_gene_id'),
    mart = mart,
    uniqueRows = TRUE,
    bmHeader = TRUE)
head(mapping)
  HGNC symbol  Gene stable ID
1       MT-TF ENSG00000210049
2     MT-RNR1 ENSG00000211459
3       MT-TV ENSG00000210077
4     MT-RNR2 ENSG00000210082
5      MT-TL1 ENSG00000209082
6      MT-ND1 ENSG00000198888

Using UCSC MySQL server: Converting Ensembl Gene Ids To Hgnc Gene Name / Coordinates
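
On the MySQL route from the question: in ("file.txt") compares stable_id against the literal string file.txt, not against the contents of that file, which would explain the empty set. Below is a minimal Python sketch that builds the list to paste into the IN clause, assuming the gene IDs sit in the first whitespace-separated column of the file. The function names are made up for illustration, and this only addresses the empty result; I have not checked that the rest of the query returns HGNC symbols.

```python
import re

# Pattern for human Ensembl gene IDs such as ENSG00000005022.
ENSG_PATTERN = re.compile(r"^ENSG\d{11}$")


def read_gene_ids(lines):
    """Return unique Ensembl gene IDs from the first column, in file order."""
    ids = []
    seen = set()
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        gene_id = fields[0]
        # The header row ("GENE ...") and malformed values fail the pattern.
        if ENSG_PATTERN.match(gene_id) and gene_id not in seen:
            seen.add(gene_id)
            ids.append(gene_id)
    return ids


def build_in_clause(ids):
    """Format IDs as a quoted, comma-separated list for an SQL IN (...)."""
    return "(" + ", ".join('"%s"' % gene_id for gene_id in ids) + ")"


# A few rows shaped like the table in the question (columns trimmed).
example = [
    "GENE\tCASE_COUNT_HET\tP_DOM",
    "ENSG00000005022\t0\t1",
    "ENSG00000006327\t1\t0.000223634187089594",
]
print(build_in_clause(read_gene_ids(example)))
```

With a real file, pass the open file handle instead of the example list, e.g. read_gene_ids(open("file.txt")), and substitute the printed list for ("file.txt") in the query.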

ADD REPLY • link 4.2 years ago by GenoMax 153k

score 0 · Answer 1 · 2021-06-28

4.2 years ago

Alex Reynolds 36k

You could use the mygene.info service, instead:

#!/usr/bin/env python

import sys
import io
import mygene

genes_str = '''ENSG00000005022
ENSG00000006327
ENSG00000007376
ENSG00000008838
ENSG00000013503
ENSG00000013573
ENSG00000034152
ENSG00000035115
'''

# read in Ensembl gene IDs; swap in sys.stdin or another file handle as needed
genes_fh = io.StringIO(genes_str)
genes = []
for gene in genes_fh:
    genes.append(gene.rstrip())

# write out Ensembl gene IDs and HGNC symbols, skipping IDs without a symbol
mg = mygene.MyGeneInfo()
result = mg.querymany(genes, scopes="ensembl.gene", fields=["symbol"], species="human", verbose=False)
for res in result:
    if "symbol" in res:
        sys.stdout.write("%s\t%s\n" % (res["query"], res["symbol"]))

Example result:

% ./so947764.py
ENSG00000005022 SLC25A5
ENSG00000006327 TNFRSF12A
ENSG00000007376 RPUSD1
ENSG00000008838 MED24
ENSG00000013503 POLR3B
ENSG00000013573 DDX11
ENSG00000034152 MAP2K3
ENSG00000035115 SH3YL1
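
The script above silently skips any ID that comes back without a symbol. If you want to know which IDs were dropped, here is a small sketch that splits the returned list into a lookup table and a list of misses. It assumes each hit is a dict with a "query" key and, when a match exists, a "symbol" key, as in the script above. The helper name and the second example ID are made up for illustration.

```python
def symbols_from_hits(hits):
    """Split mygene-style hits into an ID-to-symbol dict and a list of misses."""
    mapping = {}
    queries = []
    for hit in hits:
        query = hit["query"]
        if query not in queries:
            queries.append(query)
        if "symbol" in hit:
            # Keep the first symbol seen if an ID comes back more than once.
            mapping.setdefault(query, hit["symbol"])
    # Anything queried that never produced a symbol counts as a miss.
    missing = [query for query in queries if query not in mapping]
    return mapping, missing


# Hand-written hits: the first matches the example result above,
# the second uses a made-up ID to stand in for an unmatched query.
example_hits = [
    {"query": "ENSG00000005022", "symbol": "SLC25A5"},
    {"query": "ENSG00000000000", "notfound": True},
]
mapping, missing = symbols_from_hits(example_hits)
print(mapping)
print(missing)
```

In the script, you would call it on the list returned by mg.querymany(...) and write the missing IDs to stderr or a separate file.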

It should be straightforward to extract just the column of gene IDs from your file. Write that column to a text file (e.g. genes.txt) and have the script read it from sys.stdin. That is, replace:

genes_fh = io.StringIO(genes_str)
genes = []
for gene in genes_fh:
    genes.append(gene.rstrip())

With:

genes = []
for gene in sys.stdin:
    genes.append(gene.rstrip())

Then run the script like so:

% ./so947764.py < genes.txt
...
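
If you would rather keep all the other columns and add the symbols alongside the IDs, something like the following may work. It is a sketch only: it assumes the input is tab-delimited with a header row and the Ensembl ID in the first column, and that you already have an ID-to-symbol dict (for example, built from the querymany results). The function name, the SYMBOL column name, and the "NA" placeholder are my own choices, and the second ID in the example is made up.

```python
import csv
import io


def add_symbol_column(in_fh, out_fh, id_to_symbol, missing="NA"):
    """Copy a tab-delimited table, inserting a SYMBOL column after column 1."""
    reader = csv.reader(in_fh, delimiter="\t")
    writer = csv.writer(out_fh, delimiter="\t", lineterminator="\n")
    header = next(reader, None)
    if header is None:
        return  # empty input, nothing to write
    writer.writerow([header[0], "SYMBOL"] + header[1:])
    for row in reader:
        if not row:
            continue  # skip blank lines
        symbol = id_to_symbol.get(row[0], missing)
        writer.writerow([row[0], symbol] + row[1:])


# Small in-memory example; the second ID is made up and has no symbol.
table = io.StringIO(
    "GENE\tCASE_COUNT_HET\tP_DOM\n"
    "ENSG00000005022\t0\t1\n"
    "ENSG00000000000\t1\t1\n"
)
result = io.StringIO()
add_symbol_column(table, result, {"ENSG00000005022": "SLC25A5"})
print(result.getvalue(), end="")
```

For real files, open the input and output with open(..., newline="") and pass those handles in place of the StringIO objects.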

ADD COMMENT • link 4.2 years ago by Alex Reynolds 36k

Thank you! I installed the mygene module in PyCharm and ran so947764.py < file.txt, but got:

Traceback (most recent call last):
  File "/path/so947764.py", line 18, in <module>
    genes_fh = io.StringIO(genes_str)
TypeError: initial_value must be unicode or None, not str

ADD REPLY • link 4.2 years ago by storm1907 ▴ 30

I think that is a Python 2 error: in Python 2, io.StringIO only accepts unicode strings. Set up a virtual environment with Python 3 and run the script there. Or, if you stay on Python 2, put the letter u before the string, e.g.:

genes_str = u'''ENSG00000005022
ENSG00000006327
ENSG00000007376
ENSG00000008838
ENSG00000013503
ENSG00000013573
ENSG00000034152
ENSG00000035115
'''

ADD REPLY • link 4.2 years ago by Alex Reynolds 36k

OK, I changed the Python version. Is the usage script.py < file.txt correct?

ADD REPLY • link 4.2 years ago by storm1907 ▴ 30

score 0 · Answer 2 · 2021-06-29

4.2 years ago

storm1907 ▴ 30

Thank you all!

In addition to these solutions, I found that this one also works: https://github.com/vanallenlab/EnsemblToHGNC

ADD COMMENT • link 4.2 years ago by storm1907 ▴ 30