Question

Annotation of ID from TCGA gene expression data

0

7.2 years ago

Fdota ▴ 10

Hello,

What is the annotation of the IDs from TCGA gene expression data, such as ENSG00000242268.2 and ENSG00000146083.10?

How can I interpret them? Thanks!

tcga gene expression ID annotation • 7.8k views

ADD COMMENT • link updated 2.3 years ago by qiz218591 ▴ 10 • written 7.2 years ago by Fdota ▴ 10

0

These are Ensembl gene IDs, each of which corresponds to a gene symbol. The digits after the dot are the version number, so the same Ensembl ID can appear with several different versions.

ADD REPLY • link 2.3 years ago by qiz218591 ▴ 10

3

7.2 years ago

Alex Reynolds 36k

Given a Python script that uses the mygene library, called translate-to-hgnc.py:

#!/usr/bin/env python

import sys
from mygene import MyGeneInfo

mg = MyGeneInfo()

genes = ["ENSG00000242268", "ENSG00000146083"]

for gene in genes:
    result = mg.query(gene, fields=["symbol"], species="human", verbose=False)
    for hit in result['hits']:
        # Write to stdout so the shell redirection below captures the mapping
        sys.stdout.write("%s\t%s\n" % (gene, hit['symbol']))

Then you can get the HGNC names via:

$ ./translate-to-hgnc.py > hgnc-mapping.txt
$ more hgnc-mapping.txt
ENSG00000242268 LINC02082
ENSG00000146083 RNF44

Then you can get a list of genomic positions of HGNC symbols with:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/refGene.txt.gz | gunzip -c | awk -v OFS="\t" '($9>1){ print $3,$5,$6,$13,$9,$4 }' | sort-bed - > hg38.hgnc.bed

Then you can filter this file for your translated HGNC names:

$ cut -f2 hgnc-mapping.txt > hgnc-list.txt
$ LC_ALL=C grep -F -w -f hgnc-list.txt hg38.hgnc.bed > hg38.hgnc.filtered.bed

The file hg38.hgnc.filtered.bed contains genomic positions of your HGNC/Ensembl genes of interest, which you can map against other genomic annotations (Gencode, etc.) via bedmap or similar.
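If substring matches are a concern in the filtering step, the same filter can be sketched in Python with an exact match on the symbol column. This is only an illustration, assuming the six-column layout produced by the awk command above (symbol in column 4); the function name, the placeholder symbols GENE1 and GENE10, and the coordinates are all made up.

```python
def filter_bed_by_symbol(bed_lines, symbols):
    """Keep BED lines whose 4th column exactly matches one of the symbols."""
    wanted = set(symbols)
    kept = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        # Column 4 (index 3) is assumed to hold the gene symbol
        if len(fields) > 3 and fields[3] in wanted:
            kept.append(line)
    return kept


# Made-up records: GENE1 and GENE10 are placeholder symbols, not real genes
example_bed = [
    "chr1\t100\t200\tGENE1\t5\t-\n",
    "chr1\t300\t400\tGENE10\t3\t+\n",
]

filtered = filter_bed_by_symbol(example_bed, ["GENE1"])
print(len(filtered))  # 1: GENE10 is not kept, although it contains "GENE1"
```

Matching on the whole field rather than on a substring keeps a short symbol from pulling in longer symbols that merely contain it.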

ADD COMMENT • link 7.2 years ago by Alex Reynolds 36k

0

We just need to remove the digits after the dot, right? Is ENSG00000242268.2 actually the same as ENSG00000242268? What does ".2" indicate? Thanks.

ADD REPLY • link 5.8 years ago by Shicheng Guo ★ 9.6k

0

The digits after the dot are the version number of the ID; the Ensembl website describes how versions work.
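As a minimal sketch of dropping the version before a lookup, the suffix can be split off at the first dot. The helper name strip_version is made up for illustration, and the sketch assumes every ID follows the "stable ID, dot, version" pattern seen in the question.

```python
def strip_version(ensembl_id: str) -> str:
    """Return the part of the ID before the first dot (the stable ID)."""
    return ensembl_id.split(".", 1)[0]


ids = ["ENSG00000242268.2", "ENSG00000146083.10"]
stable_ids = [strip_version(i) for i in ids]
print(stable_ids)  # ['ENSG00000242268', 'ENSG00000146083']
```

An ID that has no dot is returned unchanged, so the helper can be applied to mixed lists.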

ADD REPLY • link 2.3 years ago by qiz218591 ▴ 10

score 2 · Accepted Answer · 2017-09-28

2

7.2 years ago

poisonAlien ★ 3.2k

They are Ensembl gene IDs. See here for converting them to gene symbols.

ADD COMMENT • link 7.2 years ago by poisonAlien ★ 3.2k

score 2 · Accepted Answer · 2017-10-01

2

7.2 years ago

Fdota ▴ 10

http://www.ensembl.org/info/data/biomart/biomart_r_package.html

ADD COMMENT • link 7.2 years ago by Fdota ▴ 10

0

TCGA barcode: https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode

ADD REPLY • link 7.2 years ago by Fdota ▴ 10

0

I have the following problem after converting the Ensembl gene IDs to HUGO symbols using the biomaRt package:

When integrating DNA-seq and RNA-seq data, the HUGO symbols overlap by only ~75%; for instance, the gene symbol ALPG (converted from RNA-seq) is annotated as ALPPL2 in the DNA-seq data.

Did you experience something similar/does anyone know how to deal with old HUGO symbols?

ADD REPLY • link 5.2 years ago by susibing ▴ 20
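One possible way to approach the outdated-symbol problem raised above is to map both symbol lists onto current symbols before intersecting them. The sketch below is only an illustration: the lookup table is typed by hand, and it assumes ALPPL2 is the older name for ALPG, as the comment suggests. In practice such a table would more likely be built from the previous-symbol information that HGNC publishes. Where both datasets carry Ensembl IDs, joining on those instead may avoid the issue altogether.

```python
# Hypothetical lookup of previous symbol -> current symbol (assumption:
# ALPPL2 is the older name of ALPG, as implied in the comment above)
PREVIOUS_TO_CURRENT = {"ALPPL2": "ALPG"}


def harmonize(symbols, previous_to_current):
    """Replace any symbol found in the lookup with its current symbol."""
    return [previous_to_current.get(s, s) for s in symbols]


rna_symbols = ["ALPG", "RNF44"]    # symbols derived from RNA-seq
dna_symbols = ["ALPPL2", "RNF44"]  # symbols as annotated in DNA-seq

shared = set(harmonize(rna_symbols, PREVIOUS_TO_CURRENT)) & set(
    harmonize(dna_symbols, PREVIOUS_TO_CURRENT)
)
print(sorted(shared))  # ['ALPG', 'RNF44']
```

Without the harmonization step, the intersection of the two example lists would contain only RNF44.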