Annotation of ID from TCGA gene expression data
7.2 years ago
Fdota ▴ 10

Hello,

What is the annotation of the IDs in TCGA gene expression data, such as ENSG00000242268.2 and ENSG00000146083.10?

How can I interpret them? Thanks!

tcga gene expression ID annotation • 7.7k views

These are Ensembl gene IDs, each of which corresponds to a gene (and usually to a gene symbol). The digits after the dot are the version number, so the same Ensembl ID can appear with several different versions.

7.2 years ago
poisonAlien ★ 3.2k

They are Ensembl gene IDs. See here for converting them to gene symbols.


I have the following problem after converting the Ensembl gene IDs to HUGO symbols using the biomart package:

When integrating DNA-seq and RNA-seq data, the HUGO symbols overlap by only ~75%; for instance, the gene symbol ALPG (converted from the RNA-seq data) is annotated as ALPPL2 in the DNA-seq data.

Did you experience something similar, or does anyone know how to deal with old HUGO symbols?
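One way to approach the mismatch described above is to map previous symbols to their current ones before comparing the two gene lists. The following is only a minimal sketch: the `PREVIOUS_TO_CURRENT` table and the `harmonize` helper are hypothetical names, and the table holds just the single ALPPL2/ALPG pair mentioned in this comment. In practice such a table would have to be built from an authoritative source of previous symbols (for example, HGNC).

```python
# Hypothetical lookup of previous (outdated) symbols to current symbols.
# Only the pair mentioned in the comment above is included here.
PREVIOUS_TO_CURRENT = {"ALPPL2": "ALPG"}


def harmonize(symbols, previous_to_current):
    """Replace every outdated symbol with its current one; keep the rest as-is."""
    return {previous_to_current.get(symbol, symbol) for symbol in symbols}


# Toy gene lists standing in for the RNA-seq and DNA-seq annotations.
rna_symbols = {"ALPG", "RNF44"}
dna_symbols = {"ALPPL2", "RNF44"}

raw_overlap = rna_symbols & dna_symbols
harmonized_overlap = (
    harmonize(rna_symbols, PREVIOUS_TO_CURRENT)
    & harmonize(dna_symbols, PREVIOUS_TO_CURRENT)
)

print(sorted(raw_overlap))
print(sorted(harmonized_overlap))
```

Harmonizing both lists, rather than only one, avoids having to know which data set carries the older annotation.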

7.2 years ago

Given a Python script that uses the mygene library, called translate-to-hgnc.py:

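#!/usr/bin/env python

import sys
from mygene import MyGeneInfo

mg = MyGeneInfo()

# Ensembl gene IDs with the version suffix (".2", ".10") removed
genes = ["ENSG00000242268", "ENSG00000146083"]

for gene in genes:
    # Look up each ID and request only the gene symbol field
    result = mg.query(gene, fields=["symbol"], species="human", verbose=False)
    for hit in result['hits']:
        # Write to stdout so the output can be redirected to a file
        sys.stdout.write("%s\t%s\n" % (gene, hit['symbol']))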

Then you can get the HGNC names via:

$ ./translate-to-hgnc.py > hgnc-mapping.txt
$ more hgnc-mapping.txt
ENSG00000242268 LINC02082
ENSG00000146083 RNF44

Then you can get a list of genomic positions of HGNC symbols via:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/refGene.txt.gz | gunzip -c | awk -v OFS="\t" '($9>1){ print $3,$5,$6,$13,$9,$4 }' | sort-bed - > hg38.hgnc.bed

Then you can filter this file for your translated HGNC names:

$ cut -f2 hgnc-mapping.txt > hgnc-list.txt
$ LC_ALL=C grep -F -w -f hgnc-list.txt hg38.hgnc.bed > hg38.hgnc.filtered.bed

The file hg38.hgnc.filtered.bed contains genomic positions of your HGNC/Ensembl genes of interest, which you can map against other genomic annotations (Gencode, etc.) via bedmap or similar.
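If you prefer to do the filtering step in Python, a minimal sketch could look like the following. It assumes the tab-separated column layout produced by the awk command above (chromosome, start, end, gene name, exon count, strand) and matches the gene name exactly against the fourth column. The function name `filter_bed_by_name` is hypothetical, and the coordinates in the sample lines are placeholders, not real positions.

```python
def filter_bed_by_name(bed_lines, wanted_names):
    """Keep BED lines whose fourth column (gene name) is in wanted_names."""
    wanted = set(wanted_names)
    kept = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        # Exact match on the name column, so a symbol never matches
        # another symbol that merely contains it as a substring.
        if len(fields) >= 4 and fields[3] in wanted:
            kept.append(line)
    return kept


# Placeholder BED records (coordinates are made up for illustration).
sample_bed = [
    "chr1\t100\t200\tRNF44\t16\t-\n",
    "chr1\t300\t400\tRNF44X\t3\t+\n",
    "chr2\t500\t600\tOTHER\t2\t+\n",
]

filtered = filter_bed_by_name(sample_bed, ["RNF44", "LINC02082"])
print("".join(filtered), end="")
```

To use it on the real files, read the names from hgnc-list.txt and the lines from hg38.hgnc.bed, then write the returned lines to a new file.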


We just need to remove the digits after the dot, right? Is ENSG00000242268.2 actually the same as ENSG00000242268? What does ".2" indicate? Thanks.


The digits after the dot are the version of the ID; the Ensembl website describes how these versions work.
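As a small illustration of dropping the version, here is a sketch in Python. The helper name `strip_ensembl_version` is hypothetical; it simply keeps everything before the first dot, which is the stable part of the ID.

```python
def strip_ensembl_version(gene_id: str) -> str:
    """Return the part of an Ensembl ID before the first dot."""
    stable_id, _, _version = gene_id.partition(".")
    return stable_id


versioned_ids = ["ENSG00000242268.2", "ENSG00000146083.10"]
stable_ids = [strip_ensembl_version(gene_id) for gene_id in versioned_ids]
print(stable_ids)
```

IDs that carry no version are returned unchanged, so the function is safe to apply to a mixed list.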
