Hello,
What is the annotation of the IDs in TCGA gene expression data, such as ENSG00000242268.2 and ENSG00000146083.10?
How can I interpret them? Thanks!
They are Ensembl gene IDs. See here for converting them to gene symbols:
http://www.ensembl.org/info/data/biomart/biomart_r_package.html
TCGA barcode: https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode
I have the following problem after converting the Ensembl gene IDs to HUGO symbols using the biomaRt package:
When integrating DNA-seq and RNA-seq data, the HUGO symbols overlap by only ~75%. For instance, the gene symbol ALPG (converted from the RNA-seq data) is annotated as ALPPL2 in the DNA-seq data.
Has anyone experienced something similar, or does anyone know how to deal with outdated HUGO symbols?
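One way to approach this is to map every symbol to its currently approved symbol before intersecting the two data sets. The following is only a minimal sketch: the `PREVIOUS_TO_CURRENT` table and the `to_current_symbol` helper are hypothetical names, and the table contains just the one pair mentioned above. In practice you would build it from a previous-symbol/alias export.

```python
# Hypothetical lookup table: previous symbol -> currently approved symbol.
# Only the ALPPL2 -> ALPG pair comes from this thread; a real table would
# be built from a previous-symbol/alias export.
PREVIOUS_TO_CURRENT = {"ALPPL2": "ALPG"}


def to_current_symbol(symbol, table=PREVIOUS_TO_CURRENT):
    """Return the current symbol for a possibly outdated one."""
    symbol = symbol.strip()
    # Symbols that are not in the table are assumed to be current already.
    return table.get(symbol, symbol)


# Toy symbol sets standing in for the RNA-seq and DNA-seq annotations.
rna_symbols = {"ALPG", "RNF44"}
dna_symbols = {"ALPPL2", "RNF44"}

rna_current = {to_current_symbol(s) for s in rna_symbols}
dna_current = {to_current_symbol(s) for s in dna_symbols}

# After normalization both data sets agree on ALPG.
shared = sorted(rna_current & dna_current)
print(shared)
```

Without the normalization step, only RNF44 would be shared between the two toy sets.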
Given a Python script that uses the mygene library, called translate-to-hgnc.py:
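#!/usr/bin/env python
import sys
from mygene import MyGeneInfo

mg = MyGeneInfo()
# Ensembl gene IDs with the version suffix (".2", ".10") removed
genes = ["ENSG00000242268", "ENSG00000146083"]
for gene in genes:
    # Look up the gene symbol for each Ensembl ID
    result = mg.query(gene, fields=["symbol"], species="human", verbose=False)
    for hit in result['hits']:
        # Write tab-separated ID/symbol pairs to stdout so they can be redirected to a file
        sys.stdout.write("%s\t%s\n" % (gene, hit['symbol']))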
Then you can get the HGNC names via:
$ ./translate-to-hgnc.py > hgnc-mapping.txt
$ more hgnc-mapping.txt
ENSG00000242268 LINC02082
ENSG00000146083 RNF44
Then you can get a list of genomic positions of HGNC symbols at:
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/refGene.txt.gz | gunzip -c | awk -v OFS="\t" '($9>1){ print $3,$5,$6,$13,$9,$4 }' | sort-bed - > hg38.hgnc.bed
Then you can filter this file for your translated HGNC names:
$ cut -f2 hgnc-mapping.txt > hgnc-list.txt
$ LC_ALL=C grep -F -f hgnc-list.txt hg38.hgnc.bed > hg38.hgnc.filtered.bed
The file hg38.hgnc.filtered.bed contains genomic positions of your HGNC/Ensembl genes of interest, which you can map against other genomic annotations (Gencode, etc.) via bedmap or similar.
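Note that grep -F matches the pattern anywhere on the line, so a short symbol can also match longer names that contain it. If that matters for your gene list, here is a rough sketch of an exact match on the name column instead. It assumes the BED layout produced by the awk command above (symbol in the fourth column); the function name `filter_bed_by_symbol` and the example rows, including their coordinates and the name RNF44X, are made up for illustration.

```python
def filter_bed_by_symbol(bed_lines, symbols):
    """Keep BED lines whose fourth column exactly equals a wanted symbol."""
    wanted = set(symbols)
    kept = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        # Column 4 (index 3) holds the gene symbol in the file built above.
        if len(fields) >= 4 and fields[3] in wanted:
            kept.append(line)
    return kept


# Made-up rows: the second name merely contains "RNF44" as a substring.
example_rows = [
    "chr5\t100\t200\tRNF44\t11\t-\n",
    "chr5\t300\t400\tRNF44X\t3\t+\n",
]

filtered = filter_bed_by_symbol(example_rows, ["RNF44"])
for row in filtered:
    print(row, end="")
```

To run it on the real files, you could pass an open file handle for hg38.hgnc.bed as `bed_lines` and the stripped lines of hgnc-list.txt as `symbols`.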
These are Ensembl gene IDs, one per gene. The digits after the dot are the version number of that Ensembl gene record, so the same stable ID (and therefore the same gene symbol) can appear with different version suffixes.
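Most lookup tools expect the ID without the version suffix, so it is usually stripped first. A small sketch of that step; `split_ensembl_id` is a hypothetical helper name, not part of any library:

```python
def split_ensembl_id(versioned_id):
    """Split 'ENSG00000242268.2' into ('ENSG00000242268', '2').

    The version is returned as a string, or None if the ID has no suffix.
    """
    stable_id, _, version = versioned_id.strip().partition(".")
    return stable_id, (version if version else None)


tcga_ids = ["ENSG00000242268.2", "ENSG00000146083.10"]

# Keep only the stable part for use in symbol lookups.
stable_ids = [split_ensembl_id(i)[0] for i in tcga_ids]
print(stable_ids)
```

The resulting list is the same as the `genes` list used in the mygene script earlier in the thread.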