Question

TCGA Methylation Beta and Gene Expression Help

10 months ago

zhenjie.chin88 • 0

Hi, I apologise if any of the things mentioned below have been asked before, but after quite a long search I have not been able to find any answers.

I am new to the community and am trying to learn some basic data exploration on open data found within TCGA, particularly methylation beta values as well as gene expression values.

My questions are as follows; I would greatly appreciate any help or guides I can refer to:

The DNA methylation beta values are derived from the Illumina 450K array. However, they are indexed by probe ID, and I would like to convert these to their respective ensembl_gene_ids and hugo_symbols, preferably updated to GRCh38.

The Illumina manifest file available on their website seems to be incomplete, and I am unsure why. Meanwhile, there is another file created by AP Zhou here: https://zwdzwd.github.io/InfiniumAnnotation which I am looking at since it is updated to GRCh38. However, I do not understand why there are multiple genes mapped to one CpG region. Is it because the region is defined as 1.5 kb upstream to downstream of the transcription start site?

One thing both the DNA methylation beta files and the gene expression files have in common is that they rely on gene_names. However, I have not been able to find a way to map all of them. In the file annotated by AP Zhou, it seems that "gene names" are mixed in with the "hugo_symbol". I have tried running the "gene_names" through BioMart and have not been able to find their respective ensembl_gene_ids or their hugo_symbols. I am not sure how to go about converting these "gene names", nor do I have any clue as to what they are (I am assuming that they are Entrez accession numbers, but even that yielded no results).

So the question here would be: what are those gene names and how should I convert it?

I have attached some examples of the gene names that I was unable to find results for: AC008972.1, AL162431.1, AL161731.1, AC018766.1, AC245100.4, AL160171.1

I would also like to ask whether there are any current streamlined methods for analysing DNA methylation values with respect to their genes between matched cancer and normal patient samples, and if so, whether there are any papers that talk about it.

Thank you so much in advance!

ensembl expression hgnc gene-names methylation • 1.0k views

ADD COMMENT • link updated 9 months ago by yura.grabovska ▴ 810 • written 10 months ago by zhenjie.chin88 • 0

score 0 · Answer 1 · 2024-11-12

Yes, you have correctly worked out why there are multiple genes.
If you downloaded the data from the GDC, the reference files page https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files contains a modified version of Wanding's annotation file with Ensembl IDs that you can map to directly.
Methylation data analysis is tricky.

score 0 · Answer 2 · 2024-11-12

However, I do not understand why there are multiple genes mapped to one CpG region

The easiest answer to this is that the genes are annotated against transcripts and genes can have multiple transcripts. In the case of the examples you posted in your table - you're listing long non-coding RNAs and microRNAs. The loci for these are messy and so the annotation reflects this.
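
To work with one-to-many annotations, it can help to reshape them into one row per probe-gene pair. The following is a minimal base R sketch on toy data. It assumes the annotation stores multiple genes as a semicolon-delimited string; the column names `probe_id` and `gene_names` and all values are hypothetical, so adapt them to the manifest you actually use.

```r
# Toy annotation table; the column names (probe_id, gene_names) and the
# values are hypothetical stand-ins for the real manifest.
annot <- data.frame(
  probe_id   = c("cg0000001", "cg0000002", "cg0000003"),
  gene_names = c("GENE_A;GENE_B", "GENE_C", NA),
  stringsAsFactors = FALSE
)

# Split each semicolon-delimited entry into a character vector of genes.
gene_list <- strsplit(annot$gene_names, ";", fixed = TRUE)

# Build one row per probe-gene pair (long format).
probe_gene <- data.frame(
  probe_id  = rep(annot$probe_id, lengths(gene_list)),
  gene_name = unlist(gene_list),
  stringsAsFactors = FALSE
)

# Drop probes that have no gene annotation at all.
probe_gene <- probe_gene[!is.na(probe_gene$gene_name), ]
print(probe_gene)
```

In long format a probe that maps to two genes simply appears twice, which makes later joins against an expression table by gene name more straightforward.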

So the question here would be: what are those gene names and how should I convert it?

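AC008972.1 is not a protein-coding gene; it is a lncRNA. You can filter these out to begin with by comparing against something like:

# Load the Ensembl-based annotation package (release 86, GRCh38)
library(EnsDb.Hsapiens.v86)
ens.genes <- genes(EnsDb.Hsapiens.v86)
# Keep only protein-coding genes
ens.genes <- ens.genes[ens.genes$gene_biotype == "protein_coding"]
# Symbols to compare your gene names against
ens.genes$symbol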
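
As a quick complement to biotype filtering, the example names in the question all share the same shape: two capital letters, six digits, a dot and a version number. The sketch below flags names of that shape with a regular expression. The pattern is only a heuristic inferred from those six examples, and `TP53` and `MIR21` are added here purely for contrast.

```r
# The six names from the question, plus two ordinary symbols for contrast.
gene_names <- c("AC008972.1", "AL162431.1", "AL161731.1",
                "AC018766.1", "AC245100.4", "AL160171.1",
                "TP53", "MIR21")

# Heuristic pattern (an assumption, not an official naming rule):
# two capital letters, six digits, a dot, then a version number.
accession_like <- grepl("^[A-Z]{2}[0-9]{6}\\.[0-9]+$", gene_names)

# Group the names by whether they match the pattern.
split(gene_names, ifelse(accession_like, "accession_like", "symbol_like"))
```

This only sorts names by appearance. Whether a flagged name should be dropped still depends on its biotype and on the question being asked.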

Are there any current streamlined methods for analysing DNA methylation values with respect to their genes between matched cancer and normal patient samples?

The simplest approach is DMRcate, which identifies differentially methylated regions (DMRs) by combining the methylation of individual probes into windows of similar beta values. It is an R package and has a fairly well annotated vignette. You still won't get a 1:1 association between a probe and a gene, but it at least allows you to collapse probes into DMRs.
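
Before moving to region-level tools, it can be useful to see what a matched tumour-normal comparison looks like for single probes. The sketch below is not DMRcate; it is a plain per-probe paired t-test in base R on made-up beta values, run on M-values (M = log2(beta / (1 - beta))). All probe names, patient names and numbers are hypothetical.

```r
# Toy beta values: rows are probes, columns are patients (hypothetical data).
tumour <- matrix(c(0.80, 0.75, 0.85,
                   0.20, 0.25, 0.22),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("probe_1", "probe_2"),
                                 c("pt1", "pt2", "pt3")))
normal <- matrix(c(0.30, 0.35, 0.28,
                   0.21, 0.24, 0.23),
                 nrow = 2, byrow = TRUE,
                 dimnames = dimnames(tumour))

# Convert beta values to M-values: M = log2(beta / (1 - beta)).
beta_to_m <- function(beta) log2(beta / (1 - beta))

# Per-probe paired t-test on M-values, tumour versus matched normal.
paired_p <- sapply(rownames(tumour), function(p) {
  t.test(beta_to_m(tumour[p, ]), beta_to_m(normal[p, ]), paired = TRUE)$p.value
})

# Mean within-patient difference in beta, kept for interpretability.
delta_beta <- rowMeans(tumour - normal)

data.frame(delta_beta = delta_beta, p_value = paired_p)
```

With real data, the p-values would need multiple-testing correction across all probes, and the columns of the two matrices must be in the same patient order for the pairing to be valid.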

You have to understand that the Illumina methylation arrays contain many different kinds of probes, including probes in CpG islands, probes in the middle of nowhere that have some kind of evidence-based association with a disease, probes associated with regulatory elements, bimodal promoters and so on. Regardless of the approach you take, you will have to accept some set of assumptions and limitations.

How you analyse the data pretty much depends on your intended goal and the question you are seeking to answer.

PS Tim Triche has provided a set of tools for addressing some of the issues you are having, but they are typically geared towards hg19 genome assemblies because that's what the 450K arrays were designed against.

help(FDb.InfiniumMethylation.hg19)

You could still reannotate the probes using hg38 coordinates if you want afterwards.