Hi, I apologise if any of the things mentioned below have been asked before, but after quite a long search I have not been able to find any answers.
I am new to the community and am trying to learn some basic data exploration on open data found within TCGA, particularly methylation beta values as well as gene expression values.
My questions are as follows. I would greatly appreciate any help or a guide that I can refer to:
- The DNA methylation beta values are derived from the Illumina 450K array. However, they are indexed by probe ID, and I would like to convert these to their respective ensembl_gene_ids and hugo_symbols, preferably using an annotation updated to GRCh38.
The Illumina manifest file available on their website seems to be incomplete, and I am unsure why. Meanwhile, there is another annotation file created by AP Zhou here: https://zwdzwd.github.io/InfiniumAnnotation which I am looking at since it is updated to GRCh38. However, I do not understand why multiple genes are mapped to a single CpG. Is it because the region is defined as extending 1.5 kb upstream and downstream of the transcription start site?
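To make the multiple-genes-per-probe issue concrete, here is a minimal sketch of how I imagine flattening such an annotation into one row per probe-gene pair. The column names (`probeID`, `genesUniq`), the `;` separator, and the toy rows are assumptions for illustration only; I have not verified them against the actual file.

```python
import csv
import io

# Toy stand-in for the annotation table. Column names and the ';' separator
# are assumptions, and the probe IDs and gene names are made up.
TOY_MANIFEST = (
    "probeID\tgenesUniq\n"
    "cg00000001\tGENE_A;GENE_B\n"
    "cg00000002\tGENE_C\n"
    "cg00000003\t\n"
)

def probe_gene_pairs(handle, probe_col="probeID", gene_col="genesUniq", sep=";"):
    """Yield one (probe, gene) pair for every gene listed against a probe."""
    for row in csv.DictReader(handle, delimiter="\t"):
        for gene in (row[gene_col] or "").split(sep):
            gene = gene.strip()
            if gene:  # probes with no annotated gene are skipped
                yield row[probe_col], gene

pairs = list(probe_gene_pairs(io.StringIO(TOY_MANIFEST)))
print(pairs)
```

With a long table like this, a probe annotated to two genes simply contributes two rows, which is the behaviour I am trying to understand.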
- One thing in common for both the DNA methylation beta files and the gene expression files is that they rely on gene_names. However, I have not been able to find a way to map all of them. In the annotation file by AP Zhou, it seems that "gene names" are mixed in with "hugo_symbol" entries. I have tried running the "gene_names" through BioMart and have not been able to find their respective ensembl_gene_ids or hugo_symbols. I am not sure how to go about converting these "gene names", nor do I have any clue as to what they are (I assumed they were Entrez accession numbers, but searching on that basis yielded no results either).
So the question here would be: what are those gene names, and how should I convert them?
Here are some examples of the gene names that I was unable to find results for: AC008972.1 AL162431.1 AL161731.1 AC018766.1 AC245100.4 AL160171.1
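For reference, this is a small sketch of how the problematic names could be separated from the ones that look like ordinary symbols before any lookup. The pattern (two capital letters, six digits, a dot, a version number) is only my guess from the examples above, and `TP53` is added purely as a contrasting example.

```python
import re

# Assumed pattern, inferred only from the six examples above.
ACCESSION_LIKE = re.compile(r"^[A-Z]{2}\d{6}\.\d+$")

names = [
    "AC008972.1", "AL162431.1", "AL161731.1",
    "AC018766.1", "AC245100.4", "AL160171.1",
    "TP53",  # contrasting example, not from my data
]

def split_names(names):
    """Split names into accession-like ones and everything else."""
    accession_like = [n for n in names if ACCESSION_LIKE.match(n)]
    other = [n for n in names if not ACCESSION_LIKE.match(n)]
    return accession_like, other

accession_like, other = split_names(names)
print(accession_like)
print(other)
```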
- I would also like to ask whether there are any current streamlined methods for analysing DNA methylation values with respect to their genes between matched cancer-normal patient samples, and if so, are there any papers that discuss them?
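To show what I mean by a matched comparison, here is a naive sketch that pairs samples by patient and averages the tumour-minus-normal beta difference per probe. It is not an established method. It assumes TCGA-style barcodes in which the first 12 characters identify the patient and the sample-type code is `01` for primary tumour and `11` for solid tissue normal; the barcodes and beta values below are made up.

```python
from statistics import mean

def paired_delta_beta(betas):
    """betas maps sample barcode -> {probe_id: beta}.

    Returns {probe_id: mean(tumour - normal)} over patients that have both
    a tumour ("01") and a normal ("11") sample.
    """
    tumour, normal = {}, {}
    for barcode, values in betas.items():
        patient, sample_type = barcode[:12], barcode[13:15]
        if sample_type == "01":
            tumour[patient] = values
        elif sample_type == "11":
            normal[patient] = values

    deltas = {}
    for patient in tumour.keys() & normal.keys():  # matched patients only
        for probe, beta_t in tumour[patient].items():
            beta_n = normal[patient].get(probe)
            if beta_t is not None and beta_n is not None:
                deltas.setdefault(probe, []).append(beta_t - beta_n)
    return {probe: mean(d) for probe, d in deltas.items()}

# Made-up example: two matched patients and one tumour without a normal.
toy = {
    "TCGA-AA-0001-01A": {"cg1": 0.75, "cg2": 0.25},
    "TCGA-AA-0001-11A": {"cg1": 0.25, "cg2": 0.25},
    "TCGA-AA-0002-01A": {"cg1": 0.5, "cg2": 0.5},
    "TCGA-AA-0002-11A": {"cg1": 0.5, "cg2": 0.25},
    "TCGA-AA-0003-01A": {"cg1": 0.9},
}
result = paired_delta_beta(toy)
print(result)
```

What I am unsure about is how to go from per-probe differences like these to a per-gene summary, which is why the probe-to-gene mapping above matters to me.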
Thank you so much in advance!