I'm a beginner in bioinformatics and I have some questions about integrating biological data from RNA-Seq and DNA methylation (Illumina 450k).
I have done some research online and found several articles on concatenation-based data integration, but I have had difficulty reproducing their experiments in order to learn how to manipulate the data.
I would like to integrate the RNA-Seq data with the DNA methylation data. I imagine the integration would be done at the RefGene ID level. However, in the methylation data each sample contains multiple probes giving the methylation levels for the genes, so there are cases where several probes map to the same gene. Below is an example of the DNA methylation data:
Header: 6005486023_R04C02, IlmnID, CHR, UCSC_RefGene_Name, UCSC_RefGene_Group, Relation_to_UCSC_CpG_Island
Data: 0.075187176583887, cg00000029, 16, RBL2, TSS1500, N_Shore
I wonder how to obtain the methylation level of a gene that has several probes. Is it right to calculate the average of these probes? What is the right technique to get the methylation level for each gene?
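To make the averaging idea concrete, here is a minimal Python sketch, assuming one sample's beta values are held in a dict keyed by probe ID and the annotation in a second dict. The function name `gene_level_beta` and the toy probes and values are hypothetical; only the cg00000029 to RBL2 mapping comes from the example above. Averaging is a simple choice and discards differences among CpG sites within a gene, so treat it as one option rather than the right technique.

```python
from collections import defaultdict
from statistics import mean

def gene_level_beta(probe_beta, probe_genes):
    """Average the beta values of all probes annotated to each gene.

    probe_beta:  dict probe ID -> beta value for one sample (None if missing)
    probe_genes: dict probe ID -> UCSC_RefGene_Name string (may be ';'-separated)
    """
    per_gene = defaultdict(list)
    for probe, beta in probe_beta.items():
        if beta is None:
            continue  # skip probes with missing values
        names = probe_genes.get(probe, "")
        # A probe can list several genes, or the same gene repeatedly;
        # count the probe once per distinct gene.
        for gene in set(filter(None, names.split(";"))):
            per_gene[gene].append(beta)
    return {gene: mean(values) for gene, values in per_gene.items()}

# Toy values, made up for illustration.
probe_beta = {"cg00000029": 0.075, "cg_toy_1": 0.125, "cg_toy_2": 0.5, "cg_toy_3": None}
probe_genes = {
    "cg00000029": "RBL2",
    "cg_toy_1": "RBL2;RBL2",   # repeated gene name, counted once
    "cg_toy_2": "GENE_X",
    "cg_toy_3": "GENE_X",      # missing beta, ignored
}
gene_beta = gene_level_beta(probe_beta, probe_genes)
print(gene_beta)
```

Running this per sample gives one value per gene, which can then be stacked into a gene-by-sample matrix.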
I would like to generate a co-expression network from tumor data only. Does this make sense?
Can anyone help me, please?
Thank you very much.
Sorry, I am a bit confused about the data you are trying to show. Are you using TCGA data or your own custom data? If it is TCGA data for both 450k and RNA-Seq, take a look at the linked tool, which shows how to create matrices for RNA-Seq across samples and for 450k probes with intensity values and gene names. You can then use them further in R to create common probes from gene features across both platforms.
If it is your own data, did you process the 450k data and create a matrix of intensity values across samples with the probes as row.names? If so, show it here and map the probes to the various features. Show the head of the matrix, properly formatted, so that I can read it.
In any case, if you have both matrices you should be able to merge them on gene ID once you annotate your 450k probes to gene IDs based on the features you want to look at: CpG sites, shores, islands, promoters, etc.
Please reformat your query and add the things I asked for; then I can help more.
TSS200 and TSS1500 refer to distance from the TSS (transcription start site); check which distances are referred to as promoters, islands and shores.
Hi,
All data were extracted from TCGA: tumor samples, breast cancer.
The mRNA and miRNA expression data are in the correct format. These data are level 3: a matrix containing the expression values (mRNA and miRNA).
The methylation data were obtained at level 1 for each patient sample, so I needed to process them with an analysis pipeline for the Illumina 450K platform. As a result, each sample contains multiple probes with the methylation levels of the genes. We chose level 1 because it carries more information: CpG islands, chromosome, gene region.
The problem is that I do not know how to work with the methylation data, since each patient sample contains several probes that may be associated with a single gene. How can I integrate these data?
I was thinking of generating a matrix for the methylation data that holds the methylation level of each gene. Another possibility is to create a matrix with the methylation levels by gene region, promoter and body. In this case we would have a matrix containing each gene's methylation level in the promoter and in the body.
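The promoter/body idea could be sketched as follows in Python. This is only an illustration under stated assumptions: TSS200 and TSS1500 are treated as "promoter" and Body as "body" (other UCSC_RefGene_Group values are ignored), and the gene names and groups of a probe are taken to be parallel `;`-separated lists. The function name `region_level_beta` and all beta values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Assumption: TSS200/TSS1500 count as promoter; 5'UTR, 1stExon, 3'UTR are ignored.
REGION_OF_GROUP = {"TSS200": "promoter", "TSS1500": "promoter", "Body": "body"}

def region_level_beta(rows):
    """Average probe betas per gene and region for one sample.

    rows: list of (beta, UCSC_RefGene_Name, UCSC_RefGene_Group) tuples, where the
          two annotation strings are parallel ';'-separated lists.
    Returns {gene: {"promoter": mean beta, "body": mean beta}}, with a region
    present only if at least one probe falls in it.
    """
    buckets = defaultdict(set)  # (gene, region) -> {(row index, beta)}
    for i, (beta, names, groups) in enumerate(rows):
        for gene, group in zip(names.split(";"), groups.split(";")):
            region = REGION_OF_GROUP.get(group)
            if gene and region:
                # The set ensures a probe is counted once per gene/region pair.
                buckets[(gene, region)].add((i, beta))
    result = defaultdict(dict)
    for (gene, region), values in buckets.items():
        result[gene][region] = mean([beta for _, beta in values])
    return dict(result)

# Toy rows, made up for illustration.
rows = [
    (0.2, "RBL2", "TSS1500"),
    (0.4, "RBL2;RBL2", "TSS200;TSS200"),  # two transcripts, same probe
    (0.9, "RBL2", "Body"),
    (0.6, "GENE_X", "3'UTR"),             # ignored under the assumption above
]
regions = region_level_beta(rows)
print(regions)
```

Each gene then contributes up to two features (promoter and body) per sample instead of one.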
Once each data set has its own representation, I would run the integration, resulting in a single matrix. This merge would be based on the gene ID. After this, I would look for a statistical method to apply to this matrix. Then a co-expression network would be generated, and topological analysis would be applied to it to find genes associated with the disease, resulting in a gene prioritization process.
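One possible reading of this workflow, sketched in plain Python: z-score each gene's expression and methylation profile across samples, concatenate the two per gene, and connect gene pairs whose combined profiles are strongly correlated. The function names, the 0.8 threshold and the toy values are all assumptions for illustration, and both inputs are assumed to list the same samples in the same order. It is not the method from the cited articles.

```python
from itertools import combinations
from math import sqrt

def zscore(values):
    """Standardise one gene's profile across samples (all 0.0 if it is constant)."""
    n = len(values)
    mu = sum(values) / n
    sd = sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sd if sd else 0.0 for v in values]

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def concatenate_by_gene(expr, meth):
    """Keep genes present in both data sets and join their z-scored profiles."""
    shared = sorted(set(expr) & set(meth))
    return {g: zscore(expr[g]) + zscore(meth[g]) for g in shared}

def correlation_network(profiles, threshold=0.8):
    """Return (gene_a, gene_b, r) edges for pairs with |r| >= threshold."""
    edges = []
    for a, b in combinations(profiles, 2):
        r = pearson(profiles[a], profiles[b])
        if abs(r) >= threshold:
            edges.append((a, b, r))
    return edges

# Toy gene -> values-per-sample dicts (4 tumor samples), made up for illustration.
expr = {"GENE_A": [1, 2, 3, 4], "GENE_B": [2, 4, 6, 8],
        "GENE_C": [4, 3, 2, 1], "GENE_D": [1, 1, 1, 2]}
meth = {"GENE_A": [0.1, 0.2, 0.3, 0.4], "GENE_B": [0.2, 0.3, 0.4, 0.5],
        "GENE_C": [0.9, 0.8, 0.7, 0.6]}

profiles = concatenate_by_gene(expr, meth)
edges = correlation_network(profiles)
for a, b, r in edges:
    print(a, b, round(r, 3))
```

The resulting edge list can be handed to any graph library for the topological analysis; the threshold controls how dense the network is and would need to be chosen with care on real data.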
In the future, this could lead toward personalized medicine.
I based this on the following articles:
http://www.ncbi.nlm.nih.gov/pubmed/26490630
http://www.nature.com/nrg/journal/v16/n2/abs/nrg3868.html
So, I'd like to apply concatenation-based and/or transformation-based integration, or an early and/or intermediate integration process.
I don't understand the motivation for selecting the raw data here. You can work directly with the 450k data from TCGA; here is how to do it with the tool TCGA2STAT. From its documentation:
Import methylation expression
DNA methylation profiles were obtained via the HumanMethylation450 BeadChip. The 450k platform probes ~450,000 CpG sites. The package allows users to import either type of methylation data when available, via type="27K" (default) or type="450K"; here we use 450K. The example in the documentation gets the 450K methylation profiles for ovarian cancer patients.
Note that the matrix of the methylation profiles returned is NOT aggregated at the gene level. Each row in the data matrix (dat) is a probe from the methylation assay, which represents a CpG site. Since genes often contain more than one CpG site and each CpG site can differ significantly in the methylation level, gene level aggregation is less desirable. Hence, our package returns the probe-level data.
Look at the data
The above example is for OV data; you can do the same for breast cancer. The last matrix maps probes to genes, and you can take this data frame and merge the probes with the RNA-Seq data at the gene level to build one big matrix with
probes, gene id, chr, pos, gene exp (values for samples)
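A rough Python sketch of that merged table, assuming the probe annotation and the gene-level expression are already in dicts. The function name `probe_gene_table`, the sample IDs, the position and the expression values are placeholders made up for illustration; only the probe ID, chromosome and gene name come from the example in the question.

```python
def probe_gene_table(probe_annot, expr, samples):
    """Join probe annotation with gene-level expression on the gene ID.

    probe_annot: dict probe ID -> (gene, chr, pos)
    expr:        dict gene -> {sample ID: expression value}
    samples:     sample IDs, in the column order wanted
    Returns rows of [probe, gene, chr, pos, expr for each sample]; probes whose
    gene has no expression data are dropped (an inner join on gene ID).
    """
    rows = []
    for probe, (gene, chrom, pos) in sorted(probe_annot.items()):
        if gene in expr:
            values = [expr[gene].get(sample) for sample in samples]
            rows.append([probe, gene, chrom, pos] + values)
    return rows

# Placeholder data: positions and expression values are NOT real coordinates or measurements.
probe_annot = {
    "cg00000029": ("RBL2", "16", 1000),
    "cg_toy_1": ("RBL2", "16", 2000),
    "cg_toy_2": ("GENE_WITHOUT_EXPR", "1", 3000),
}
expr = {"RBL2": {"S1": 5.5, "S2": 7.25}}
samples = ["S1", "S2"]

table = probe_gene_table(probe_annot, expr, samples)
for row in table:
    print(row)
```

Because the join is at the probe level, a gene's expression values are repeated on every row for its probes, which keeps the CpG-level detail instead of collapsing it.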
There is no point in regenerating and reworking the data unless you want to build a pipeline of your own or use a custom methylation analysis pipeline. If you are not familiar with methylation analysis, you first have to learn what the data are, how they are generated and which tools are used to analyse them, and then see which tools let you build such matrices, with probe IDs mapped to gene IDs and normalized intensity values.
So for now you should be able to work directly with the normalized beta values from TCGA (level 3).
Hope that is clear.
Hi!
That's clear to me now! I will try it. I think it will work for us.
As for the motivation for using the raw data: in the system I have implemented, the user provides the data to analyse. The TCGA data were used just to test the system.
Thank you very much for your patience in explaining the concepts.
Just to add: if my answers and comments are useful to you, please accept them as answers and upvote them so that people can use them for future reference.