As a young bioinformatician, I have started to notice that, although genomics data comes in an immensely wide variety of formats, once you have successfully pre-processed it, most of it boils down to data about genomic locations.
For example, in a copy number analysis of tumor samples from breast cancer patients, the data could come from aCGH, SNP arrays or whole-genome sequencing, on platforms such as Affymetrix, NimbleGen, Illumina and ABI SOLiD. After processing, irrespective of the platform, you end up with log R ratios (LRR) at genomic positions, which allows you to integrate the various platforms.
My question is: what is the commonly used, community-accepted (preferably Bioconductor) data object that can store genomic positional data and that is accepted by many downstream analysis programs? For example, for the copy number study, you would like to put the log R ratios at genomic positions across the genome, for all tumor samples from the different platforms, into one data object. Then you would like to draw a karyogram of the LRR values of each sample as a heatmap or dot plot. You would also like to make a frequency plot of the segmentations of all samples to identify common CNV segments, and do hierarchical clustering on the segmentations of all samples to identify subgroups in the patient cohort.
I have noticed that the GRanges object from GenomicRanges might serve as a flexible container for data on genomic positions. For example, the package ggbio accepts a GRanges object to draw karyograms. But I am unaware of other packages that support GRanges/RangedData objects. (So is this the thing I am looking for?)
Ideally, I would like to find a central Bioconductor object (is that GRanges?) that flexibly stores data on genomic positions (copy number, B allele frequency, methylation, transcription factor binding sites, expression, etc.) and is supported by many downstream analysis packages for:
- ideogram/karyogram visualization (maybe even along with expression or methylation data)
- hierarchical clustering
- finding overlaps of genomic positions between samples / making frequency plots
- gene set enrichment analysis
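
To make the "frequency plot" item concrete, here is a minimal, language-neutral sketch in plain Python (deliberately not Bioconductor code) of what I mean. The `Segment` record, the `gain_frequency` helper, the half-open coordinates and the 0.2 gain threshold are all assumptions made up for illustration; they are not part of any package. The idea is to cut each chromosome at every segment boundary and count the fraction of samples that carry a gain in each resulting interval.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    """One copy number segment (hypothetical record, half-open interval)."""
    sample: str
    chrom: str
    start: int    # 0-based, inclusive
    end: int      # exclusive
    log_r: float  # mean log R ratio of the segment


def gain_frequency(segments, threshold=0.2):
    """Return {chrom: [(start, end, fraction_of_samples_with_gain), ...]}.

    `threshold` is an arbitrary illustrative cutoff for calling a gain.
    """
    samples = {s.sample for s in segments}
    gained = defaultdict(list)
    for s in segments:
        if s.log_r >= threshold:
            gained[s.chrom].append(s)

    result = {}
    for chrom, segs in gained.items():
        # Every segment boundary is a point where the count can change.
        cuts = sorted({p for s in segs for p in (s.start, s.end)})
        rows = []
        for lo, hi in zip(cuts, cuts[1:]):
            # Samples whose gained segment fully covers this interval.
            carriers = {s.sample for s in segs if s.start <= lo and hi <= s.end}
            if carriers:
                rows.append((lo, hi, len(carriers) / len(samples)))
        result[chrom] = rows
    return result


segments = [
    Segment("T1", "chr1", 100, 500, 0.45),
    Segment("T2", "chr1", 300, 700, 0.30),
    Segment("T3", "chr1", 0, 1000, -0.02),  # no gain, still counts as a sample
]
freq = gain_frequency(segments)
for start, end, fraction in freq["chr1"]:
    print("chr1", start, end, round(fraction, 2))
```

In this toy example the interval 300-500 is gained in two of the three samples. What I am looking for is a Bioconductor object, and packages built on it, that would let me do this kind of operation without writing it by hand.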
Hi Steve, thanks for your suggestion about the SummarizedExperiment class and the Gviz package; I will look into them. BTW, I just saw that you can also do CBS-segmentation on GenomicRanges objects with a new BioC package called "fastseg".
For CGH data, you might also check out the genoset package.