Question

Generic Bioconductor Object To Integrate Different Genomics Data Formats/Platforms

6

12.2 years ago

Irsan ★ 7.8k

As a young bioinformatician, I am starting to notice that although genomics data comes in an immensely wide variety of formats, once you have successfully pre-processed it, most of it boils down to data about genomic locations.

For example, consider a copy number analysis study on tumor samples from breast cancer patients. The data could come from aCGH, SNP arrays, or WG-seq on different platforms such as Affymetrix, NimbleGen, Illumina, and ABI SOLiD. After processing, irrespective of the platform, you end up with log R ratios (LRR) at genomic positions, which allows you to integrate the various platforms.

My question: what is the commonly used, (preferably) community-accepted Bioconductor data object that can store genomic positional data and is accepted by many downstream analysis programs? For the copy number study, for example, you would want to put the log R ratios at genomic positions across the genome, for all tumor samples from the different platforms, into one data object. You would then want to draw a karyogram of each sample's LRR values as a heatmap or dot plot, make a frequency plot of the segmentations of all samples to identify common CNV segments, and run hierarchical clustering on those segmentations to identify subgroups in the patient cohort.

I have noticed that the GRanges object from GenomicRanges might serve as a dynamic container for data on genomic positions. For example, the ggbio package accepts a GRanges object to draw karyograms. But I am unaware of other packages that support GRanges/RangedData objects. So is this what I am looking for?

Ideally, I would like to find a central Bioconductor object (is that GRanges?) that flexibly stores data on genomic positions (copy number, B allele frequency, methylation, transcription factor binding sites, expression, etc.) and is supported by many downstream analysis packages for:

ideogram/karyogram visualization (maybe even alongside expression or methylation data)
hierarchical clustering
finding overlaps of genomic positions between samples / making frequency plots
gene set enrichment analysis
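To make the "overlaps between samples / frequency plot" item concrete, here is a minimal, language-neutral sketch in plain Python. It is not the GenomicRanges API: the `Segment` class, the `gain_frequency` helper, the 0.2 gain threshold, the bin size, and the toy coordinates are all hypothetical and exist only to illustrate the idea of counting, per genomic bin, how many samples carry a gained segment.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One copy-number segment; coordinates are 1-based and inclusive."""
    sample: str
    chrom: str
    start: int
    end: int
    mean_lrr: float


def overlaps(a: Segment, b: Segment) -> bool:
    """True when both segments are on the same chromosome and share a base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def gain_frequency(segments, chrom, bin_size, chrom_length, threshold=0.2):
    """Fraction of samples with a gained segment touching each fixed-width bin.

    The threshold is an illustrative assumption, not a recommended cutoff.
    """
    samples = {s.sample for s in segments}
    counts = Counter()
    for bin_start in range(1, chrom_length + 1, bin_size):
        bin_end = min(bin_start + bin_size - 1, chrom_length)
        window = Segment("bin", chrom, bin_start, bin_end, 0.0)
        hit = {
            s.sample
            for s in segments
            if s.mean_lrr >= threshold and overlaps(s, window)
        }
        counts[bin_start] = len(hit)
    return {start: n / len(samples) for start, n in counts.items()}


# Toy data: three samples, made-up coordinates on one chromosome.
segments = [
    Segment("tumor_A", "chr8", 1, 250, 0.45),
    Segment("tumor_B", "chr8", 101, 400, 0.38),
    Segment("tumor_C", "chr8", 1, 400, -0.02),
]
freq = gain_frequency(segments, "chr8", bin_size=100, chrom_length=400)
```

The resulting dictionary maps each bin start to the fraction of samples with a gain there, which is the quantity a frequency plot would show. A real range container would do the overlap step with an interval index rather than this quadratic scan.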

bioconductor • 3.4k views

updated 12.2 years ago by Steve Lianoglou 5.2k • written 12.2 years ago by Irsan ★ 7.8k

score 3 · Answer 1 · 2012-11-01

3

12.2 years ago

Steve Lianoglou 5.2k

The GenomicRanges package is the right place to look.

There is also some effort to make the SummarizedExperiment class in the GenomicRanges package this "universal container of assay data over genomic ranges," but I'm not sure you'll find any one data structure that plugs into all the analyses you are after, as all of this is relatively new.
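As a rough illustration of what a "container of assay data over genomic ranges" means, here is a toy sketch in plain Python. It is not the SummarizedExperiment class or its API: `RangedAssays`, its methods, and the example values are hypothetical. The only point is the layout, with rows as genomic ranges, columns as samples, and any number of equally shaped assay matrices (for instance LRR and B allele frequency) that stay aligned when you subset by range.

```python
from dataclasses import dataclass, field


@dataclass
class RangedAssays:
    """Hypothetical toy container: rows are ranges, columns are samples."""
    ranges: list   # (chrom, start, end) tuples, 1-based inclusive, one per row
    samples: list  # sample names, one per column
    assays: dict = field(default_factory=dict)  # assay name -> list of rows

    def add_assay(self, name, matrix):
        """Store a matrix after checking it matches ranges x samples."""
        if len(matrix) != len(self.ranges):
            raise ValueError("one row per range is required")
        if any(len(row) != len(self.samples) for row in matrix):
            raise ValueError("one column per sample is required")
        self.assays[name] = matrix

    def subset_by_overlap(self, chrom, start, end):
        """Keep rows overlapping the query; every assay is subset the same way."""
        keep = [
            i
            for i, (c, s, e) in enumerate(self.ranges)
            if c == chrom and s <= end and start <= e
        ]
        sub = RangedAssays([self.ranges[i] for i in keep], list(self.samples))
        for name, matrix in self.assays.items():
            sub.add_assay(name, [matrix[i] for i in keep])
        return sub


# Toy data: three ranges, two samples, two assays with made-up values.
se = RangedAssays(
    ranges=[("chr1", 1, 1000), ("chr1", 1001, 2000), ("chr2", 1, 1000)],
    samples=["tumor_A", "tumor_B"],
)
se.add_assay("lrr", [[0.4, 0.1], [0.5, -0.3], [0.0, 0.0]])
se.add_assay("baf", [[0.5, 0.5], [0.7, 0.5], [0.5, 0.5]])
chr1_hit = se.subset_by_overlap("chr1", 900, 1500)
```

Keeping the coordinates in one place and the per-sample values in parallel matrices is what lets the same object feed range-based steps (overlaps, plotting along the genome) and matrix-based steps (clustering).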

In addition to the ggbio package for visualization, you might want to explore whether the Gviz package has any functionality you would find helpful.

0

Hi Steve, thanks for your suggestions about the SummarizedExperiment class and the Gviz package; I will look into them. BTW, I just saw that you can also do CBS segmentation on GenomicRanges objects with a new BioC package called "fastseg".