Question

Genomic Data Integration - Where To Find Good Information?

5

14.2 years ago

Brth ▴ 50

For a while, I've been working with "genomic data integration", and I have been planning to put together a literature review, or some kind of look into what has been done in the field, what the major challenges and future directions are, etc.

For some reason, even though this is a common topic for many research groups, I have not been able to find good information resources about it. For example, I would be interested in representative articles about genomic data integration research (not just articles where data has been integrated, but articles where data integration itself has been studied), or even good reviews of the topic.

What I mean by "genomic data integration" is how to integrate different types of data in general, e.g. DNA/RNA/protein/metabolomics data, data from different technology platforms (e.g. Illumina and Affymetrix gene expression microarrays), cross-species data, or data from different genomic databases.

I would love to hear your opinions about these topics, and would really appreciate it if anyone could point me towards:

Primary research about data integration
Review articles
Or even representative examples of how data integration has been successfully applied

If you want to share your own experience/opinions, I would be interested to hear your thoughts on the topic in general, and also about:

Major challenges
Best methods/tools
Ideas about what the future holds for the field
Who are the "gurus" in the field, i.e. whose work should I especially look into?

Thanks!

data genomics • 4.6k views

updated 14.2 years ago by Chris Miller 22k • written 14.2 years ago by Brth ▴ 50

0

community wiki ?

14.2 years ago by Pierre Lindenbaum 166k

0

This approach is a bit beyond me, but this paper is pretty cool.

updated 5.9 years ago by Ram 45k • written 14.2 years ago by Russh ★ 1.2k

Ram · Answer 1 · 2011-06-09

When I hear about data integration (in general), the first thing that comes to my mind is the Resource Description Framework (RDF) (aka the "semantic web"), where all the data are defined as triples

(subject-predicate-object)

for example:

<http://www.ncbi.nlm.nih.gov/omim/102500>
<http://purl.org/dc/elements/1.1/title>
"CHENEY SYNDROME"
.
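As a rough illustration of the triple model (not of any particular RDF library), the statement above can be held as a plain (subject, predicate, object) tuple and queried by pattern. The `match` helper and the in-memory list are assumptions made for this sketch; a real RDF store would be queried with a language such as SPARQL instead.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# The example statement, stored as one (subject, predicate, object) tuple.
TRIPLES: List[Triple] = [
    (
        "http://www.ncbi.nlm.nih.gov/omim/102500",
        "http://purl.org/dc/elements/1.1/title",
        "CHENEY SYNDROME",
    ),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple


# Ask for the object of every triple whose predicate is the Dublin Core title.
titles = [o for _, _, o in match(TRIPLES, p="http://purl.org/dc/elements/1.1/title")]
print(titles)
```

Because every fact has the same three-part shape, data from unrelated sources can be appended to the same list and queried with the same pattern matcher, which is the appeal of RDF for integration.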

For a tutorial about RDF: http://www.w3.org/TR/rdf-syntax/

A nice example of an RDF database is Bio2RDF: http://bio2rdf.org

http://www.ncbi.nlm.nih.gov/pubmed/18472304

J Biomed Inform. 2008 Oct;41(5):706-16. Epub 2008 Mar 21.

Bio2RDF: towards a mashup to build bioinformatics knowledge systems.

See also: the SADI framework:

http://www.ncbi.nlm.nih.gov/pubmed/21210986

BMC Bioinformatics. 2010 Dec 21;11 Suppl 12:S7. SADI, SHARE, and the in silico scientific method. Wilkinson MD, McCarthy L, Vandervalk B, Withers D, Kawas E, Samadian S.

"major challenge": It doesn't scale to huge datasets. For example, I don't think it would be a good idea to fill an RDF datastore with the 1000 Genomes data or dbSNP.

"Ideas about what future holds for the field": scalability (being able to store a large set of triples and query it), and finding a common identifier for all biological entities.

"Gurus": https://twitter.com/#!FrancoisBelleau , https://twitter.com/#!/kidehen , https://twitter.com/#!/danja , https://twitter.com/#!/danbri , etc.

"Best methods/tools": as far as I know, Virtuoso ( http://virtuoso.openlinksw.com/ ). Jena ( http://jena.sourceforge.net/ ) is also a well-known tool.
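To make the "common identifier" point above concrete, here is a minimal sketch of why it matters: records from two sources can only be merged once their local IDs are mapped to one shared ID. All identifiers, field names and values below are made-up placeholders, and `integrate` is a hypothetical helper, not part of any tool mentioned here.

```python
from collections import defaultdict

# Hypothetical records from two sources, each using its own identifier scheme.
SOURCE_A = {"A:001": {"expression": 2.5}, "A:002": {"expression": 0.4}}
SOURCE_B = {"B:x9": {"variants": 3}, "B:y7": {"variants": 0}}

# Hypothetical cross-reference table mapping local IDs to one shared ID.
XREF = {"A:001": "GENE1", "B:x9": "GENE1", "A:002": "GENE2"}


def integrate(xref, *sources):
    """Merge records under their shared ID; collect local IDs with no mapping."""
    merged = defaultdict(dict)
    unmapped = []
    for source in sources:
        for local_id, record in source.items():
            shared_id = xref.get(local_id)
            if shared_id is None:
                unmapped.append(local_id)  # cannot be integrated without a mapping
            else:
                merged[shared_id].update(record)
    return dict(merged), unmapped


merged, unmapped = integrate(XREF, SOURCE_A, SOURCE_B)
print(merged)
print(unmapped)
```

Anything that lands in `unmapped` is simply lost to the integrated view, which is why a shared identifier scheme is listed as an open problem.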

Ram · Answer 2 · 2011-06-09

The mixOmics/integrOmics package is a nice collection of statistical methods, written in R, that can be applied to dataset integration:

http://www.math.univ-toulouse.fr/~biostat/mixOmics/Introduction.html

A couple of other papers about sparse partial least squares and sparse canonical correlation methods applied to this problem:

http://www.biomedcentral.com/1471-2105/10/34

http://www.ncbi.nlm.nih.gov/pubmed/19572827

Pierre Lindenbaum · Answer 3 · 2011-06-09

Hi,

I'm working in the Computational Genomics Program at the Center for Genomic Sciences at UNAM, Mexico. I hope this paper gives you an idea of our work; I think it is a great example of integration.

Nucleic Acids Res. 2011 Jan;39(Database issue):D98-105. Epub 2010 Nov 4.

RegulonDB version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (Gensor Units).

Gama-Castro S, Salgado H, Peralta-Gil M, Santos-Zavaleta A, Muñiz-Rascado L, Solano-Lira H, Jimenez-Jacinto V, Weiss V, García-Sotelo JS, López-Fuentes A, Porrón-Sotelo L, Alquicira-Hernández S, Medina-Rivera A, Martínez-Flores I, Alquicira-Hernández K, Martínez-Adame R, Bonavides-Martínez C, Miranda-Ríos J, Huerta AM, Mendoza-Vargas A, Collado-Torres L, Taboada B, Vega-Alvarado L, Olvera M, Olvera L, Grande R, Morett E, Collado-Vides J.

Source: Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, AP 565-A, Cuernavaca, Morelos 62100, México.

Abstract:

RegulonDB ( http://regulondb.ccg.unam.mx/ ) is the primary reference database of the best-known regulatory network of any free-living organism, that of Escherichia coli K-12. The major conceptual change since 3 years ago is an expanded biological context so that transcriptional regulation is now part of a unit that initiates with the signal and continues with the signal transduction to the core of regulation, modifying expression of the affected target genes responsible for the response. We call these genetic sensory response units, or Gensor Units. We have initiated their high-level curation, with graphic maps and superreactions with links to other databases. Additional connectivity uses expandable submaps. RegulonDB has summaries for every transcription factor (TF) and TF-binding sites with internal symmetry. Several DNA-binding motifs and their sizes have been redefined and relocated. In addition to data from the literature, we have incorporated our own information on transcription start sites (TSSs) and transcriptional units (TUs), obtained by using high-throughput whole-genome sequencing technologies. A new portable drawing tool for genomic features is also now available, as well as new ways to download the data, including web services, files for several relational database manager systems and text files including BioPAX format.

score 0 · Answer 4 · 2011-06-09

0

14.2 years ago

Chris Miller 22k

I hate to be that guy posting self-links, but the first chapter of my thesis is a review of detection and integration of genomic data in the context of cancer. You might find it useful - especially the second half. http://chrisamiller.com/science/Miller_Chapter1.pdf