Genomic Data Integration - Where To Find Good Information?
13.5 years ago
Brth ▴ 50

For a while, I've been working on "genomic data integration", and I have been planning to put together a literature review, or at least an overview of what has been done in the field, what the major challenges are, and where it is heading.

For some reason, even though this is a common topic for many research groups, I have not been able to find good resources about it. For example, I would be interested in representative articles on genomic data integration research (not just articles where data has been integrated, but ones where data integration itself has been studied), or even good reviews of the topic.

What I mean by "genomic data integration" is how to integrate, in general, different types of data (e.g. DNA/RNA/protein/metabolomics), data from different technology platforms (e.g. Illumina and Affymetrix gene expression microarrays), cross-species data, or data from different genomic databases.

I would love to hear your opinions about these topics, and would really appreciate it if anyone could point me towards:

  • Primary research about data integration
  • Review articles
  • Or even representative examples of how data integration has been successfully applied

If you want to share your own experience or opinions, I would be interested to hear your thoughts on the topic in general, but also on:

  • Major challenges
  • Best methods/tools
  • Ideas about what the future holds for the field
  • Who the "gurus" in the field are, i.e. whose work should I especially look into?

Thanks!

data genomics • 4.2k views

community wiki ?


This approach is a bit beyond me, but this paper is pretty cool.

13.5 years ago

When I hear about data integration in general, the first thing that comes to my mind is the Resource Description Framework (RDF), the data model behind the "semantic web", where every piece of data is defined as a triple

(subject-predicate-object)

for example:

<http://www.ncbi.nlm.nih.gov/omim/102500>
<http://purl.org/dc/elements/1.1/title>
"CHENEY SYNDROME"
.
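To make the triple model a bit more concrete, here is a minimal sketch in plain Python. It is not a real RDF library: each statement is simply a (subject, predicate, object) tuple, and a query is a pattern in which `None` acts as a wildcard. The `match` helper and the in-memory list are illustrative assumptions; a real triple store would normally be queried with SPARQL.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# The OMIM statement from the example above, as one (subject, predicate, object) tuple.
TRIPLES: List[Triple] = [
    (
        "http://www.ncbi.nlm.nih.gov/omim/102500",
        "http://purl.org/dc/elements/1.1/title",
        "CHENEY SYNDROME",
    ),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None means 'anything'."""
    for s, p, o in triples:
        if subject is not None and s != subject:
            continue
        if predicate is not None and p != predicate:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, p, o)


# Ask: "what is the title of OMIM entry 102500?"
titles = [
    o
    for _, _, o in match(
        TRIPLES,
        subject="http://www.ncbi.nlm.nih.gov/omim/102500",
        predicate="http://purl.org/dc/elements/1.1/title",
    )
]
print(titles)
```

The point of the model is that data from any source can be appended to the same list of triples and queried with the same pattern mechanism, as long as the sources agree on the URIs.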

For a tutorial about RDF: http://www.w3.org/TR/rdf-syntax/

A nice example of an RDF database is Bio2RDF: http://bio2rdf.org

http://www.ncbi.nlm.nih.gov/pubmed/18472304

J Biomed Inform. 2008 Oct;41(5):706-16. Epub 2008 Mar 21.

Bio2RDF: towards a mashup to build bioinformatics knowledge systems.


See also: the SADI framework:

http://www.ncbi.nlm.nih.gov/pubmed/21210986

BMC Bioinformatics. 2010 Dec 21;11 Suppl 12:S7.

SADI, SHARE, and the in silico scientific method.

Wilkinson MD, McCarthy L, Vandervalk B, Withers D, Kawas E, Samadian S.

"Major challenges": it doesn't scale well to huge datasets. For example, I don't think it would be a good idea to fill an RDF datastore with the 1000 Genomes data or dbSNP.

"Ideas about what the future holds for the field": scalability, i.e. being able to store a large set of triples and query it, and finding a common identifier for every biological entity.

"Gurus": https://twitter.com/#!FrancoisBelleau https://twitter.com/#!/kidehen , https://twitter.com/#!/danja , https://twitter.com/#!/danbri etc...

"Best methods/tools": as far as I know, http://virtuoso.openlinksw.com/ . http://jena.sourceforge.net/ is also a well-known tool.

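As a rough illustration of the common-identifier problem mentioned above, the sketch below rewrites source-specific identifiers to one canonical URI before merging records. Everything in it is hypothetical: the namespace aliases, the `http://example.org/` URI pattern and the sample records are made up for illustration and are not the scheme used by any particular database.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical aliases: different sources may name the same namespace differently.
NAMESPACE_ALIASES: Dict[str, str] = {
    "omim": "omim",
    "mim": "omim",
    "geneid": "geneid",
    "entrezgene": "geneid",
}


def canonical_uri(identifier: str) -> str:
    """Map e.g. 'MIM:102500' and 'omim:102500' to the same (made-up) URI."""
    prefix, _, local_id = identifier.partition(":")
    namespace = NAMESPACE_ALIASES[prefix.strip().lower()]
    return f"http://example.org/{namespace}:{local_id.strip()}"


def merge(records: List[Tuple[str, str, str]]) -> Dict[str, Dict[str, str]]:
    """Group (identifier, field, value) records under their canonical URI."""
    merged: Dict[str, Dict[str, str]] = defaultdict(dict)
    for identifier, field, value in records:
        merged[canonical_uri(identifier)][field] = value
    return dict(merged)


# Two invented records that describe the same entry under different prefixes.
records = [
    ("MIM:102500", "title", "CHENEY SYNDROME"),
    ("omim:102500", "source", "database B"),
]
merged = merge(records)
print(merged)
```

Once both records resolve to the same URI they collapse into a single entry, which is what makes joins across databases possible in the first place.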

+1 for introducing me to academic use of 'mashup'. R

13.5 years ago

The mixOmics/integrOmics package is a nice collection of statistical methods, written in R, that can be applied to dataset integration:

http://www.math.univ-toulouse.fr/~biostat/mixOmics/Introduction.html

A couple of other papers about sparse partial least squares and sparse canonical correlation methods applied to this problem:

http://www.biomedcentral.com/1471-2105/10/34

http://www.ncbi.nlm.nih.gov/pubmed/19572827

13.5 years ago
Quariel • 0

Hi,

I work in the computational genomics program at the Center for Genomic Sciences at UNAM, Mexico. I hope this paper gives you an idea of our work; I think it is a great example of integration.

Nucleic Acids Res. 2011 Jan;39(Database issue):D98-105. Epub 2010 Nov 4.

RegulonDB version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (Gensor Units).

Gama-Castro S, Salgado H, Peralta-Gil M, Santos-Zavaleta A, Muñiz-Rascado L, Solano-Lira H, Jimenez-Jacinto V, Weiss V, García-Sotelo JS, López-Fuentes A, Porrón-Sotelo L, Alquicira-Hernández S, Medina-Rivera A, Martínez-Flores I, Alquicira-Hernández K, Martínez-Adame R, Bonavides-Martínez C, Miranda-Ríos J, Huerta AM, Mendoza-Vargas A, Collado-Torres L, Taboada B, Vega-Alvarado L, Olvera M, Olvera L, Grande R, Morett E, Collado-Vides J.

Source: Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, AP 565-A, Cuernavaca, Morelos 62100, México.

Abstract:

RegulonDB ( http://regulondb.ccg.unam.mx/ ) is the primary reference database of the best-known regulatory network of any free-living organism, that of Escherichia coli K-12. The major conceptual change since 3 years ago is an expanded biological context so that transcriptional regulation is now part of a unit that initiates with the signal and continues with the signal transduction to the core of regulation, modifying expression of the affected target genes responsible for the response. We call these genetic sensory response units, or Gensor Units. We have initiated their high-level curation, with graphic maps and superreactions with links to other databases. Additional connectivity uses expandable submaps. RegulonDB has summaries for every transcription factor (TF) and TF-binding sites with internal symmetry. Several DNA-binding motifs and their sizes have been redefined and relocated. In addition to data from the literature, we have incorporated our own information on transcription start sites (TSSs) and transcriptional units (TUs), obtained by using high-throughput whole-genome sequencing technologies. A new portable drawing tool for genomic features is also now available, as well as new ways to download the data, including web services, files for several relational database manager systems and text files including BioPAX format.

13.5 years ago

I hate to be that guy posting self-links, but the first chapter of my thesis is a review of detection and integration of genomic data in the context of cancer. You might find it useful - especially the second half. http://chrisamiller.com/science/Miller_Chapter1.pdf
