Summarizing Personal Genomics Data From A Large Number Of Individuals
4
3
14.3 years ago
Allpowerde ★ 1.3k

Hi, visualizing genomic data for a single individual is pretty straightforward (discussion here). A few individuals can still be managed, but once you have more individuals than can be neatly fitted as tracks on the screen, it gets tricky.

In this case, one can summarize the data by saying "1000 individuals had this SNP, while 600 had this one", or visualize it as SNP-hotspot tracks. But isn't there a better way to summarize/visualize it, especially given that one feature or data set is never enough? You end up comparing X HapMap individuals with the individuals of the 1000 Genomes Project on a multitude of features like SNPs, CNVs, SVs ...

Does anyone have a good set of tools/concepts for this problem?

next-gen sequencing hapmap genome • 3.0k views
3
14.3 years ago

There's no way to visualize everything. It would probably help to step back and ask yourself, "what exactly is the point of this study?" and "what questions are we trying to answer?". These should drive the type of visualization and analysis that you perform.

If these are all individuals with a specific disorder, you may be looking for unusual and recurrently altered genes or pathways. So find a way to highlight these genes. What about a simple plot, where you put the genomic coordinates on the x-axis and the frequency of mutations in each gene on the y-axis? That should let you easily identify highly mutated genes.
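As a rough sketch of how the data behind such a plot could be prepared: the snippet below counts, for each gene, the fraction of individuals carrying at least one mutation, then orders genes by coordinate. The input layout, the gene names, and the coordinates are all made up for illustration, and chromosomes are sorted as plain strings, which is a simplification.

```python
from collections import defaultdict

# Hypothetical input: one (individual_id, gene) pair per observed mutation.
mutations = [
    ("ind1", "GENE_A"), ("ind2", "GENE_A"), ("ind3", "GENE_A"),
    ("ind1", "GENE_B"),
    ("ind2", "GENE_C"), ("ind3", "GENE_C"),
]
# Hypothetical gene start coordinates, used only to order the x-axis.
gene_start = {
    "GENE_A": ("chr1", 1_000),
    "GENE_B": ("chr1", 50_000),
    "GENE_C": ("chr2", 7_500),
}
n_individuals = 3


def gene_mutation_frequency(mutations, n_individuals):
    """Fraction of individuals carrying at least one mutation in each gene."""
    carriers = defaultdict(set)
    for individual, gene in mutations:
        carriers[gene].add(individual)  # a set avoids double-counting an individual
    return {gene: len(ids) / n_individuals for gene, ids in carriers.items()}


freq = gene_mutation_frequency(mutations, n_individuals)

# Sort by (chromosome, position) so the rows can go straight onto the x-axis.
plot_points = sorted((gene_start[g], g, f) for g, f in freq.items())
for (chrom, pos), gene, f in plot_points:
    print(f"{chrom}:{pos}\t{gene}\t{f:.2f}")
```

The resulting rows can be handed to whatever plotting tool you prefer; counting carriers rather than raw mutations keeps one heavily mutated individual from dominating a gene.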

What about doing some pathway analysis: which KEGG pathways or GO terms are overrepresented? Figures like this one can help summarize those relationships. Using something like Cytoscape, you can create heat maps, showing a whole pathway and coloring specific members according to how frequently they're altered.
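One common way to ask whether a pathway is overrepresented is a one-sided hypergeometric test. The sketch below is a minimal version of that idea, not what any particular KEGG or GO tool does; the function name and the example counts are assumptions, and a real analysis would also need multiple-testing correction across pathways.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_pathway, n_altered, n_overlap):
    """P(X >= n_overlap) when n_altered genes are drawn without replacement
    from n_background genes, of which n_in_pathway belong to the pathway."""
    total = comb(n_background, n_altered)
    upper = min(n_in_pathway, n_altered)
    tail = sum(
        comb(n_in_pathway, k) * comb(n_background - n_in_pathway, n_altered - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20000 background genes, 100 in the pathway,
# 200 recurrently altered genes, 10 of which fall in the pathway.
p_value = hypergeom_enrichment_p(20_000, 100, 200, 10)
print(f"enrichment p-value: {p_value:.3g}")
```

A small p-value here only says the overlap is larger than expected by chance under this simple model; the choice of background gene set matters a lot.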

Are you looking at structural information, identifying breakpoints of rearrangements? Circos is a nice tool for visualizing this, especially if there are intrachromosomal translocations.

Bottom line: pretty pictures are nice, but what's important is that they give you some insight into the system you're studying, so start there.

2
14.3 years ago

I have done barely any work on such data, but the first thing that comes to my mind is "dimensionality reduction". There is no way that you can visualize the full data on 1000 individuals. Conversely, the type of summary statistics that you mention may throw away too much detail. I could imagine using methods such as principal component analysis, independent component analysis, or multi-dimensional scaling to capture as much of the data as possible in as few dimensions as possible.
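To make the PCA idea a bit more tangible, here is a small dependency-free sketch that extracts the first principal component of a genotype matrix by power iteration. The matrix (rows are individuals, columns are variants, values are alternate-allele counts 0/1/2) is toy data, and the function names are made up; for real data sets a numerical library would be the practical choice.

```python
import random


def center_columns(matrix):
    """Subtract each column (variant) mean so PCA captures variation, not the mean."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def project(rows, v):
    """Dot product of every row with the vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in rows]


def first_principal_component(matrix, n_iter=200, seed=0):
    """Power iteration on X^T X; returns (unit loading vector, per-individual scores)."""
    x = center_columns(matrix)
    rng = random.Random(seed)
    v = [rng.random() for _ in x[0]]
    for _ in range(n_iter):
        xv = project(x, v)                                   # X v
        w = [sum(s * row[j] for s, row in zip(xv, x))        # X^T (X v)
             for j in range(len(v))]
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0:  # the data has no variation at all
            break
        v = [c / norm for c in w]
    return v, project(x, v)


# Toy data: six individuals, four variants, two obvious groups.
genotypes = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 2, 1],
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [2, 2, 0, 1],
]
loadings, scores = first_principal_component(genotypes)
for i, s in enumerate(scores):
    print(f"individual {i}: PC1 score {s:+.2f}")
```

Plotting the scores of the first two or three components against each other is the usual way to get all individuals onto a single figure.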

Sorry that I cannot suggest anything more concrete than that.

0

Hi Lars, I probably did not make this clear in my question: I'm not talking about data analysis (e.g. association testing to find a candidate SNP for a disease). I'm just talking about taking stock of the data I have in the context of other data sets.

2
14.3 years ago

For this problem, I'm using a key/value datastore (BerkeleyDB).

  • The key is a position on the genome
  • The value is an array of genotypes for 'N' individuals.

Using this table, you can quickly compare and query such large tables.
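The layout above could look roughly like the following. This sketch uses Python's standard `dbm` module as a stand-in for BerkeleyDB, and the key format, the one-byte-per-individual encoding, and the helper names are all assumptions made for illustration.

```python
import dbm
import os
import tempfile


def make_key(chrom, pos):
    """Key = genomic position. Zero-padding would keep keys in genomic order
    in a sorted (B-tree) store; dbm itself does not keep keys ordered."""
    return f"{chrom}:{pos:010d}".encode()


def encode_genotypes(genotypes):
    """One byte per individual: 0, 1, 2 = alt-allele count, 255 = missing call."""
    return bytes(genotypes)


def decode_genotypes(blob):
    return list(blob)


db_path = os.path.join(tempfile.mkdtemp(), "genotypes")

# Write: one record per position, holding the genotypes of N = 4 individuals.
with dbm.open(db_path, "c") as db:
    db[make_key("chr1", 12345)] = encode_genotypes([0, 1, 2, 0])
    db[make_key("chr1", 67890)] = encode_genotypes([2, 2, 255, 1])

# Query: fetch one position and count how many individuals carry the variant.
with dbm.open(db_path, "r") as db:
    calls = decode_genotypes(db[make_key("chr1", 12345)])
    n_carriers = sum(1 for g in calls if g in (1, 2))

print(calls, n_carriers)
```

Because every individual sits at a fixed index in the value array, comparing cohorts at a position comes down to a single lookup plus slicing.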

HDF5 is also an option AFAIK.

1
14.3 years ago
Casbon ★ 3.3k

http://browser.1000genomes.org/index.html

The 1000 Genomes (1kg) browser is extending Ensembl to handle this level of variation. However, there doesn't seem to be much in the way of releases.

