Question

How to plot a histogram for billions of genotype quality values?

1

9.0 years ago

William ★ 5.3k

How can I best plot a histogram for billions of genotype quality values?

I have a simple one column file with billions of genotype quality values. The file is several GB uncompressed.

Is there a statistics library in Python or R that can build up a histogram by streaming through the data, instead of loading everything into memory and then creating the histogram? I prefer using all of the data versus sampling it.

Or do I have to write a script first to collect the counts per bin and then give those counts per bin to R for plotting? This functionality feels like it should already exist in a stats library somewhere.

I know the min and max of the values and would be able to specify a bin size.

vcf quality R python • 3.1k views

updated 9.0 years ago by biocyberman ▴ 870 • written 9.0 years ago by William ★ 5.3k

0

What kind of graph do you need? qual=f(pos)?

9.0 years ago by Pierre Lindenbaum 165k

0

I've always been intrigued by the potential of using a data log visualizer like RRD (see http://oss.oetiker.ch/rrdtool/index.en.html) to visualize genomic data, with the genome coordinate taking the place of time.

9.0 years ago by Istvan Albert 102k

0

I prefer using all of the data versus sampling it

Could you motivate this? If you want to produce a histogram to visualize the distribution of quality values, I don't see why you need billions of data points, especially since the range of quality values is discrete and not very large. After you have collected a million or so data points at random, you have a pretty good estimate of the whole thing.
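For what it's worth, one standard way to draw such a random sample in a single pass, without knowing the number of lines in advance, is reservoir sampling (Algorithm R). This is only a sketch of that technique, not something specified in the comment above; the function name `reservoir_sample` and the sample size are assumptions for illustration.

```python
import random


def reservoir_sample(values, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Only the k sampled items are held in memory (Algorithm R).
    """
    rng = rng or random.Random()
    sample = []
    for i, value in enumerate(values):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(value)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = value
    return sample


# Hypothetical usage: sample a million values from a one-column file.
# with open("gq_values.txt") as handle:   # file name is an assumption
#     subset = reservoir_sample((float(line) for line in handle if line.strip()), 1_000_000)
```

The sample still requires one full read of the file, so it saves memory rather than I/O.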

9.0 years ago by dariober 15k

0

Sampling also adds complexity and the risk of doing it wrong. At the very least, I would need to think about how to do it correctly. Given how often the analysis is run and the size of the data, processing the whole collection is still feasible.

9.0 years ago by William ★ 5.3k

score 3 · Accepted Answer · 2016-03-09

Or do I have to write a script first to collect the counts per bin and then give those counts per bin to R for plotting? This functionality feels like it should already exist in a stats library somewhere.

This is what I'd do. You're right that it might exist, but writing a 10-line Perl script to bin them might be quicker than searching. I'd wager you'd be done before someone chimes in with the library to do it :)
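A minimal sketch of that binning script, written in Python rather than Perl since the question asks about Python or R. It assumes one numeric value per line and a known minimum, maximum and bin width, as stated in the question; the function name `stream_histogram` and the clamping of out-of-range values into the edge bins are choices made here for illustration.

```python
import io


def stream_histogram(lines, lo, hi, bin_width):
    """Count values per fixed-width bin while reading one line at a time.

    Memory use depends on the number of bins, not on the number of values.
    """
    n_bins = int((hi - lo) // bin_width) + 1
    counts = [0] * n_bins
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        value = float(line)
        idx = int((value - lo) // bin_width)
        # Assumption: values outside [lo, hi] are clamped into the edge bins.
        idx = min(max(idx, 0), n_bins - 1)
        counts[idx] += 1
    return counts


# Small in-memory demo; for the real data, pass an open file handle instead.
demo = io.StringIO("0\n5\n10\n99\n50\n")
for i, count in enumerate(stream_histogram(demo, lo=0, hi=99, bin_width=10)):
    print(f"{i * 10}\t{count}")  # bin start, count (tab-separated)
```

The resulting per-bin counts are tiny compared with the input, so they can be written to a two-column file and plotted with something like R's `barplot` or matplotlib's `bar`.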

score 2 · Accepted Answer · 2016-03-11

9.0 years ago

biocyberman ▴ 870

This is exactly the selling point of Datashader: http://datashader.readthedocs.org/en/latest (even though I tend to filter and subset the data whenever possible).

9.0 years ago by biocyberman ▴ 870