How can I best plot a histogram for billions of genotype quality values?
I have a simple one-column file with billions of genotype quality values. The file is several GB uncompressed.
Is there a statistics library in Python or R that can build up a histogram by streaming through the data, instead of loading everything into memory and then creating the histogram? I prefer using all of the data rather than sampling it.
Or do I have to write a script first to collect the counts per bin and then give those counts per bin to R for plotting? This functionality feels like it should already exist in a stats library somewhere.
I know the min and max of the values and would be able to specify a bin size.
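For reference, the "collect the counts per bin first" route can be sketched in a few lines of plain Python. This is only a minimal sketch under assumptions: the function name `stream_histogram` and its parameters are made up for illustration, the input is assumed to hold one numeric value per line, and values outside the known min/max are clamped into the first or last bin rather than rejected.

```python
import math
from typing import Iterable, List


def stream_histogram(lines: Iterable[str], vmin: float, vmax: float,
                     bin_size: float) -> List[int]:
    """Count values per fixed-width bin while reading one line at a time.

    Hypothetical helper: only the bin counters are kept in memory,
    never the values themselves.
    """
    n_bins = max(1, math.ceil((vmax - vmin) / bin_size))
    counts = [0] * n_bins
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        value = float(line)
        idx = int((value - vmin) // bin_size)
        # Assumption: clamp so that value == vmax (and any out-of-range
        # value) lands in the nearest edge bin instead of raising.
        idx = min(max(idx, 0), n_bins - 1)
        counts[idx] += 1
    return counts
```

A file object can be passed directly as `lines`, since iterating over it yields one line at a time. The resulting list has one small integer per bin, so it can be written out and handed to R or any other plotting tool. A pure-Python loop over billions of lines will be slow but needs almost no memory.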
What kind of graph do you need? qual = f(pos)?
I've always been intrigued by the potential of using a data log visualizer like RRD (see http://oss.oetiker.ch/rrdtool/index.en.html) to visualize genomic data (time would be replaced by the genome coordinate).
Could you motivate this? If you want to produce a histogram to visualize the distribution of quality values, I don't see why you need billions of data points, especially since the range of quality values is discrete and not very large. After you have collected a million or so data points at random, you have a pretty good estimate of the whole thing.
Sampling also adds complexity and the risk of doing it wrong. I, at least, would need to think about how to do it correctly. In any case, the frequency of the analysis and the size of the data still allow processing the whole collection.
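For what it's worth, if sampling were ever needed, one standard single-pass way to draw a uniform random sample from a stream of unknown length is reservoir sampling (Algorithm R). The following is a minimal sketch, not a recommendation over the full count: the function name `reservoir_sample` and its parameters are hypothetical, and the input is again assumed to hold one numeric value per line.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(lines: Iterable[str], k: int,
                     seed: Optional[int] = None) -> List[float]:
    """Return up to k values drawn uniformly at random in a single pass."""
    rng = random.Random(seed)
    sample: List[float] = []
    seen = 0  # number of non-blank values read so far
    for line in lines:
        line = line.strip()
        if not line:
            continue
        seen += 1
        value = float(line)
        if len(sample) < k:
            # Fill the reservoir with the first k values.
            sample.append(value)
        else:
            # Keep the new value with probability k / seen by picking
            # a uniform slot in [0, seen) and replacing only if it
            # falls inside the reservoir.
            j = rng.randrange(seen)
            if j < k:
                sample[j] = value
    return sample
```

Memory use is bounded by `k`, but the whole file still has to be read once, so this saves memory rather than I/O time.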