Hi,
I'm currently writing a program that outputs .wig files, in either fixedStep or variableStep format. The data is continuous and genome-wide, but there are long stretches of 0-values (that I want to preserve as 0), which led me to consider using variableStep. It is much faster for I/O purposes to write out a variableStep wig file.
However, when I write the same underlying data in both wig formats, the resulting bigWig files differ vastly in both size and metadata. I have verified that the entire continuous dataset is identical when read back from both bigWig files.
In my case (chr19, mm39), the bigWig created from the variableStep wig is over 10x the size on disk of the one created from fixedStep. bigWigInfo reports significant differences in bases covered and in data and index sizes:
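For reference, here is a minimal sketch of how the same values could be laid out in each wig format. The function names are hypothetical, and it assumes 1-based positions, one value per base (span=1), and zeros written out explicitly; my actual writer may differ on those points.

```python
def fixed_step_lines(chrom, values, start=1, step=1, span=1):
    """Yield fixedStep wig lines: one declaration, then one value per line."""
    yield f"fixedStep chrom={chrom} start={start} step={step} span={span}"
    for value in values:
        yield f"{value:g}"


def variable_step_lines(chrom, values, start=1, span=1):
    """Yield variableStep wig lines: one declaration, then 'position value' per line."""
    yield f"variableStep chrom={chrom} span={span}"
    for offset, value in enumerate(values):
        # Each data line carries its own 1-based position.
        yield f"{start + offset} {value:g}"


if __name__ == "__main__":
    signal = [0, 0, 1.5]  # toy data, not from the real dataset
    print("\n".join(fixed_step_lines("chr19", signal)))
    print("\n".join(variable_step_lines("chr19", signal)))
```

The difference visible here is that every variableStep data line repeats an explicit position, while fixedStep positions are implied by start and step.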
From variableStep wig:
- primaryDataSize: 191,590,560
- primaryIndexSize: 196,697,492
- basesCovered: 193,879,202
From fixedStep wig:
- primaryDataSize: 9,622,438
- primaryIndexSize: 1,930,824
- basesCovered: 61,420,004
I cannot find a source explaining the technical differences. Intuitively, I would think that each variableStep region is indexed separately, but I cannot confirm this. Nor am I certain about the performance implications of either format for random access. I have looked at the original paper (https://pmc.ncbi.nlm.nih.gov/articles/PMC2922891/).
Any explanation or a source would be greatly appreciated.
Thanks!
The paper you linked is the same one I mentioned I had already gone through. As far as I can tell, it does not distinguish how variableStep and fixedStep wig files are handled. I assume, as you mentioned, that there is some difference in indexing, but I'm not certain what the implications are, if any. Nor can I find a definitive source explaining why the difference exists, why I would choose one format over the other, or what the practical tradeoff would be.
There is some difference in what data are stored; take a look at the supplementary data document for bigWig format details. You might also want to look at the source code for BedGraph to bigWig conversion to see how indexing is done for the different step-formats.
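As a rough illustration of the stored-data difference, here is a sketch based on my reading of the section layout in that supplement: a 24-byte section header followed by items of 12 bytes (bedGraph: start, end, value), 8 bytes (variableStep: start, value), or 4 bytes (fixedStep: value only). The little-endian byte order and the function name are assumptions for illustration, and the numbers are for uncompressed sections only; they say nothing about compression or the index.

```python
import struct

# Section header as I understand it from the format supplement:
# chromId, chromStart, chromEnd, itemStep, itemSpan (uint32 each),
# type (uint8), reserved (uint8), itemCount (uint16).
SECTION_HEADER = struct.Struct("<IIIIIBBH")

# Per-item layouts for the three section types (byte order assumed).
ITEM_FORMATS = {
    "bedGraph": struct.Struct("<IIf"),     # start, end, value
    "variableStep": struct.Struct("<If"),  # start, value
    "fixedStep": struct.Struct("<f"),      # value only
}


def uncompressed_section_bytes(section_type, item_count):
    """Bytes for one data section before compression (hypothetical helper)."""
    return SECTION_HEADER.size + ITEM_FORMATS[section_type].size * item_count


if __name__ == "__main__":
    for name in ITEM_FORMATS:
        print(name, uncompressed_section_bytes(name, 1024))
```

On these assumptions a variableStep item is twice the size of a fixedStep item before compression, so record size alone would not account for a 10x difference.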
As far as tradeoffs go, a 10x size difference would seem substantial if you need to store a lot of datasets. You might think about what your variables are, in terms of what time you have available for compression, and what disk storage you have available for compressed data.
Another piece of source that may be easier to look through is libBigWig from Devon Ryan's pyBigWig kit. Based on the comments in one header file, I'm curious to know if you are mixing fixed and variable step data in one dataset, as it appears that may create additional blocks, each with their own overhead.
No mixing. It's all fixed or all variable. My issue could be specific to wigToBigWig. I am only assuming that each variableStep declaration creates a new node in the R-tree, but perhaps only beyond some minimum interval?
Ideally, I would like to be able to reason about and justify my output choice by citing a source, rather than by reading the source code itself.