Question

bigWig size difference between variableStep and fixedStep wig files

8 months ago

notthebmovieactor • 0

Hi,

I'm currently writing a program that writes its output in .wig format, as either fixedStep or variableStep. The data are continuous and genome-wide, but there are long stretches of 0 values (which I want to preserve as 0), and that led me to consider variableStep. Writing a variableStep wig file is also much faster in terms of I/O.
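For readers unfamiliar with the two layouts, here is a minimal sketch of how the same per-base values could be written either way. The function names, the step=1/span=1 settings, and the toy values are assumptions for illustration only; they are not the poster's actual program or settings.

```python
from typing import List


def to_fixed_step(chrom: str, values: List[float], start: int = 1) -> str:
    """Render values as a fixedStep wig block: one header, one value per line."""
    # Assumption: one value per base (step=1, span=1), 1-based start.
    lines = [f"fixedStep chrom={chrom} start={start} step=1 span=1"]
    lines.extend(f"{v:g}" for v in values)
    return "\n".join(lines) + "\n"


def to_variable_step(chrom: str, values: List[float], start: int = 1) -> str:
    """Render values as a variableStep wig block: 'position value' per line."""
    # Assumption: every base is written explicitly, zeros included.
    lines = [f"variableStep chrom={chrom} span=1"]
    lines.extend(f"{start + i} {v:g}" for i, v in enumerate(values))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy = [0.0, 0.0, 1.5, 2.0, 0.0]  # hypothetical values
    print(to_fixed_step("chr19", toy), end="")
    print(to_variable_step("chr19", toy), end="")
```

Both blocks describe the same five bases. The only difference in the text is that variableStep repeats the position on every data line, while fixedStep states the start and step once in the header.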

However, when I try both wig formats on the same underlying data, the resulting bigWig files differ vastly in both size and metadata. I have verified that the full continuous dataset read back from the two bigWig files is identical.

For my test case (chr19, mm39), the bigWig created from the variableStep wig is over 10x the size on disk of the one created from the fixedStep wig. bigWigInfo reports large differences in bases covered and in data and index sizes:

From variableStep wig:

primaryDataSize: 191,590,560
primaryIndexSize: 196,697,492
basesCovered: 193,879,202

From fixedStep wig:

primaryDataSize: 9,622,438
primaryIndexSize: 1,930,824
basesCovered: 61,420,004
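To put the two reports side by side, the following sketch computes per-field ratios from the numbers quoted above. The dictionary layout and helper names are illustrative assumptions. Note also that the sum of primaryDataSize and primaryIndexSize leaves out zoom levels and headers, so it is not the same as the file size on disk.

```python
# Values copied from the bigWigInfo output quoted in the post.
variable_step = {
    "primaryDataSize": 191_590_560,
    "primaryIndexSize": 196_697_492,
    "basesCovered": 193_879_202,
}
fixed_step = {
    "primaryDataSize": 9_622_438,
    "primaryIndexSize": 1_930_824,
    "basesCovered": 61_420_004,
}


def ratio(field: str) -> float:
    """How many times larger the variableStep-derived value is."""
    return variable_step[field] / fixed_step[field]


def primary_total(info: dict) -> int:
    """Primary data plus primary index bytes (zoom levels not included)."""
    return info["primaryDataSize"] + info["primaryIndexSize"]


if __name__ == "__main__":
    for field in ("primaryDataSize", "primaryIndexSize", "basesCovered"):
        print(f"{field}: {ratio(field):.1f}x")
    combined = primary_total(variable_step) / primary_total(fixed_step)
    print(f"primary data + index: {combined:.1f}x")
```

On these figures the primary data is roughly 20x larger, the primary index roughly 100x larger, and the reported bases covered roughly 3x larger for the variableStep-derived file.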

I cannot find a source that explains the technical differences. Intuitively, I would think that each variableStep region is indexed separately, but I cannot confirm this. Nor am I certain about any performance implications for random access between the two. I have tried looking at the original paper (https://pmc.ncbi.nlm.nih.gov/articles/PMC2922891/).

Any explanation or a source would be greatly appreciated.

Thanks!

bigWig wig • 976 views

ADD COMMENT • link 8 months ago by notthebmovieactor • 0

score 0 · Answer 1 · 2025-01-28

8 months ago

Alex Reynolds 36k

There's a paper by Kent et al. that provides more information on the bigBed and bigWig formats, which use a data structure called a "cirTree" to index intervals. A variable-step file may need a larger tree index to account for different interval step sizes and different start positions, whereas a fixed-step file would not need as complicated an index. A 10x size difference seems like a lot, but without looking at your data it is hard to say whether something unusual is going on.

ADD COMMENT • link 8 months ago by Alex Reynolds 36k

The paper you linked is the same one I mentioned I had already gone through. As far as I can tell, nothing in the paper distinguishes how it handles variableStep versus fixedStep wig files. I assume, as you mentioned, that there is some difference in indexing, but I'm not certain what the implications are, if any. Nor can I find a definitive source that explains why the difference exists, why I would choose one format over the other, or what the practical tradeoff would be.

ADD REPLY • link 8 months ago by notthebmovieactor • 0

There is some difference in what data are stored; take a look at the supplementary data document for bigWig format details. You might also want to look at the source code for BedGraph to bigWig conversion to see how indexing is done for the different step-formats.

As far as tradeoffs go, a 10x size difference would seem substantial if you need to store a lot of datasets. You might think about what your constraints are: how much time you have available for compression, and how much disk storage you have available for the compressed data.

ADD REPLY • link 8 months ago by Alex Reynolds 36k

Another codebase that may be easier to look through is libBigWig, the C library behind Devon Ryan's pyBigWig. Based on the comments in one header file, I'm curious to know whether you are mixing fixed- and variable-step data in one dataset, as it appears that may create additional blocks, each with its own overhead.

ADD REPLY • link 8 months ago by Alex Reynolds 36k

No mixing. It's all fixed or all variable. My issue could be specific to wigToBigWig. I am only assuming that each variableStep declaration creates a new node in the R-tree, but maybe only for some minimum interval?

Ideally, I would like to be able to reason about and justify my output choice by citing a source, rather than by looking it up in the source code itself.

ADD REPLY • link 8 months ago by notthebmovieactor • 0