Question

multi-sample binary indexed coverage file format

1

4.2 years ago

kevin.stachelek ▴ 80

I am using wiggleplotr in a Shiny app to visualize read coverage across sets of single-cell bigwigs. I chose this approach because I cannot afford the memory requirements of loading all of the read coverage at once; instead, specific genomic regions are queried from each bigwig on demand. This requires that I keep track of the file path of the corresponding bigwig for every cell.

Right now I am using a sqlite database to keep track of these file paths but this seems brittle. It is a challenge to keep it up to date. Is there a binary-indexed multisample file format that I could use instead? Can bigwig be multisample?
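For illustration, one way to make the path database less brittle is to rebuild it from the files on disk instead of editing it by hand. This is only a minimal sketch in Python, not what the app currently does: the table name `bigwigs`, the function `sync_manifest`, and the assumption of one `<cell_id>.bw` file per cell in a single directory are all hypothetical.

```python
import sqlite3
from pathlib import Path


def sync_manifest(db: sqlite3.Connection, bigwig_dir: Path) -> None:
    """Make the manifest table mirror the *.bw files currently on disk.

    Assumption (hypothetical layout): one file per cell, named <cell_id>.bw.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS bigwigs "
        "(cell_id TEXT PRIMARY KEY, path TEXT NOT NULL)"
    )
    # The file stem is used as the cell identifier.
    on_disk = {p.stem: str(p.resolve()) for p in bigwig_dir.glob("*.bw")}
    with db:  # one transaction: the whole sync is applied, or none of it
        db.execute("DELETE FROM bigwigs")
        db.executemany(
            "INSERT INTO bigwigs (cell_id, path) VALUES (?, ?)",
            on_disk.items(),
        )
```

Running something like this at app start-up would drop entries for deleted files and pick up new ones, at the cost of a directory scan each time.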

RNA-Seq sequencing R • 1.7k views

ADD COMMENT • link 2.8 years ago by kevin.stachelek ▴ 80

1

bedgraph can contain arbitrarily many columns, and once it is bgzip-compressed and tabix-indexed it could be considered binary and indexable, but it is not as efficient as a true binary encoding. I would also be curious about this.
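To make the multi-column idea concrete, here is a rough sketch in Python of merging per-sample intervals into one row per segment with one value column per sample. The function names `union_bedgraph` and `to_tsv` and the toy data are made up for illustration, and the sketch handles a single chromosome only. The resulting text could then be compressed with bgzip and indexed with `tabix -p bed`.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int, float]  # (start, end, value); 0-based, half-open


def union_bedgraph(
    samples: Dict[str, List[Interval]]
) -> List[Tuple[int, int, List[float]]]:
    """Merge per-sample intervals on ONE chromosome into multi-column rows.

    Each sample's intervals must be sorted and non-overlapping.
    Positions a sample does not cover are reported as 0.0.
    """
    names = sorted(samples)
    # Every interval boundary in any sample becomes a row boundary.
    breakpoints = sorted(
        {pos for ivs in samples.values() for s, e, _ in ivs for pos in (s, e)}
    )
    cursors = {name: 0 for name in names}
    rows = []
    for left, right in zip(breakpoints, breakpoints[1:]):
        values = []
        for name in names:
            ivs = samples[name]
            i = cursors[name]
            # Skip intervals that end at or before this segment starts.
            while i < len(ivs) and ivs[i][1] <= left:
                i += 1
            cursors[name] = i
            covered = i < len(ivs) and ivs[i][0] <= left
            values.append(ivs[i][2] if covered else 0.0)
        if any(values):  # keep the output sparse: drop all-zero segments
            rows.append((left, right, values))
    return rows


def to_tsv(chrom: str, rows: List[Tuple[int, int, List[float]]]) -> List[str]:
    """Format rows as tab-separated lines: chrom, start, end, one value per sample."""
    return [
        "\t".join([chrom, str(s), str(e)] + [f"{v:g}" for v in vals])
        for s, e, vals in rows
    ]


samples = {
    "cellA": [(0, 10, 1.0), (10, 20, 3.0)],
    "cellB": [(5, 15, 2.0)],
}
lines = to_tsv("chr1", union_bedgraph(samples))
```

Columns follow the sorted sample names, so a header line or a sidecar file listing them would be needed to map columns back to cells.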

ADD REPLY • link 2.8 years ago by cmdcolin ★ 4.0k

0

I also thought I could write GenomicRanges objects to an HDF5 file, but it doesn't seem like there's a method for this in rhdf5?

ADD REPLY • link 4.2 years ago by kevin.stachelek ▴ 80

0

GenomicSQLite also seems like an interesting approach, though it doesn't have supported R bindings.

ADD REPLY • link 4.2 years ago by kevin.stachelek ▴ 80

score 2 · Answer 1 · 2022-02-21

2

2.8 years ago

Kaur ▴ 20

Did you find a solution to this? The D4 file format (https://github.com/38/d4-format) looks promising, but it's still single-sample and there's no R package. Another option might be to store the coverage information in indexed parquet files and use the arrow R package to read specific regions. A third might be to use DuckDB (https://duckdb.org/2021/06/25/querying-parquet.html) on top of a large number of parquet files (e.g. one per sample).

I haven't tried any of those, but we are starting to run into a similar problem and would love to find a solution. The solution might involve making some changes to the wiggleplotr package, but that can be done if there's a compelling argument.

ADD COMMENT • link 2.8 years ago by Kaur ▴ 20

0

Both D4 and parquet files have the issue of one sample/one file. It also seems like parquet files offer no advantage in terms of file size. Is reading faster than with bigwigs?

I'll be very interested if you do find a promising way forward and decide to make changes to wiggleplotr. Thanks for a very useful package!

ADD REPLY • link 2.8 years ago by kevin.stachelek ▴ 80

0

Would it be possible to intersect all of the sample bigwigs and write the result to a single parquet file with a coverage column for each sample, then query it with dbplyr? Would this be less efficient?
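As a rough sketch of what such a query could look like, here is the wide layout with an overlap filter, written in Python against an in-memory SQLite table standing in for the parquet file. The table name `coverage`, the columns `cell_a` and `cell_b`, and the toy rows are all hypothetical. dbplyr translates dplyr verbs to SQL, so it would presumably generate something similar to the `SELECT` below.

```python
import sqlite3


def build_wide_table(rows):
    """Load (chrom, start, end, cell_a, cell_b) rows into an in-memory table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE coverage ("
        "chrom TEXT, start_pos INTEGER, end_pos INTEGER, cell_a REAL, cell_b REAL)"
    )
    db.executemany("INSERT INTO coverage VALUES (?, ?, ?, ?, ?)", rows)
    # Index on position so a region query can avoid scanning every row.
    db.execute("CREATE INDEX coverage_pos ON coverage (chrom, start_pos)")
    db.commit()
    return db


def query_region(db, chrom, start, end):
    """Return rows overlapping the half-open region [start, end), all samples at once."""
    return db.execute(
        "SELECT start_pos, end_pos, cell_a, cell_b FROM coverage "
        "WHERE chrom = ? AND start_pos < ? AND end_pos > ? "
        "ORDER BY start_pos",
        (chrom, end, start),
    ).fetchall()


# Toy data: one row per merged segment, one value column per cell.
rows = [
    ("chr1", 0, 5, 1.0, 0.0),
    ("chr1", 5, 10, 1.0, 2.0),
    ("chr1", 10, 15, 3.0, 2.0),
    ("chr2", 0, 5, 4.0, 4.0),
]
db = build_wide_table(rows)
hits = query_region(db, "chr1", 4, 11)
```

One query returns every sample's coverage for the region, which removes the per-cell file lookup. Whether it is faster or smaller than per-cell bigwigs would need benchmarking, especially with thousands of cell columns.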

ADD REPLY • link 2.8 years ago by kevin.stachelek ▴ 80