multi-sample binary indexed coverage file format
4.2 years ago

I am using wiggleplotr in a Shiny app to visualize read coverage across sets of single-cell bigWig files. I chose this approach because I cannot afford the memory requirements of loading all read coverage into memory; instead, specific genomic regions can be queried directly from the bigWig files. This requires that I keep track of the file path of the corresponding bigWig for every cell.

Right now I am using a SQLite database to keep track of these file paths, but this seems brittle, and it is a challenge to keep it up to date. Is there a binary, indexed, multi-sample file format that I could use instead? Can a bigWig be multi-sample?
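One way to make the path database less brittle might be to treat it as a derived artifact that is rebuilt from the directory contents rather than maintained by hand. Below is a minimal sketch of that idea in Python using only the standard library. The `<cell_id>.bw` naming convention, the `bigwigs` table, and the function names `rebuild_manifest` and `lookup` are assumptions made for illustration, not part of the original setup.

```python
import sqlite3
import tempfile
from pathlib import Path


def rebuild_manifest(db_path, bigwig_dir):
    """Recreate the cell -> bigWig path table from the files currently on disk."""
    # Assumed naming convention: each file is called <cell_id>.bw
    rows = [(p.stem, str(p.resolve())) for p in sorted(Path(bigwig_dir).glob("*.bw"))]
    con = sqlite3.connect(db_path)
    with con:  # commits on success, rolls back on error
        con.execute("DROP TABLE IF EXISTS bigwigs")
        con.execute("CREATE TABLE bigwigs (cell_id TEXT PRIMARY KEY, path TEXT NOT NULL)")
        con.executemany("INSERT INTO bigwigs VALUES (?, ?)", rows)
    con.close()
    return len(rows)


def lookup(db_path, cell_ids):
    """Return {cell_id: path} for the requested cells, skipping unknown ids."""
    con = sqlite3.connect(db_path)
    placeholders = ",".join("?" * len(cell_ids))
    query = f"SELECT cell_id, path FROM bigwigs WHERE cell_id IN ({placeholders})"
    found = dict(con.execute(query, list(cell_ids)))
    con.close()
    return found


if __name__ == "__main__":
    # Demonstration with empty placeholder files instead of real bigWigs.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("cellA.bw", "cellB.bw"):
            (Path(tmp) / name).touch()
        db = str(Path(tmp) / "manifest.sqlite")
        print(rebuild_manifest(db, tmp), "files indexed")
        print(lookup(db, ["cellA"]))
```

Because the table is dropped and recreated on every run, stale entries cannot accumulate. This does not answer the multi-sample format question; it only reduces the bookkeeping burden.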

RNA-Seq sequencing R • 1.7k views

bedGraph can contain arbitrarily many columns and could be considered binary (bgzip'd) and indexable (tabix), but it is not as efficient as a true binary encoding. I would also be curious about this.
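To illustrate what a multi-column bedGraph would contain, here is a rough Python sketch that merges per-sample intervals on one chromosome into rows with one value column per sample, splitting at every interval boundary. The function names `union_bedgraph` and `format_rows` and the input layout are assumptions for illustration; the input is assumed to be sorted, non-overlapping, half-open intervals that have already been extracted from each sample.

```python
def union_bedgraph(samples):
    """Merge per-sample intervals on one chromosome into multi-column rows.

    samples: dict mapping sample name -> list of (start, end, value),
             sorted, non-overlapping, half-open [start, end).
    Returns a list of (start, end, [value per sample]), with 0.0 where a
    sample has no interval.
    """
    names = list(samples)
    # Every interval boundary in any sample is a potential row boundary.
    breaks = sorted({pos for ivs in samples.values() for s, e, _ in ivs for pos in (s, e)})
    cursors = {name: 0 for name in names}
    rows = []
    for left, right in zip(breaks, breaks[1:]):
        values = []
        for name in names:
            ivs = samples[name]
            i = cursors[name]
            # Skip intervals that end at or before the start of this segment.
            while i < len(ivs) and ivs[i][1] <= left:
                i += 1
            cursors[name] = i
            covered = i < len(ivs) and ivs[i][0] <= left
            values.append(ivs[i][2] if covered else 0.0)
        if any(values):  # drop segments where every sample is zero
            rows.append((left, right, values))
    return rows


def format_rows(chrom, rows):
    """Render rows as tab-separated lines: chrom, start, end, one column per sample."""
    return [
        "\t".join([chrom, str(s), str(e)] + [f"{v:g}" for v in vals])
        for s, e, vals in rows
    ]
```

A file written this way could then be compressed with bgzip and indexed with tabix (for example with its BED preset), as suggested above. The number of rows grows with the number of distinct breakpoints across all samples, which may be large for thousands of cells.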


I also thought I could write GenomicRanges objects to an HDF5 file, but there doesn't seem to be a method for this in rhdf5?


GenomicSQLite also seems like an interesting approach, though it doesn't have supported R bindings.
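The general idea of keeping coverage intervals for many samples in one SQLite file can be sketched with Python's built-in `sqlite3` module. This is plain SQLite, not the extension's API, and the `coverage` table, its column names, and the helper functions are assumptions chosen for illustration.

```python
import sqlite3


def build_coverage_db(intervals):
    """Load (sample, chrom, start, end, value) tuples into an in-memory database."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE coverage ("
        "sample TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER, value REAL)"
    )
    con.executemany("INSERT INTO coverage VALUES (?, ?, ?, ?, ?)", intervals)
    # An ordinary B-tree index; it narrows the search by chromosome and start only.
    con.execute("CREATE INDEX coverage_region ON coverage (chrom, start_pos)")
    con.commit()
    return con


def query_region(con, chrom, start, end):
    """Return intervals from every sample that overlap the half-open region [start, end)."""
    sql = (
        "SELECT sample, start_pos, end_pos, value FROM coverage "
        "WHERE chrom = ? AND start_pos < ? AND end_pos > ? "
        "ORDER BY sample, start_pos"
    )
    return con.execute(sql, (chrom, end, start)).fetchall()
```

A single query returns the overlapping intervals for all samples at once, so no per-cell file paths need to be tracked. Whether this is competitive with bigWig in size or speed is untested here.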

2.8 years ago
Kaur ▴ 20

Did you find a solution to this? The D4 file format (https://github.com/38/d4-format) looks promising, but it's still single-sample and there's no R package. Another option might be to store the coverage information in indexed Parquet files and use the arrow R package to read specific regions. A third might be to use DuckDB (https://duckdb.org/2021/06/25/querying-parquet.html) on top of a large number of Parquet files (e.g. one per sample).

I haven't tried any of those, but we are starting to run into a similar problem and would love to find a solution. The solution might involve making some changes to the wiggleplotr package, but that can be done if there's a compelling argument.


Both D4 and Parquet files have the issue of one sample per file. It also seems like Parquet files offer no advantage in terms of file size. Is reading faster than with bigWigs?

I'll be very interested if you do find a promising way forward and decide to make changes to wiggleplotr. Thanks for a very useful package!


Would it be possible to intersect all of the sample bigWigs and write the result to a single Parquet file with coverage information for each sample, then query it with dbplyr? Would this be less efficient?
