Is there a way to subsample Hi-C at Hi-C matrix .cool file?
1
0
Entering edit mode
3.6 years ago
sckinta ▴ 730

To make Hi-C samples from different biological conditions comparable in downstream analysis (eg. call hi-C loops), one way is to subsample Hi-C data. Although subsample at bam file level is the regular way to go, I can only access .cool file at this moment. Is there a way to subsample Hi-C at .cool file? I want to use subsampled Hi-C matrix to call loops using fit-hic-2.

Hi-C loops • 1.6k views
ADD COMMENT
0
Entering edit mode
15 months ago
kalavattam ▴ 280

Yes, there's a cooltools (link) utility, random-sample (link), to do so.

You can also use, for example, Numpy to downsample cool files:

#!/usr/bin/env python
"""
Randomly sample (without replacement) a .cool file.
"""
import argparse
import cooler
import h5py
import numpy as np
import sys


def main():
    ap = argparse.ArgumentParser(
        description='Fast random sampling of a .cool file.'
                    'Required memory: 4 * total_reads bytes.',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    ap.add_argument(
        "-i",
        "--infile",
        dest="infile",
        required=True,
        type=str,
        help="Input *.cool to downsample"
    )
    ap.add_argument(
        "-o",
        "--outfile",
        dest="outfile",
        required=True,
        type=str,
        help="Output downsampled *.cool"
    )
    ap.add_argument(
        "-s",
        "--sample_size",
        dest="sample_size",
        required=True,
        type=float,
        help="If integer, number of reads to sample;"
             "if float, fraction of reads to sample"
    )

    args = ap.parse_args()

    infile = args.infile
    outfile = args.outfile
    sample_size = args.sample_size

    fh = h5py.File(infile, 'r')

    w = fh['pixels/count'][:]
    w = w.astype('int64')
    wlen = len(w)
    wsum = np.sum(w)

    if sample_size <= 1:
        sample_size = sample_size * wsum

    sample_size = int(sample_size)
    print(
        infile, '; sampling', sample_size, 'reads from', wsum, '...',
        file=sys.stderr
    )

    ws_bins, ws_counts = np.unique(
        np.random.choice(
            np.repeat(np.arange(wlen), w), size=sample_size, replace=False
        ),
        return_counts=True
    )

    y = np.zeros(wlen)
    y[ws_bins] = ws_counts

    print('Writing...', file=sys.stderr)

    c = cooler.Cooler(fh)
    cooler.create_cooler(
        outfile,
        bins=c.bins()[:],
        pixels={
            'bin1_id': fh['pixels/bin1_id'][:][ws_bins],
            'bin2_id': fh['pixels/bin2_id'][:][ws_bins],
            'count': ws_counts
        },
        assembly=fh.attrs['genome-assembly'],
        ordered=True
    )

    fh.close()

    print('Done.', file=sys.stderr)


main()
ADD COMMENT
0
Entering edit mode

The program FAN-C also has a utitlity for random subsampling of a matrix: fanc downsample / fanc hic --downsample

For example, via the CLI:

#!/bin/bash

fanc downsample --help

# 2023-10-10 15:21:17,645 INFO FAN-C version: 0.9.27
# *** fanc downsample is deprecated. Please use fanc hic --downsample instead! ***
# usage: fanc downsample [-h] [-tmp] hic n output
# 
# Downsample contacts from a Hic object.
# 
# positional arguments:
#   hic                  Hic object to be downsampled.
#   n                    Sample size or reference Hi-C object. If sample size is < 1,will be interpreted as a fraction of valid pairs.
#   output               Downsampled Hic output.
# 
# options:
#   -h, --help           show this help message and exit
#   -tmp, --work-in-tmp  Work in temporary directory

Or, for example, via the Python API:

#!/usr/bin/env python3

import sys
import os
import fanc
import numpy as np


def calculate_contact_sum(hic_file):
    hic = fanc.load(hic_file)
    contact_sum = 0
    for value in hic.edge_data(hic._default_score_field, lazy=True):
        if np.isfinite(value):
            contact_sum += value
    return contact_sum, hic


file_1 = "path/to/file/sample_1.hic"
file_2 = "path/to/file/sample_2.hic"
#  Works with .cool files as well

sum_1, hic_1 = calculate_contact_sum(file_1)
sum_2, hic_2 = calculate_contact_sum(file_2)

sums = {'file_1': sum_1, 'file_2': sum_2}
min_key = min(sums, key=sums.get)
min_value = sums[min_key]

hic_1.downsample(
    min_value,
    file_name = "path/to/outfile_1.hic"  # or .cool
)
hic_2.downsample(
    min_value,
    file_name = "path/to/outfile_2.hic"  # or .cool
)
ADD REPLY

Login before adding your answer.

Traffic: 1995 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6