Question

Correct format for Hi-C matrix before converting to BEDPE

0

Entering edit mode

9 weeks ago

desptsp • 0

I am working with interchromosomal Hi-C interaction data and need to extract a matrix from a .hic file using Juicer Tools (v2.13.06). My goal is to convert this matrix into BEDPE format for downstream analysis.

Currently, when I use the following command:

java -jar juicer_tools_2.13.06.jar dump observed NONE contact_matrix_GRCh38.hic 1 2 BP 1000000 chr1_chr2.matrix

i get an output like this :

0    0    13.0  
1000000    0    79.0  
2000000    0    99.0  
3000000    0    153.0  
...

So i have the following questions if anyone can answer it would be very careful : 1)What is the correct matrix format before converting to BEDPE? 2)Should the output include chromosome names, or do I need to add them manually? 3)Are there any additional steps I should take to format the data correctly for BEDPE conversion?

Thank you very much!

juicer matrix bedpe HI-C • 660 views

ADD COMMENT • link updated 8 weeks ago by kalavattam ▴ 350 • written 9 weeks ago by desptsp • 0

0

Entering edit mode

helpful I meant **

ADD REPLY • link 9 weeks ago by desptsp • 0

score 0 · Answer 1 · 2025-03-02

1)What is the correct matrix format before converting to BEDPE?

The format you're getting is correct. See below for details.

2)Should the output include chromosome names, or do I need to add them manually?

You need to add them manually. See below for an example of how to do this.

3)Are there any additional steps I should take to format the data correctly for BEDPE conversion?

Yeah. There are a bunch of ways to do this. See below for an example.

The Juicer Tools (juicer_tools_2.13.06.jar) dump command outputs observed matrices in sparse matrix format with three columns:

start1    start2    contact_frequency

So, your output is expected and correctly formatted.

Let's break down your call to Juicer Tools dump:

java -jar juicer_tools_2.13.06.jar dump \
    observed NONE contact_matrix_GRCh38.hic \
    1 2 BP 1000000 \
    chr1_chr2.matrix

You specified the following:

Observed contacts: observed
Unadjusted (non-normalized) contacts: NONE
Interchromosomal contacts between chromosomes 1 and 2: 1 2
Binned at 1,000,000-bp resolution: BP 1000000
Input file: contact_matrix_GRCh38.hic
Output file: chr1_chr2.matrix

The first four lines of output look like this:

0    0    13.0  
1000000    0    79.0  
2000000    0    99.0  
3000000    0    153.0  
...

Each row represents a contact between two bins:

Column 1 (start1): Start position of a bin in chromosome 1
Column 2 (start2): Start position of a bin in chromosome 2
Column 3 (contact_frequency): Number of Hi-C contacts (unadjusted because NONE) between these two bins

Using this info, you can convert the sparse matrix into a BEDPE file with AWK; for example:

#  Set some parameters
hic="contact_matrix_GRCh38.hic"
chr_a="1"
chr_b="2"
bin=1000000

#  Extract interchromosomal contacts between chromosomes 1 and 2
java -jar juicer_tools.jar dump \
    observed NONE "${hic}" \
    "${chr_a}" "${chr_b}" BP "${bin}" \
    "${chr_a}_${chr_b}.matrix"

#  Convert sparse matrix to BEDPE format
awk \
    -v chr_a="${chr_a}" \
    -v chr_b="${chr_b}" \
    -v bin="${bin}" \
    'BEGIN { OFS="\t" } {
        start_1 = $1;
        end1 = start_1 + bin;
        start_2 = $2;
        end2 = start_2 + bin;
        score = $3;
        print chr_a, start_1, end1, chr_b, start_2, end2, ".", score, ".", "."
    }' \
    "${chr_a}_${chr_b}.matrix" \
        > "${chr_a}_${chr_b}.bedpe"

The above code uses the chromosome and bin information to fill in and wrangle the sparse matrix data, generating a BEDPE file with the following format:

chrom_a    start_a    end_a    chrom_b    start_b    end_b    name    score    strand_a    strand_b

It is standard to assign . (periods) to name, strand_a, and strand_b when those column are unused. See here for more details on the BEDPE format.

If you want to generate BEDPE files for all interchromosomal pairs while avoiding redundant ones (e.g., skipping 2 by 1 since 1 by 2 is already processed), you can do this through Shell scripting; for example:

#  Function for sparse matrix generation
function generate_sparse() {
    local jar="${1}"    # Path to the Juicer Tools .jar file
    local cntc="${2}"   # Contact type, e.g., "observed" or "expected"
    local nrm="${3}"    # Normalization type, e.g., "NONE", "SCALE", etc.‡
    local hic="${4}"    # Path to .hic input file
    local chr_a="${5}"  # First chromosome
    local chr_b="${6}"  # Second chromosome
    local typ="${7}"    # "BP" or "FRAG"
    local bin="${8}"    # Bin size in bp
    local mat="${9}"    # Path to .matrix output file

    java -jar "${jar}" dump \
        "${cntc}" "${nrm}" "${hic}" \
        "${chr_a}" "${chr_b}" "${typ}" "${bin}" \
        "${mat}"
}  #  ‡For "balanced" contacts, use "KR" for Juicer v1, "SCALE" for Juicer v2


#  Function for AWK conversion from sparse matrix to BEDPE
function convert_sparse_bedpe() {
    local chr_a="${1}"  # First chromosome
    local chr_b="${2}"  # Second chromosome
    local bin="${3}"    # Bin size in bp
    local mat="${4}"    # Path to .matrix input file
    local bedpe="${5}"  # Path to .bedpe output file

    awk \
        -v chr_a="${chr_a}" \
        -v chr_b="${chr_b}" \
        -v bin="${bin}" \
        'BEGIN { OFS="\t" } {
            start_1 = $1;
            end1 = start_1 + bin;
            start_2 = $2;
            end2 = start_2 + bin;
            score = $3;
            print chr_a, start_1, end1, chr_b, start_2, end2, ".", score, ".", "."
        }' \
        "${mat}" \
            > "${bedpe}"
}


#  Function to check file existence (e.g., MATRIX, BEDPE files)
function check_file() {
    local fil="${1}"
    local nam="${2}"

    if [[ ! -s "${fil}" ]]; then
        echo "Error: ${nam} file was not generated: '${fil}'." >&2
        return 1
    fi
}


#  Define parameters (change as needed)
jar="${HOME}/path/to/juicer_tools_2.13.06.jar"
hic="${HOME}/path/to/contact_matrix_GRCh38.hic"
cntc="observed"
nrm="NONE"
typ="BP"
bin=1000000
out_mat="${HOME}/path/to/output/matrices"
out_bdp="${HOME}/path/to/output/bedpes"

#  Create an array of chromosomes to process
chrs=( $(seq 1 22) X )

#  Create output directories
mkdir -p "${out_mat}" "${out_bdp}"

#  Run serial BEDPE file generation, avoiding redundant pairings
for ((i = 0; i < ${#chrs[@]}; i++)); do
    for ((j = i + 1; j < ${#chrs[@]}; j++)); do
        chr_a="${chrs[i]}"
        chr_b="${chrs[j]}"
        fil_mat="${out_mat}/${chr_a}_${chr_b}.matrix"
        fil_bdp="${out_bdp}/${chr_a}_${chr_b}.bedpe"

        #  Generate sparse matrix
        generate_sparse \
            "${jar}" "${cntc}" "${nrm}" "${hic}" \
            "${chr_a}" "${chr_b}" "${typ}" "${bin}" \
            "${fil_mat}"

        #  Check that MATRIX file was created
        check_file "${fil_mat}" "MATRIX"

        #  Convert sparse matrix to BEDPE
        convert_sparse_bedpe \
            "${chr_a}" "${chr_b}" "${bin}" \
            "${fil_mat}" "${fil_bdp}"

        #  Check that BEDPE file was created
        check_file "${fil_bdp}" "BEDPE"
    done
done

To speed things up, you can use GNU Parallel to distribute execution across multiple CPU cores. Note that the below _does not_ avoid redundant pairs (e.g., 1 by 2 and 2 by 1 will both be processed).

## OPTIONAL: Run parallel generation of MATRIX and BEDPE with GNU Parallel ##

#  Export functions for use in GNU Parallel
export -f generate_sparse
export -f convert_sparse_bedpe

job=4  ## Adjust based on system resources ##

## WARNING: Redundant pairs will be processed, e.g., both 1 by 2 and 2 by 1 ##

parallel --jobs "${job}" \
    'generate_sparse {1} {2} {3} {4} {5} {6} {7} {8} {9}/{5}_{6}.matrix'
    ::: "${jar}" \
    ::: "${cntc}" \
    ::: "${nrm}" \
    ::: "${hic}" \
    ::: "${chrs[@]}" \
    :::+ "${chrs[@]}" \
    ::: "${typ}" \
    ::: "${bin}" \
    ::: "${out_mat}"

parallel --jobs "${job}" \
    'convert_sparse_bedpe {1} {2} {3} {4}/{1}_{2}.matrix {5}/{1}_{2}.bedpe' \
    ::: "${chrs[@]}" \
    :::+ "${chrs[@]}" \
    ::: "${bin}" \
    ::: "${out_mat}" \
    ::: "${out_bdp}"