Correct format for Hi-C matrix before converting to BEDPE
1
0
Entering edit mode
9 weeks ago
desptsp • 0

I am working with interchromosomal Hi-C interaction data and need to extract a matrix from a .hic file using Juicer Tools (v2.13.06). My goal is to convert this matrix into BEDPE format for downstream analysis.

Currently, when I use the following command:

java -jar juicer_tools_2.13.06.jar dump observed NONE contact_matrix_GRCh38.hic 1 2 BP 1000000 chr1_chr2.matrix

i get an output like this :

0    0    13.0  
1000000    0    79.0  
2000000    0    99.0  
3000000    0    153.0  
...

So i have the following questions if anyone can answer it would be very careful : 1)What is the correct matrix format before converting to BEDPE? 2)Should the output include chromosome names, or do I need to add them manually? 3)Are there any additional steps I should take to format the data correctly for BEDPE conversion?

Thank you very much!

juicer matrix bedpe HI-C • 660 views
ADD COMMENT
0
Entering edit mode

helpful I meant **

ADD REPLY
0
Entering edit mode
9 weeks ago
kalavattam ▴ 350

1)What is the correct matrix format before converting to BEDPE?

The format you're getting is correct. See below for details.

2)Should the output include chromosome names, or do I need to add them manually?

You need to add them manually. See below for an example of how to do this.

3)Are there any additional steps I should take to format the data correctly for BEDPE conversion?

Yeah. There are a bunch of ways to do this. See below for an example.


The Juicer Tools (juicer_tools_2.13.06.jar) dump command outputs observed matrices in sparse matrix format with three columns:

start1    start2    contact_frequency

So, your output is expected and correctly formatted.

Let's break down your call to Juicer Tools dump:

java -jar juicer_tools_2.13.06.jar dump \
    observed NONE contact_matrix_GRCh38.hic \
    1 2 BP 1000000 \
    chr1_chr2.matrix

You specified the following:

  • Observed contacts: observed
  • Unadjusted (non-normalized) contacts: NONE
  • Interchromosomal contacts between chromosomes 1 and 2: 1 2
  • Binned at 1,000,000-bp resolution: BP 1000000
  • Input file: contact_matrix_GRCh38.hic
  • Output file: chr1_chr2.matrix

The first four lines of output look like this:

0    0    13.0  
1000000    0    79.0  
2000000    0    99.0  
3000000    0    153.0  
...

Each row represents a contact between two bins:

  • Column 1 (start1): Start position of a bin in chromosome 1
  • Column 2 (start2): Start position of a bin in chromosome 2
  • Column 3 (contact_frequency): Number of Hi-C contacts (unadjusted because NONE) between these two bins

Using this info, you can convert the sparse matrix into a BEDPE file with AWK; for example:

#  Set some parameters
hic="contact_matrix_GRCh38.hic"
chr_a="1"
chr_b="2"
bin=1000000

#  Extract interchromosomal contacts between chromosomes 1 and 2
java -jar juicer_tools.jar dump \
    observed NONE "${hic}" \
    "${chr_a}" "${chr_b}" BP "${bin}" \
    "${chr_a}_${chr_b}.matrix"

#  Convert sparse matrix to BEDPE format
awk \
    -v chr_a="${chr_a}" \
    -v chr_b="${chr_b}" \
    -v bin="${bin}" \
    'BEGIN { OFS="\t" } {
        start_1 = $1;
        end1 = start_1 + bin;
        start_2 = $2;
        end2 = start_2 + bin;
        score = $3;
        print chr_a, start_1, end1, chr_b, start_2, end2, ".", score, ".", "."
    }' \
    "${chr_a}_${chr_b}.matrix" \
        > "${chr_a}_${chr_b}.bedpe"

The above code uses the chromosome and bin information to fill in and wrangle the sparse matrix data, generating a BEDPE file with the following format:

chrom_a    start_a    end_a    chrom_b    start_b    end_b    name    score    strand_a    strand_b

It is standard to assign . (periods) to name, strand_a, and strand_b when those column are unused. See here for more details on the BEDPE format.

If you want to generate BEDPE files for all interchromosomal pairs while avoiding redundant ones (e.g., skipping 2 by 1 since 1 by 2 is already processed), you can do this through Shell scripting; for example:

#  Function for sparse matrix generation
function generate_sparse() {
    local jar="${1}"    # Path to the Juicer Tools .jar file
    local cntc="${2}"   # Contact type, e.g., "observed" or "expected"
    local nrm="${3}"    # Normalization type, e.g., "NONE", "SCALE", etc.‡
    local hic="${4}"    # Path to .hic input file
    local chr_a="${5}"  # First chromosome
    local chr_b="${6}"  # Second chromosome
    local typ="${7}"    # "BP" or "FRAG"
    local bin="${8}"    # Bin size in bp
    local mat="${9}"    # Path to .matrix output file

    java -jar "${jar}" dump \
        "${cntc}" "${nrm}" "${hic}" \
        "${chr_a}" "${chr_b}" "${typ}" "${bin}" \
        "${mat}"
}  #  ‡For "balanced" contacts, use "KR" for Juicer v1, "SCALE" for Juicer v2


#  Function for AWK conversion from sparse matrix to BEDPE
function convert_sparse_bedpe() {
    local chr_a="${1}"  # First chromosome
    local chr_b="${2}"  # Second chromosome
    local bin="${3}"    # Bin size in bp
    local mat="${4}"    # Path to .matrix input file
    local bedpe="${5}"  # Path to .bedpe output file

    awk \
        -v chr_a="${chr_a}" \
        -v chr_b="${chr_b}" \
        -v bin="${bin}" \
        'BEGIN { OFS="\t" } {
            start_1 = $1;
            end1 = start_1 + bin;
            start_2 = $2;
            end2 = start_2 + bin;
            score = $3;
            print chr_a, start_1, end1, chr_b, start_2, end2, ".", score, ".", "."
        }' \
        "${mat}" \
            > "${bedpe}"
}


#  Function to check file existence (e.g., MATRIX, BEDPE files)
function check_file() {
    local fil="${1}"
    local nam="${2}"

    if [[ ! -s "${fil}" ]]; then
        echo "Error: ${nam} file was not generated: '${fil}'." >&2
        return 1
    fi
}


#  Define parameters (change as needed)
jar="${HOME}/path/to/juicer_tools_2.13.06.jar"
hic="${HOME}/path/to/contact_matrix_GRCh38.hic"
cntc="observed"
nrm="NONE"
typ="BP"
bin=1000000
out_mat="${HOME}/path/to/output/matrices"
out_bdp="${HOME}/path/to/output/bedpes"

#  Create an array of chromosomes to process
chrs=( $(seq 1 22) X )

#  Create output directories
mkdir -p "${out_mat}" "${out_bdp}"

#  Run serial BEDPE file generation, avoiding redundant pairings
for ((i = 0; i < ${#chrs[@]}; i++)); do
    for ((j = i + 1; j < ${#chrs[@]}; j++)); do
        chr_a="${chrs[i]}"
        chr_b="${chrs[j]}"
        fil_mat="${out_mat}/${chr_a}_${chr_b}.matrix"
        fil_bdp="${out_bdp}/${chr_a}_${chr_b}.bedpe"

        #  Generate sparse matrix
        generate_sparse \
            "${jar}" "${cntc}" "${nrm}" "${hic}" \
            "${chr_a}" "${chr_b}" "${typ}" "${bin}" \
            "${fil_mat}"

        #  Check that MATRIX file was created
        check_file "${fil_mat}" "MATRIX"

        #  Convert sparse matrix to BEDPE
        convert_sparse_bedpe \
            "${chr_a}" "${chr_b}" "${bin}" \
            "${fil_mat}" "${fil_bdp}"

        #  Check that BEDPE file was created
        check_file "${fil_bdp}" "BEDPE"
    done
done

To speed things up, you can use GNU Parallel to distribute execution across multiple CPU cores. Note that the below _does not_ avoid redundant pairs (e.g., 1 by 2 and 2 by 1 will both be processed).

## OPTIONAL: Run parallel generation of MATRIX and BEDPE with GNU Parallel ##

#  Export functions for use in GNU Parallel
export -f generate_sparse
export -f convert_sparse_bedpe

job=4  ## Adjust based on system resources ##

## WARNING: Redundant pairs will be processed, e.g., both 1 by 2 and 2 by 1 ##

parallel --jobs "${job}" \
    'generate_sparse {1} {2} {3} {4} {5} {6} {7} {8} {9}/{5}_{6}.matrix'
    ::: "${jar}" \
    ::: "${cntc}" \
    ::: "${nrm}" \
    ::: "${hic}" \
    ::: "${chrs[@]}" \
    :::+ "${chrs[@]}" \
    ::: "${typ}" \
    ::: "${bin}" \
    ::: "${out_mat}"

parallel --jobs "${job}" \
    'convert_sparse_bedpe {1} {2} {3} {4}/{1}_{2}.matrix {5}/{1}_{2}.bedpe' \
    ::: "${chrs[@]}" \
    :::+ "${chrs[@]}" \
    ::: "${bin}" \
    ::: "${out_mat}" \
    ::: "${out_bdp}"
ADD COMMENT
0
Entering edit mode

Just another quick note: These BEDPE files can be quite large—though it may not be a big issue given your focus on interchromosomal contacts at a coarse resolution.

Still, depending on your downstream analyses, it might be worth considering subsetting or filtering regions of interest rather than processing the whole dataset. You could also preprocess the data to extract statistically significant contacts using tools like Fit-Hi-C2 or HiC-DC+.

If storage is a concern, you can pipe the AWK command into gzip to compress the BEDPE output:

awk \
    -v chr_a="${chr_a}" \
    -v chr_b="${chr_b}" \
    -v bin="${bin}" \
    'BEGIN { OFS="\t" } {
        start_1 = $1;
        end1 = start_1 + bin;
        start_2 = $2;
        end2 = start_2 + bin;
        score = $3;
        print chr_a, start_1, end1, chr_b, start_2, end2, ".", score, ".", "."
    }' \
    "${mat}" \
        | gzip > "${bedpe}.gz"

You can do something similar for the matrix generation code, but then you’ll need to adjust the call to AWK to handle compressed input. The easiest approach is to decompress on-the-fly with, e.g., gunzip -c or zcat:

zcat "${compressed_mat}" \
    | awk \
        -v chr_a="${chr_a}" \
        -v chr_b="${chr_b}" \
        -v bin="${bin}" \
        'BEGIN { OFS="\t" } {
            ...
        }' \
        "${mat}" \
    | gzip > "${bedpe}.gz"
ADD REPLY
0
Entering edit mode

Thank you very much for all your answers!!! Everything was very helpful! Do you know also the instruction " straw" for juicer tools ? It converts directly from hic to bedpe file i think

ADD REPLY
0
Entering edit mode

My understanding is you will still need to post-process the output from straw to obtain data in BEDPE format. You can look into this more at the straw documentation page and in this example using pystraw. In particular, see the final bullet point and corresponding code chunk—you’ll want to adapt this for your purpose:

Extract all interchromosomal reads between chromosome 5 and chromosome 12 at 500 fragment resolution with VC (vanilla coverage) normalization:

import straw

result = straw.straw("VC", "HIC001.hic", "5", "12", "FRAG", 500)
# the values returned are in results
for i in range(len(result[0])):
   print("{0}\t{1}\t{2}".format(result[0][i], result[1][i], result[2][i]))

You'll want to change VC to NONE, HIC001.hic to whatever the name of your file is, 5 to whatever the first chromosome is, 12 to whatever the second chromosome is, FRAG to BP, and 500 to 1000000.

The output is in sparse matrix format (as described in my answer above), so you’ll need to process it into BEDPE format. This can be done using command-line tools (some options outlined above) or handled in Python too.

ADD REPLY

Login before adding your answer.

Traffic: 2090 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6