1)What is the correct matrix format before converting to BEDPE?
The format you're getting is correct. See below for details.
2)Should the output include chromosome names, or do I need to add them manually?
You need to add them manually. See below for an example of how to do this.
3)Are there any additional steps I should take to format the data correctly for BEDPE conversion?
Yeah. There are a bunch of ways to do this. See below for an example.
The Juicer Tools (juicer_tools_2.13.06.jar
) dump
command outputs observed matrices in sparse matrix format with three columns:
start1 start2 contact_frequency
So, your output is expected and correctly formatted.
Let's break down your call to Juicer Tools dump
:
java -jar juicer_tools_2.13.06.jar dump \
observed NONE contact_matrix_GRCh38.hic \
1 2 BP 1000000 \
chr1_chr2.matrix
You specified the following:
- Observed contacts:
observed
- Unadjusted (non-normalized) contacts:
NONE
- Interchromosomal contacts between chromosomes 1 and 2:
1 2
- Binned at 1,000,000-bp resolution:
BP 1000000
- Input file:
contact_matrix_GRCh38.hic
- Output file:
chr1_chr2.matrix
The first four lines of output look like this:
0 0 13.0
1000000 0 79.0
2000000 0 99.0
3000000 0 153.0
...
Each row represents a contact between two bins:
- Column 1 (
start1
): Start position of a bin in chromosome 1
- Column 2 (
start2
): Start position of a bin in chromosome 2
- Column 3 (
contact_frequency
): Number of Hi-C contacts (unadjusted because NONE
) between these two bins
Using this info, you can convert the sparse matrix into a BEDPE file with AWK; for example:
hic="contact_matrix_GRCh38.hic"
chr_a="1"
chr_b="2"
bin=1000000
java -jar juicer_tools.jar dump \
observed NONE "${hic}" \
"${chr_a}" "${chr_b}" BP "${bin}" \
"${chr_a}_${chr_b}.matrix"
awk \
-v chr_a="${chr_a}" \
-v chr_b="${chr_b}" \
-v bin="${bin}" \
'BEGIN { OFS="\t" } {
start_1 = $1;
end1 = start_1 + bin;
start_2 = $2;
end2 = start_2 + bin;
score = $3;
print chr_a, start_1, end1, chr_b, start_2, end2, ".", score, ".", "."
}' \
"${chr_a}_${chr_b}.matrix" \
> "${chr_a}_${chr_b}.bedpe"
The above code uses the chromosome and bin information to fill in and wrangle the sparse matrix data, generating a BEDPE file with the following format:
chrom_a start_a end_a chrom_b start_b end_b name score strand_a strand_b
It is standard to assign .
(periods) to name
, strand_a
, and strand_b
when those column are unused. See here for more details on the BEDPE format.
If you want to generate BEDPE files for all interchromosomal pairs while avoiding redundant ones (e.g., skipping 2
by 1
since 1
by 2
is already processed), you can do this through Shell scripting; for example:
function generate_sparse() {
local jar="${1}"
local cntc="${2}"
local nrm="${3}"
local hic="${4}"
local chr_a="${5}"
local chr_b="${6}"
local typ="${7}"
local bin="${8}"
local mat="${9}"
java -jar "${jar}" dump \
"${cntc}" "${nrm}" "${hic}" \
"${chr_a}" "${chr_b}" "${typ}" "${bin}" \
"${mat}"
}
function convert_sparse_bedpe() {
local chr_a="${1}"
local chr_b="${2}"
local bin="${3}"
local mat="${4}"
local bedpe="${5}"
awk \
-v chr_a="${chr_a}" \
-v chr_b="${chr_b}" \
-v bin="${bin}" \
'BEGIN { OFS="\t" } {
start_1 = $1;
end1 = start_1 + bin;
start_2 = $2;
end2 = start_2 + bin;
score = $3;
print chr_a, start_1, end1, chr_b, start_2, end2, ".", score, ".", "."
}' \
"${mat}" \
> "${bedpe}"
}
function check_file() {
local fil="${1}"
local nam="${2}"
if [[ ! -s "${fil}" ]]; then
echo "Error: ${nam} file was not generated: '${fil}'." >&2
return 1
fi
}
jar="${HOME}/path/to/juicer_tools_2.13.06.jar"
hic="${HOME}/path/to/contact_matrix_GRCh38.hic"
cntc="observed"
nrm="NONE"
typ="BP"
bin=1000000
out_mat="${HOME}/path/to/output/matrices"
out_bdp="${HOME}/path/to/output/bedpes"
chrs=( $(seq 1 22) X )
mkdir -p "${out_mat}" "${out_bdp}"
for ((i = 0; i < ${#chrs[@]}; i++)); do
for ((j = i + 1; j < ${#chrs[@]}; j++)); do
chr_a="${chrs[i]}"
chr_b="${chrs[j]}"
fil_mat="${out_mat}/${chr_a}_${chr_b}.matrix"
fil_bdp="${out_bdp}/${chr_a}_${chr_b}.bedpe"
generate_sparse \
"${jar}" "${cntc}" "${nrm}" "${hic}" \
"${chr_a}" "${chr_b}" "${typ}" "${bin}" \
"${fil_mat}"
check_file "${fil_mat}" "MATRIX"
convert_sparse_bedpe \
"${chr_a}" "${chr_b}" "${bin}" \
"${fil_mat}" "${fil_bdp}"
check_file "${fil_bdp}" "BEDPE"
done
done
To speed things up, you can use GNU Parallel to distribute execution across multiple CPU cores. Note that the below _does not_ avoid redundant pairs (e.g., 1
by 2
and 2
by 1
will both be processed).
export -f generate_sparse
export -f convert_sparse_bedpe
job=4
parallel --jobs "${job}" \
'generate_sparse {1} {2} {3} {4} {5} {6} {7} {8} {9}/{5}_{6}.matrix'
::: "${jar}" \
::: "${cntc}" \
::: "${nrm}" \
::: "${hic}" \
::: "${chrs[@]}" \
:::+ "${chrs[@]}" \
::: "${typ}" \
::: "${bin}" \
::: "${out_mat}"
parallel --jobs "${job}" \
'convert_sparse_bedpe {1} {2} {3} {4}/{1}_{2}.matrix {5}/{1}_{2}.bedpe' \
::: "${chrs[@]}" \
:::+ "${chrs[@]}" \
::: "${bin}" \
::: "${out_mat}" \
::: "${out_bdp}"
helpful I meant **