For a project involving ChIP-seq data, I am working to implement the ENCODE consortium data standards for library complexity. One metric of interest is M_DISTINCT, which is defined as the "number of distinct genomic locations to which some read maps uniquely."
I have thought of several methods to calculate M_DISTINCT and seek advice on their validity or suggestions for improvement.
Here's the script I used, which is also a minimal reproducible example that others can try. It requires samtools
and awk
. I have provided a small (~22 MB) BAM file of paired-end alignments via a shared link. This BAM file has been filtered with samtools view -f 2
to ensure all alignments are paired.
#!/bin/bash
# Define functions -----------------------------------------------------------
# Function to process samtools view output and count unique positions via the
#+ creation of a "position ID" from the following SAM fields: RNAME ($3), POS
#+ ($4), and PNEXT ($8; for paired-end alignments) or RNEXT ($7; for single-end
#+ alignments)
function sort_count_by_pos() {
awk '{
if ($7 == "=") {
# Handle paired-end alignments
if ($4 < $8) {
pos = $3"_"$4"_"($8 - $4 + 1)
} else {
pos = $3"_"$8"_"($4 - $8 + 1)
}
} else {
# Handle single-end alignments
pos = $3"_"$4"_"$7
}
print pos
}' \
| sort -u \
| wc -l \
| sed 's: ::g'
}
# Function to evaluate fragments from paired-end alignments via the creation
#+ of a "fragment ID" from the following SAM fields: RNAME ($3), POS ($4), and
#+ PNEXT ($8), and TLEN ($9)
function sort_count_by_frag() {
awk '{
if ($9 > 0) {
frag = $3"_"$4"_"$8"_"$9
} else {
frag = $3"_"$8"_"$4"_"(-$9)
}
print frag
}' \
| sort -u \
| wc -l \
| sed 's: ::g'
}
# Assign variables -----------------------------------------------------------
thr=${SLURM_CPUS_ON_NODE:-4}
bam="in_Q_Hho1_6336.sort-coord.SP.bam"
# Do the main work -----------------------------------------------------------
# Get situated
if [[ ! -d biostars ]]; then
mkdir biostars
fi
cd biostars || echo "cd'ing failed; check on this"
# Get the file
if [[ ! -f "${bam}" ]]; then
curl \
-L \
-o "${bam}" \
"https://dl.dropboxusercontent.com/scl/fi/6y0pohaprzvppmxigsvr1/${bam}?rlkey=614d6nrivrdz427vhku5nu6ow&st=xwcv1z42&dl=0"
fi
# Calculate the tallies with different methods
MD_all_c=$(samtools view -@ "${thr}" "${bam}" -c)
MD_all_pos=$(samtools view -@ "${thr}" "${bam}"| sort_count_by_pos)
MD_F64_pos=$(samtools view -@ "${thr}" -F 64 "${bam}"| sort_count_by_pos)
MD_F128_pos=$(samtools view -@ "${thr}" -F 128 "${bam}"| sort_count_by_pos)
MD_F1024_c=$(samtools view -@ "${thr}" -F 1024 "${bam}" -c)
MD_F1024_pos=$(samtools view -@ "${thr}" -F 1024 "${bam}" | sort_count_by_pos)
MD_F1088_pos=$(samtools view -@ "${thr}" -F 1088 "${bam}" | sort_count_by_pos)
MD_F1152_pos=$(samtools view -@ "${thr}" -F 1152 "${bam}"| sort_count_by_pos)
MD_all_frag=$(samtools view -@ "${thr}" "${bam}" | sort_count_by_frag)
MD_F64_frag=$(samtools view -@ "${thr}" -F64 "${bam}" | sort_count_by_frag)
MD_F128_frag=$(samtools view -@ "${thr}" -F128 "${bam}" | sort_count_by_frag)
# View the tallies
echo "
MD_all_c=${MD_all_c} # samtools view -c
MD_all_pos=${MD_all_pos} # samtools view | sort_count_by_pos
MD_F64_pos=${MD_F128_pos} # samtools view -F 64 | sort_count_by_pos
MD_F128_pos=${MD_F128_pos} # samtools view -F 128 | sort_count_by_pos
MD_F1024_c=${MD_F1024_c} # samtools view -F 1024 -c
MD_F1024_pos=${MD_F1024_pos} # samtools view -F 1024 | sort_count_by_pos
MD_F1088_pos=${MD_F1152_pos} # samtools view -F 1088 | sort_count_by_pos # 1088 = 64 + 1024
MD_F1152_pos=${MD_F1152_pos} # samtools view -F 1152 | sort_count_by_pos # 1152 = 128 + 1024
MD_all_frag=${MD_all_frag} # samtools view | sort_count_by_frag
MD_F64_frag=${MD_F64_frag} # samtools view -F 64 | sort_count_by_frag
MD_F128_frag=${MD_F128_frag} # samtools view -F 128 | sort_count_by_frag
"
Here are the results from running the code:
MD_all_c=568390 # samtools view -c
MD_all_pos=213261 # samtools view | sort_count_by_pos
MD_F64_pos=213261 # samtools view -F 64 | sort_count_by_pos
MD_F128_pos=213261 # samtools view -F 128 | sort_count_by_pos
MD_F1024_c=415184 # samtools view -F 1024 -c
MD_F1024_pos=207565 # samtools view -F 1024 | sort_count_by_pos
MD_F1088_pos=207565 # samtools view -F 1088 | sort_count_by_pos # 1088 = 64 + 1024
MD_F1152_pos=207565 # samtools view -F 1152 | sort_count_by_pos # 1152 = 128 + 1024
MD_all_frag=213292 # samtools view | sort_count_by_frag
MD_F64_frag=213292 # samtools view -F 64 | sort_count_by_frag
MD_F128_frag=213292 # samtools view -F 128 | sort_count_by_frag
General question: Do you have advice on the validity of these methods to calculate M_DISTINCT, the "number of distinct genomic locations to which some read maps uniquely"? Or do you have suggestions for improvement?
Additional issue: The tally value decreases when alignments marked as duplicates by samtools markdup
are excluded from the BAM file—e.g., via the calculation of MD_F1024_pos
. My expectation was that the value would be the same as MD_all_pos
because the logic in the function sort_count_by_pos
should exclude duplicate alignments with the same positional IDs as their non-duplicate counterparts.
Additional question about issue: Why does the tally value decrease when excluding duplicate alignments, and which method is most accurate for calculating M_DISTINCT?
Any insights or suggestions for alternative approaches would be greatly appreciated!
Thank you!
P.S. I tested the minimal reproducible example on a Mac with both zsh
and bash
, and on a Linux system with bash
. To save space, I've cut the program version information. Please let me know if you want to see it; I will post it to a comment.