Question

Seeking advice on methods to calculate ENCODE ChIP-seq data standard M_DISTINCT

0

Entering edit mode

4 months ago

kalavattam ▴ 280

For a project involving ChIP-seq data, I am working to implement the ENCODE consortium data standards for library complexity. One metric of interest is M_DISTINCT, which is defined as the "number of distinct genomic locations to which some read maps uniquely."

I have thought of several methods to calculate M_DISTINCT and seek advice on their validity or suggestions for improvement.

Here's the script I used, which is also a minimal reproducible example that others can try. It requires samtools and awk. I have provided a small (~22 MB) BAM file of paired-end alignments via a shared link. This BAM file has been filtered with samtools view -f 2 to ensure all alignments are paired.

#!/bin/bash

#  Define functions -----------------------------------------------------------
#  Function to process samtools view output and count unique positions via the
#+ creation of a "position ID" from the following SAM fields: RNAME ($3), POS
#+ ($4), and PNEXT ($8; for paired-end alignments) or RNEXT ($7; for single-end
#+ alignments)
function sort_count_by_pos() {
    awk '{
            if ($7 == "=") {
                #  Handle paired-end alignments
                if ($4 < $8) {
                    pos = $3"_"$4"_"($8 - $4 + 1)
                } else {
                    pos = $3"_"$8"_"($4 - $8 + 1)
                }
            } else {
                #  Handle single-end alignments
                pos = $3"_"$4"_"$7
            }
            print pos
        }' \
    | sort -u \
    | wc -l \
    | sed 's: ::g'
}


#  Function to evaluate fragments from paired-end alignments via the creation 
#+ of a "fragment ID" from the following SAM fields: RNAME ($3), POS ($4), and
#+ PNEXT ($8), and TLEN ($9)
function sort_count_by_frag() {
    awk '{
            if ($9 > 0) {
                frag = $3"_"$4"_"$8"_"$9
            } else {
                frag = $3"_"$8"_"$4"_"(-$9)
            }
            print frag
        }' \
    | sort -u \
    | wc -l \
    | sed 's: ::g'
}


#  Assign variables -----------------------------------------------------------
thr=${SLURM_CPUS_ON_NODE:-4}
bam="in_Q_Hho1_6336.sort-coord.SP.bam"


#  Do the main work -----------------------------------------------------------
#  Get situated
if [[ ! -d biostars ]]; then
    mkdir biostars
fi

cd biostars || echo "cd'ing failed; check on this"

#  Get the file
if [[ ! -f "${bam}" ]]; then
    curl \
        -L \
        -o "${bam}" \
        "https://dl.dropboxusercontent.com/scl/fi/6y0pohaprzvppmxigsvr1/${bam}?rlkey=614d6nrivrdz427vhku5nu6ow&st=xwcv1z42&dl=0"
fi

#  Calculate the tallies with different methods
MD_all_c=$(samtools view -@ "${thr}" "${bam}" -c)
MD_all_pos=$(samtools view -@ "${thr}" "${bam}"| sort_count_by_pos)
MD_F64_pos=$(samtools view -@ "${thr}" -F 64 "${bam}"| sort_count_by_pos)
MD_F128_pos=$(samtools view -@ "${thr}" -F 128 "${bam}"| sort_count_by_pos)

MD_F1024_c=$(samtools view -@ "${thr}" -F 1024 "${bam}" -c)
MD_F1024_pos=$(samtools view -@ "${thr}" -F 1024 "${bam}" | sort_count_by_pos)
MD_F1088_pos=$(samtools view -@ "${thr}" -F 1088 "${bam}" | sort_count_by_pos)
MD_F1152_pos=$(samtools view -@ "${thr}" -F 1152 "${bam}"| sort_count_by_pos)

MD_all_frag=$(samtools view -@ "${thr}" "${bam}" | sort_count_by_frag)
MD_F64_frag=$(samtools view -@ "${thr}" -F64 "${bam}" | sort_count_by_frag)
MD_F128_frag=$(samtools view -@ "${thr}" -F128 "${bam}" | sort_count_by_frag)

#  View the tallies
echo "
MD_all_c=${MD_all_c}       # samtools view -c
MD_all_pos=${MD_all_pos}     # samtools view | sort_count_by_pos
MD_F64_pos=${MD_F128_pos}     # samtools view -F 64 | sort_count_by_pos
MD_F128_pos=${MD_F128_pos}    # samtools view -F 128 | sort_count_by_pos

MD_F1024_c=${MD_F1024_c}     # samtools view -F 1024 -c
MD_F1024_pos=${MD_F1024_pos}   # samtools view -F 1024 | sort_count_by_pos
MD_F1088_pos=${MD_F1152_pos}   # samtools view -F 1088 | sort_count_by_pos  # 1088 = 64 + 1024
MD_F1152_pos=${MD_F1152_pos}   # samtools view -F 1152 | sort_count_by_pos  # 1152 = 128 + 1024

MD_all_frag=${MD_all_frag}    # samtools view | sort_count_by_frag
MD_F64_frag=${MD_F64_frag}    # samtools view -F 64 | sort_count_by_frag
MD_F128_frag=${MD_F128_frag}   # samtools view -F 128 | sort_count_by_frag
"

Here are the results from running the code:

MD_all_c=568390       # samtools view -c
MD_all_pos=213261     # samtools view | sort_count_by_pos
MD_F64_pos=213261     # samtools view -F 64 | sort_count_by_pos
MD_F128_pos=213261    # samtools view -F 128 | sort_count_by_pos

MD_F1024_c=415184     # samtools view -F 1024 -c
MD_F1024_pos=207565   # samtools view -F 1024 | sort_count_by_pos
MD_F1088_pos=207565   # samtools view -F 1088 | sort_count_by_pos  # 1088 = 64 + 1024
MD_F1152_pos=207565   # samtools view -F 1152 | sort_count_by_pos  # 1152 = 128 + 1024

MD_all_frag=213292    # samtools view | sort_count_by_frag
MD_F64_frag=213292    # samtools view -F 64 | sort_count_by_frag
MD_F128_frag=213292   # samtools view -F 128 | sort_count_by_frag

General question: Do you have advice on the validity of these methods to calculate M_DISTINCT, the "number of distinct genomic locations to which some read maps uniquely"? Or do you have suggestions for improvement?

Additional issue: The tally value decreases when alignments marked as duplicates by samtools markdup are excluded from the BAM file—e.g., via the calculation of MD_F1024_pos. My expectation was that the value would be the same as MD_all_pos because the logic in the function sort_count_by_pos should exclude duplicate alignments with the same positional IDs as their non-duplicate counterparts.

Additional question about issue: Why does the tally value decrease when excluding duplicate alignments, and which method is most accurate for calculating M_DISTINCT?

Any insights or suggestions for alternative approaches would be greatly appreciated!

Thank you!

P.S. I tested the minimal reproducible example on a Mac with both zsh and bash, and on a Linux system with bash. To save space, I've cut the program version information. Please let me know if you want to see it; I will post it to a comment.

ChIP-seq quality-check ENCODE library-complexity • 435 views

ADD COMMENT • link 3 months ago by kalavattam ▴ 280

score 1 · Accepted Answer · 2024-09-02

Short answer

In ENCODE NGS processing pipelines, the M_DISTINCT value is derived from the "DISTINCT READS" output of preseq lc_extrap when run in verbose mode.

Long answer with additional information and context

In the ENCODE ChIP-seq pipeline, metrics such as M_DISTINCT are calculated using the program Preseq, which—among other things—estimates the number of unique reads expected from sequencing an NGS library to a certain depth. In particular, by running preseq lc_extrap in verbose mode, one can extract M_DISTINCT from the "DISTINCT READS" output. Additional important preseq lc_extrap metrics include the following:

M_1: Extracted from "COUNTS OF 1".
M_2: Extracted from "OBSERVED COUNTS" for 2.
Total number of reads: Extracted from "TOTAL READS".

These values are used to calculate the following ENCODE ChIP-seq quality metrics:

PCR Bottlenecking Coefficient 1 (PBC1): PBC1 = M_1 ÷ M_DISTINCT = "COUNTS OF 1" ÷ "DISTINCT READS"
PCR Bottlenecking Coefficient 2 (PBC2): PBC2 = M_1 ÷ M_2 = "COUNTS OF 1" ÷ "OBSERVED COUNTS" for 2
Non-Redundant Fraction (NRF): NRF = The number of distinct uniquely mapping reads (i.e., the number after removing duplicates) ÷ the total number of reads = "DISTINCT READS" ÷ "TOTAL READS"

On a related note, the questions I ask above require additional research, because while preseq lc_extrap provides one approach to calculate M_DISTINCT, it is not the only method to estimate the number of distinct genomic locations to which reads uniquely align. The process of determining unique alignments is complicated—especially for paired-end sequenced short reads—and can vary depending on the method used (e.g., the samtools markdup algorithm).

(See here and here for relevant ENCODE code examples.)