Seeking advice on methods to calculate ENCODE ChIP-seq data standard M_DISTINCT
1
0
Entering edit mode
3 months ago
kalavattam ▴ 280

For a project involving ChIP-seq data, I am working to implement the ENCODE consortium data standards for library complexity. One metric of interest is M_DISTINCT, which is defined as the "number of distinct genomic locations to which some read maps uniquely."

I have thought of several methods to calculate M_DISTINCT and seek advice on their validity or suggestions for improvement.

Here's the script I used, which is also a minimal reproducible example that others can try. It requires samtools and awk. I have provided a small (~22 MB) BAM file of paired-end alignments via a shared link. This BAM file has been filtered with samtools view -f 2 to ensure all alignments are paired.

#!/bin/bash

#  Define functions -----------------------------------------------------------
#  Function to process samtools view output and count unique positions via the
#+ creation of a "position ID" from the following SAM fields: RNAME ($3), POS
#+ ($4), and PNEXT ($8; for paired-end alignments) or RNEXT ($7; for single-end
#+ alignments)
function sort_count_by_pos() {
    awk '{
            if ($7 == "=") {
                #  Handle paired-end alignments
                if ($4 < $8) {
                    pos = $3"_"$4"_"($8 - $4 + 1)
                } else {
                    pos = $3"_"$8"_"($4 - $8 + 1)
                }
            } else {
                #  Handle single-end alignments
                pos = $3"_"$4"_"$7
            }
            print pos
        }' \
    | sort -u \
    | wc -l \
    | sed 's: ::g'
}


#  Function to evaluate fragments from paired-end alignments via the creation 
#+ of a "fragment ID" from the following SAM fields: RNAME ($3), POS ($4), and
#+ PNEXT ($8), and TLEN ($9)
function sort_count_by_frag() {
    awk '{
            if ($9 > 0) {
                frag = $3"_"$4"_"$8"_"$9
            } else {
                frag = $3"_"$8"_"$4"_"(-$9)
            }
            print frag
        }' \
    | sort -u \
    | wc -l \
    | sed 's: ::g'
}


#  Assign variables -----------------------------------------------------------
thr=${SLURM_CPUS_ON_NODE:-4}
bam="in_Q_Hho1_6336.sort-coord.SP.bam"


#  Do the main work -----------------------------------------------------------
#  Get situated
if [[ ! -d biostars ]]; then
    mkdir biostars
fi

cd biostars || echo "cd'ing failed; check on this"

#  Get the file
if [[ ! -f "${bam}" ]]; then
    curl \
        -L \
        -o "${bam}" \
        "https://dl.dropboxusercontent.com/scl/fi/6y0pohaprzvppmxigsvr1/${bam}?rlkey=614d6nrivrdz427vhku5nu6ow&st=xwcv1z42&dl=0"
fi

#  Calculate the tallies with different methods
MD_all_c=$(samtools view -@ "${thr}" "${bam}" -c)
MD_all_pos=$(samtools view -@ "${thr}" "${bam}"| sort_count_by_pos)
MD_F64_pos=$(samtools view -@ "${thr}" -F 64 "${bam}"| sort_count_by_pos)
MD_F128_pos=$(samtools view -@ "${thr}" -F 128 "${bam}"| sort_count_by_pos)

MD_F1024_c=$(samtools view -@ "${thr}" -F 1024 "${bam}" -c)
MD_F1024_pos=$(samtools view -@ "${thr}" -F 1024 "${bam}" | sort_count_by_pos)
MD_F1088_pos=$(samtools view -@ "${thr}" -F 1088 "${bam}" | sort_count_by_pos)
MD_F1152_pos=$(samtools view -@ "${thr}" -F 1152 "${bam}"| sort_count_by_pos)

MD_all_frag=$(samtools view -@ "${thr}" "${bam}" | sort_count_by_frag)
MD_F64_frag=$(samtools view -@ "${thr}" -F64 "${bam}" | sort_count_by_frag)
MD_F128_frag=$(samtools view -@ "${thr}" -F128 "${bam}" | sort_count_by_frag)

#  View the tallies
echo "
MD_all_c=${MD_all_c}       # samtools view -c
MD_all_pos=${MD_all_pos}     # samtools view | sort_count_by_pos
MD_F64_pos=${MD_F128_pos}     # samtools view -F 64 | sort_count_by_pos
MD_F128_pos=${MD_F128_pos}    # samtools view -F 128 | sort_count_by_pos

MD_F1024_c=${MD_F1024_c}     # samtools view -F 1024 -c
MD_F1024_pos=${MD_F1024_pos}   # samtools view -F 1024 | sort_count_by_pos
MD_F1088_pos=${MD_F1152_pos}   # samtools view -F 1088 | sort_count_by_pos  # 1088 = 64 + 1024
MD_F1152_pos=${MD_F1152_pos}   # samtools view -F 1152 | sort_count_by_pos  # 1152 = 128 + 1024

MD_all_frag=${MD_all_frag}    # samtools view | sort_count_by_frag
MD_F64_frag=${MD_F64_frag}    # samtools view -F 64 | sort_count_by_frag
MD_F128_frag=${MD_F128_frag}   # samtools view -F 128 | sort_count_by_frag
"

Here are the results from running the code:

MD_all_c=568390       # samtools view -c
MD_all_pos=213261     # samtools view | sort_count_by_pos
MD_F64_pos=213261     # samtools view -F 64 | sort_count_by_pos
MD_F128_pos=213261    # samtools view -F 128 | sort_count_by_pos

MD_F1024_c=415184     # samtools view -F 1024 -c
MD_F1024_pos=207565   # samtools view -F 1024 | sort_count_by_pos
MD_F1088_pos=207565   # samtools view -F 1088 | sort_count_by_pos  # 1088 = 64 + 1024
MD_F1152_pos=207565   # samtools view -F 1152 | sort_count_by_pos  # 1152 = 128 + 1024

MD_all_frag=213292    # samtools view | sort_count_by_frag
MD_F64_frag=213292    # samtools view -F 64 | sort_count_by_frag
MD_F128_frag=213292   # samtools view -F 128 | sort_count_by_frag

General question: Do you have advice on the validity of these methods to calculate M_DISTINCT, the "number of distinct genomic locations to which some read maps uniquely"? Or do you have suggestions for improvement?

Additional issue: The tally value decreases when alignments marked as duplicates by samtools markdup are excluded from the BAM file—e.g., via the calculation of MD_F1024_pos. My expectation was that the value would be the same as MD_all_pos because the logic in the function sort_count_by_pos should exclude duplicate alignments with the same positional IDs as their non-duplicate counterparts.

Additional question about issue: Why does the tally value decrease when excluding duplicate alignments, and which method is most accurate for calculating M_DISTINCT?

Any insights or suggestions for alternative approaches would be greatly appreciated!

Thank you!

P.S. I tested the minimal reproducible example on a Mac with both zsh and bash, and on a Linux system with bash. To save space, I've cut the program version information. Please let me know if you want to see it; I will post it to a comment.

ChIP-seq quality-check ENCODE library-complexity • 404 views
ADD COMMENT
1
Entering edit mode
12 weeks ago
kalavattam ▴ 280

Short answer


In ENCODE NGS processing pipelines, the M_DISTINCT value is derived from the "DISTINCT READS" output of preseq lc_extrap when run in verbose mode.

Long answer with additional information and context


In the ENCODE ChIP-seq pipeline, metrics such as M_DISTINCT are calculated using the program Preseq, which—among other things—estimates the number of unique reads expected from sequencing an NGS library to a certain depth. In particular, by running preseq lc_extrap in verbose mode, one can extract M_DISTINCT from the "DISTINCT READS" output. Additional important preseq lc_extrap metrics include the following:

  • M_1: Extracted from "COUNTS OF 1".
  • M_2: Extracted from "OBSERVED COUNTS" for 2.
  • Total number of reads: Extracted from "TOTAL READS".

These values are used to calculate the following ENCODE ChIP-seq quality metrics:

  • PCR Bottlenecking Coefficient 1 (PBC1): PBC1 = M_1 ÷ M_DISTINCT = "COUNTS OF 1" ÷ "DISTINCT READS"
  • PCR Bottlenecking Coefficient 2 (PBC2): PBC2 = M_1 ÷ M_2 = "COUNTS OF 1" ÷ "OBSERVED COUNTS" for 2
  • Non-Redundant Fraction (NRF): NRF = The number of distinct uniquely mapping reads (i.e., the number after removing duplicates) ÷ the total number of reads = "DISTINCT READS" ÷ "TOTAL READS"

On a related note, the questions I ask above require additional research, because while preseq lc_extrap provides one approach to calculate M_DISTINCT, it is not the only method to estimate the number of distinct genomic locations to which reads uniquely align. The process of determining unique alignments is complicated—especially for paired-end sequenced short reads—and can vary depending on the method used (e.g., the samtools markdup algorithm).

(See here and here for relevant ENCODE code examples.)

ADD COMMENT

Login before adding your answer.

Traffic: 1682 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6