kalavattam, 16 months ago
Let's say a ChIP-seq sample is scaled in the following way:
- Calculate the percent spike-in reads in the immunoprecipitate (IP)
- Calculate the percent spike-in reads in the input
- Calculate the scaling factor by dividing the input spike-in percentage by the IP spike-in percentage, i.e.,
scaling factor = input spike-in % ÷ IP spike-in %
- Normalize all scaling factors for the samples under comparison by dividing by the highest or lowest scaling factor, depending on the downstream application.
- Apply the scaling factor using, for example, bamCoverage:

bamCoverage \
-b "${bam}" \
--binSize "${bin_size}" \
--scaleFactor ${scaling_factor} \
-o "${bigwig}"
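The arithmetic in the steps above can be sketched in Python; the read counts below are hypothetical and purely illustrative:

```python
# Sketch of the spike-in scaling-factor steps above (hypothetical counts).
def spike_in_scaling_factor(ip_spike, ip_total, input_spike, input_total):
    """scaling factor = input spike-in % / IP spike-in %"""
    ip_pct = 100.0 * ip_spike / ip_total
    input_pct = 100.0 * input_spike / input_total
    return input_pct / ip_pct

# Per-sample factors, then normalization by the largest factor across the
# samples under comparison (use min() instead for other downstream needs).
factors = {
    "ctrl": spike_in_scaling_factor(50_000, 1_000_000, 20_000, 1_000_000),  # 2% / 5%   = 0.4
    "expt": spike_in_scaling_factor(25_000, 1_000_000, 20_000, 1_000_000),  # 2% / 2.5% = 0.8
}
largest = max(factors.values())
normalized = {name: f / largest for name, f in factors.items()}
# normalized == {"ctrl": 0.5, "expt": 1.0}
```

Each normalized value would then be passed to bamCoverage via --scaleFactor.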
Is it valid to do a coverage normalization in addition to the application of the scaling factor?
# Say, for example, RPKM (reads per kilobase per million mapped) normalization...
bamCoverage \
-b "${bam}" \
--binSize "${bin_size}" \
--scaleFactor ${scaling_factor} \
--normalizeUsing RPKM \
-o "${bigwig}"
# Or perhaps BPM (bins per million) normalization (same as TPM in RNA-seq)...
bamCoverage \
-b "${bam}" \
--binSize "${bin_size}" \
--scaleFactor ${scaling_factor} \
--normalizeUsing BPM \
-o "${bigwig}"
jared.andrews07 — Thank you for your input and for taking the time to review the code.
Based on my analysis of the deepTools code, I believe that, when invoking deepTools' bamCoverage from the command line, any custom scaling factor assigned through --scaleFactor ${value} (where the variable value is a user-provided positive integer or float) is recognized by the function get_scale_factor() (which is called by bamCoverage.py) and thus included in any subsequent RPKM (or RPGC, CPM, or BPM) calculation. This is because all command-line argument values, including ${value} from --scaleFactor, are assigned to the object args in bamCoverage.py; the object args is then passed to the get_scale_factor() function as an argument, meaning that args.scaleFactor is accessible to code within get_scale_factor().
Within getScaleFactor.py, we see that the scale_factor is multiplied by 1.0 / (million_reads_mapped * tile_len_in_kb), making a new "RPKM-transformed" scaling factor. Back in bamCoverage.py, this updated value for scale_factor is stored and made available to subsequent code. Ultimately, the scaling factor is applied to the coverage/signal and written out to a bedGraph or bigWig via the function writeBedGraph.scaleCoverage.
Anyway, I hope I'm not coming off as pedantic with all of this. The reason I bring this up is that I work with researchers who have historically called bamCoverage with both a spike-in --scaleFactor and an additional --normalizeUsing normalization, as in the examples above. This seems weird to me, and I want to check my understanding with people who have knowledge of and experience with calculating and plotting "coverage" for ChIP-seq (and other NGS) assays.
The reason this seems weird to me is that, in our specific analyses, we have calculated scaling factors based on spike-in reads. We use spike-ins because we expect genome-wide shifts in coverage between our control and experimental "models"; these shifts would not really be apparent without spike-in normalization.
Also, if it is valid to apply an additional normalization, I recognize that B/TPM normalization should be favored over R/FPKM normalization since, per this and other papers, R/FPKM values are not directly comparable across samples, whereas B/TPM values are.
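A small worked example (hypothetical counts and feature lengths) of the usual argument: TPM/BPM values sum to the same total in every sample, so they are comparable across samples, whereas RPKM totals vary with library composition:

```python
# Hypothetical counts illustrating why B/TPM is favored over R/FPKM.
def rpkm_values(counts, lengths_kb):
    # reads per kilobase per million mapped: divide by library size first
    per_million = sum(counts) / 1e6
    return [c / per_million / l for c, l in zip(counts, lengths_kb)]

def tpm_values(counts, lengths_kb):
    # divide by length first, then rescale so each sample sums to one million
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    return [r / sum(rates) * 1e6 for r in rates]

a = [300, 500, 200]        # sample A read counts (hypothetical)
b = [600, 500, 900]        # sample B, different composition
lengths = [2.0, 4.0, 1.0]  # feature lengths in kb

# sum(tpm_values(a, lengths)) == sum(tpm_values(b, lengths)) == 1e6,
# while sum(rpkm_values(a, lengths)) != sum(rpkm_values(b, lengths))
```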
If you made it this far, thanks for reading it all.
Very good digging. As for your questions:
1. I actually don't know. I may have to try this to see how it impacts things. Or if you do, I'd be super interested in the results.
2. I would think so, and the DiffBind vignette (section 7.6) seems to support this.
3. I'll see if I can get the deeptools dev to weigh in; I'd like to hear his thoughts on this.
Thank you. I agree that, for the purpose of illustration, it's a good idea to post some examples of coverage plots with (1) no normalization, (2) spike-in normalization, (3) both spike-in and RPKM normalization. I'll do that soon.
This was written after the post from Devon Ryan, so we know that applying both spike-in and RPKM normalization is an invalid approach. To accompany that post—and for anyone who has this question and reads this post in the future—it'll be good to show why.