Let's assume that you have extracted read depths from BAM files (containing aligned reads from a DNA-seq experiment) for both samples of a normal/tumor pair, and that you have calculated the read-depth log-ratios simply as:
log-ratio_i = log_2(tumor_rd_i / normal_rd_i)

where tumor_rd_i and normal_rd_i are the tumor and normal read depths in the i-th window.
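For concreteness, here is a small sketch of how I would compute these per-window ratios with NumPy (the function name and the handling of zero-depth windows are my own assumptions, not something stated above):

```python
import numpy as np

def read_depth_log_ratios(tumor_rd, normal_rd):
    """Per-window log2 ratio of tumor to normal read depth.

    tumor_rd, normal_rd: 1-D arrays of read depths, one value per window.
    Windows where the normal depth is zero are returned as NaN so that
    downstream steps can skip them.
    """
    tumor_rd = np.asarray(tumor_rd, dtype=float)
    normal_rd = np.asarray(normal_rd, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratios = np.log2(tumor_rd / normal_rd)
    return np.where(normal_rd > 0, ratios, np.nan)
```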
The normalization procedure goes like this:
1. For each chromosome, you make a histogram of all its log-ratio values, find the bin with the highest frequency, and take the midpoint of that bin (call this value the chromosome's 'mode').
2. You then take the median of the modes across all chromosomes.
3. Finally, you subtract this value (the median of the modes) from every read-depth log-ratio.
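Here is a minimal sketch of my understanding of that procedure (the function name, the bin count `n_bins`, and the skipping of empty chromosomes are my own assumptions, not taken from the code I'm reading):

```python
import numpy as np

def mode_median_normalize(log_ratios, chromosomes, n_bins=100):
    """Centre log ratios by the median of per-chromosome histogram modes.

    log_ratios:  1-D array of per-window log2 ratios.
    chromosomes: array of the same length giving each window's chromosome.
    n_bins:      histogram bin count per chromosome (an assumption; the
                 original code may choose the binning differently).
    """
    log_ratios = np.asarray(log_ratios, dtype=float)
    chromosomes = np.asarray(chromosomes)

    modes = []
    for chrom in np.unique(chromosomes):
        values = log_ratios[(chromosomes == chrom) & np.isfinite(log_ratios)]
        if values.size == 0:
            continue
        counts, edges = np.histogram(values, bins=n_bins)
        top = np.argmax(counts)                          # bin with the highest frequency
        modes.append((edges[top] + edges[top + 1]) / 2)  # midpoint of that bin

    shift = np.median(modes)                             # median of the per-chromosome modes
    return log_ratios - shift                            # shift every window's log ratio
```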
I can imagine that this procedure shifts the individual values towards zero, something like performing a zero-mean normalization. What I cannot understand is why we need such a normalization in the first place for sequencing data, before, for instance, doing segmentation. Any explanation is much appreciated.
Could you please provide a reference? You say we "need to do such normalization"; where did you find this information?
I am reading some Python code, and what I have written here is a summary of what I have understood from it.