Dear Biostars community,
I have assigned the reads in my IP.bam and Input.bam to gene coordinates of interest provided in a bed file. I have done this using bedtools coverageBED.
Now, I have the normalized read counts per specified gene bin for IP and Input. I would like to extract those genes which are enriched in my IP by calculating a log2ratio of reads_in_IP/reads_in_Input.
But the problem is that some gene coordinates (or gene bins) have zero reads assigned in the input, which makes it not possible to calculate a log2 ratio.
if I add a pseudo count of +1, this will significantly change the value of my log2ratios. then, if I were to use a threshold log2ratio = 1.5 to filter IP targeted genes, it will also extract "fake" genes that are only being selected because of the pseudo count.
for example, if for one gene bin, IP=7 reads and Input=0 reads, if I add pseudo count of 1 to both, I get, IP=8 reads, and Input=1 read. then log2ratio=3. given a threshold for enrichment as log2ratio=1.5, this bin would be selected as targeted in IP, when actually it might not be!
How can I correct for this????
I realize i will encounter this problem even if I use another method to assign the reads (such as bedinteresect, BEDOPS or samtools)
Can someone please advice me how to solve this?
Many thanks for your help.
This problem has been around since the days of microarrays. Consider these threads for inspiration.
A: Statistical question (Deseq, Cuffdis) when one condition is zero?
Dealing With Null Expression Values
From a purely mathematical point of view, you could approximate the log with the inverse hyperbolic sine (asinh) function e.g.