Hello Biostars community,
I would be grateful for your ideas on a pipeline to correlate the reads of an RNA-seq dataset with those of a ChIP-seq dataset. I have read many posts suggesting that one can calculate the counts per million (CPM) per genomic bin for each dataset and correlate the two. But I have two questions regarding this approach:
1) How do you account for the large number of zero-read bins in the RNA-seq sample, which arise because RNA-seq signal is restricted to the expressed regions of the genome? These zero bins will be included in the correlation calculation.
2) An RNA-seq profile is only an enrichment profile (i.e. the coverage is always >= 0), whereas a ChIP-seq profile shows both enrichment and depletion. How do you take this into account when comparing the two and computing the correlation?
Many thanks in advance for your help and suggestions!
Good point. Zero bins are often (optionally) excluded by the tool that builds the count matrix and performs the correlation, e.g. deepTools.
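To make the CPM-per-bin idea concrete, here is a minimal sketch in Python with NumPy. It is not what deepTools does internally. The function names `cpm` and `binned_correlation`, the `skip_zeros` flag, the log2(x + 1) transform and the toy counts are all my own assumptions for illustration. Here `skip_zeros` drops only bins that are zero in both samples, which as far as I know is also how the `--skipZeros` option of deepTools `plotCorrelation` behaves, but please check the documentation for your version.

```python
import numpy as np

def cpm(counts):
    """Scale raw per-bin read counts to counts per million (CPM)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        raise ValueError("library has no reads in any bin")
    return counts / total * 1e6

def binned_correlation(rna_counts, chip_counts, skip_zeros=False):
    """Pearson correlation of log2(CPM + 1) over matching genomic bins.

    skip_zeros=True removes bins with zero reads in BOTH samples
    (an assumption of this sketch, not a fixed convention).
    Returns (r, number_of_bins_used).
    """
    # Normalise on the full library first, then filter bins.
    rna = cpm(rna_counts)
    chip = cpm(chip_counts)
    if skip_zeros:
        keep = (rna > 0) | (chip > 0)
        rna, chip = rna[keep], chip[keep]
    r = np.corrcoef(np.log2(rna + 1), np.log2(chip + 1))[0, 1]
    return float(r), int(rna.size)

# Toy counts, made up for illustration only.
rna_bins = [0, 0, 10, 20, 0, 5]
chip_bins = [0, 3, 8, 15, 0, 4]

r_all, n_all = binned_correlation(rna_bins, chip_bins)
r_skip, n_skip = binned_correlation(rna_bins, chip_bins, skip_zeros=True)
print(f"all bins:        r = {r_all:.3f} (n = {n_all})")
print(f"double-zero out: r = {r_skip:.3f} (n = {n_skip})")
```

Note that a bin with zero RNA-seq reads but non-zero ChIP-seq signal is kept under this rule, so only bins that carry no information in either sample are dropped.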
Yes, and if you get rid of those zero-read bins, you are essentially getting rid of any potential negative correlation with your ChIP-seq data that you were looking for in the first place. Or am I wrong on this point?
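A small sketch of the concern, using made-up per-bin values (not real data): imagine a repressive mark that is high exactly where the RNA-seq signal is zero. The helper name `pearson` and all numbers are hypothetical. Comparing the correlation over all bins with the correlation over expressed bins only shows how much of the anticorrelation sits in the zero bins.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical per-bin signal: the first four bins are silent in RNA-seq
# but carry high ChIP-seq signal (e.g. a repressive mark).
rna = np.array([0, 0, 0, 0, 5, 10, 20, 40], dtype=float)
chip = np.array([30, 25, 35, 28, 6, 5, 4, 5], dtype=float)

# Correlation over every bin, zero-RNA bins included.
r_all = pearson(rna, chip)

# Correlation after discarding every bin with zero RNA-seq reads.
expressed = rna > 0
r_expressed = pearson(rna[expressed], chip[expressed])

print(f"all bins:       r = {r_all:.2f}")
print(f"expressed only: r = {r_expressed:.2f}")
```

In this toy case the negative correlation is clearly weaker once the zero-RNA bins are removed, which is the effect described above. Whether that happens with real data depends on the mark and on which zero-filtering rule the tool applies.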