I am experimenting with using edgeR on some ChIP-Seq data to discover regions that are differentially bound between experimental conditions. For now, my regions of interest are annotated gene starts from UCSC +/- 1000 bp, though later I will use a peak caller on the data to define regions of interest. The input to edgeR is a matrix of read counts, with each column being a different sample and each row being a different genomic feature. These counts should be totally un-normalized.
I want to do background subtraction on these counts using the ChIP-Seq input samples that I have, and then feed the background-subtracted counts to edgeR (replacing negatives with zeros). However, naturally the library sizes differ between each ChIP and its corresponding input sample, so in order to do subtraction I would have to normalize one to the other. The most straightforward way I can think of is to normalize each input to have the same total number of mapped reads as the corresponding ChIP and then subtract the input counts from the ChIP counts for each feature. However, I suspect that in this case I would effectively be overstating the precision of the counts that I provide to edgeR, since the subtracted counts are subject to the variation in both the ChIP and the control samples. Furthermore, the counts are not really un-normalized any longer, since they are the subtraction of a normalized control count from a raw ChIP count. Is it still appropriate to feed these count data into edgeR to search for differential binding? Is there a better way to do background subtraction in the context of differential binding when using tools like edgeR that take raw counts as input?
EdgeR is designed to deal with transcriptome data, for ChiP-seq there are other Bioconductor packages, from the website:
This isn't true; it's generally for count data, which can arise in multiple experimental designs. To quote from their paper: "edgeR may also be useful in other experiments that generate counts, such as ChIP-seq, in proteomics experiments where spectral counts are used to summarize the peptide abundance, or in barcoding experiments where several species are counted."
you're right, they mentioned it in the paper and the vignette, but they only provide examples for transcriptome data