I always downsample my ChIP-seq BAM files to match the file with the lowest number of reads before I do any peak calling. My question is: what happens when you want to compare your data to publicly available data that has much, much lower coverage? I usually get about 60 million unique reads, and there's a dataset I'd really like to compare my data to (it's in a different cell line and I want to see if the distribution of peaks is different), but it only has about 17 million reads. I'm hesitant to downsample my own data by that much, but I imagine "upsampling" their data would only lead to a bunch of false positives... Does anyone know what the convention is for this kind of problem?
Thanks in advance!
Comparisons across batch effects are problematic for a variety of reasons. What is the exact comparison you're trying to make? Hopefully you're not trying to use a published sample from someone else as a control for a comparison; that's a recipe for problems.
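
For reference, here is a rough sketch of the arithmetic behind the downsampling described in the question: the fraction of reads each sample would keep in order to match the smallest one. The function name `downsample_fractions` and the sample labels are made up for illustration, and the read counts are the approximate figures from the question. It only computes the fractions; it doesn't touch any BAM files and says nothing about whether downsampling is the right call for this comparison.

```python
def downsample_fractions(read_counts):
    """Return, per sample, the fraction of reads to keep so that
    every sample ends up at the depth of the smallest one."""
    if not read_counts:
        raise ValueError("read_counts is empty")
    if any(n <= 0 for n in read_counts.values()):
        raise ValueError("read counts must be positive")
    target = min(read_counts.values())
    return {name: target / n for name, n in read_counts.items()}


# Approximate unique-read counts from the question (hypothetical labels).
counts = {"mine": 60_000_000, "public": 17_000_000}
fractions = downsample_fractions(counts)

for name, frac in fractions.items():
    print(f"{name}: keep {frac:.3f} of reads")
```

So matching the public dataset would mean keeping a bit under 30% of the 60 million reads. A fraction like this is what you'd feed to a subsampling tool, for example the `-s` option of `samtools view`, which takes a seed as the integer part and the fraction as the decimal part.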