Hi all, I am currently using deepTools to process some ChIP-seq data and need a sanity check on how to interpret and handle the output, as I am still not sure after reading the docs. I have alignment files from multiple replicates of a ChIP-seq experiment targeting a histone mark. What I need to do is assign the reads to 100 bp bins covering the genome, and then get a normalized value for each bin representing the abundance of reads that map to it.
To do this, I am using deepTools bamCompare to take in the alignments of both replicates from a single experiment, compute the RPKM for each bin in each replicate, and then take the mean of these values to get a single RPKM value per bin.
The issue is that I don't get an output of contiguous 100 bp bins; instead, the bins seem to be merged. While I don't see it on the bamCompare docs page, I assume this is the same behavior as in bamCoverage, where "If consecutive bins have the same number of reads overlapping they are merged." What I'm wondering is: could I simply modify the output bedGraph (using something like awk) to split the "merged" bins into the size I need, giving each of them the RPKM value from the original merged bin? Or am I misunderstanding the effect of the merging on the calculation?
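To make the splitting idea concrete, here is a minimal sketch in Python rather than awk. It assumes the bedGraph has four tab-separated columns (chrom, start, end, value) and that a merged interval simply means every 100 bp bin inside it had the same value. The function names and the example intervals are made up for illustration; they are not deepTools output.

```python
def split_bedgraph_line(line, bin_size=100):
    """Split one bedGraph interval into fixed-size bins that keep its value."""
    chrom, start, end, value = line.rstrip("\n").split("\t")[:4]
    start, end = int(start), int(end)
    bins = []
    for bin_start in range(start, end, bin_size):
        # The last bin may be shorter, e.g. at the end of a chromosome.
        bin_end = min(bin_start + bin_size, end)
        bins.append((chrom, bin_start, bin_end, value))
    return bins


def split_bedgraph(lines, bin_size=100):
    """Yield fixed-size bins for every data line, skipping header lines."""
    for line in lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        yield from split_bedgraph_line(line, bin_size)


# Hypothetical merged intervals, only to show the behavior.
example = [
    "chr1\t0\t300\t1.5\n",
    "chr1\t300\t400\t0\n",
    "chr1\t400\t550\t2.25\n",
]
bins = list(split_bedgraph(example))
for chrom, bin_start, bin_end, value in bins:
    print(f"{chrom}\t{bin_start}\t{bin_end}\t{value}")
```

The value is kept as a string so it is written back exactly as it appeared in the input, with no floating-point reformatting.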
Alternatively, if there is a better way to accomplish what I'm trying to do, I'd welcome any advice - I'm very new to bioinformatics. I need the data in these bins as input for another model; I'm not doing any DB analysis or anything like that.
Thanks!
I will try this and update with results, thanks! Also, could you please explain the calculation of the scale factor? I'm having trouble understanding where the 8 comes from, and why idxstats is used. Thanks so much again!
The 8 is the number of digits that the calculated value is rounded to. You can use any number you want, but do not set it too low, to avoid larger rounding errors.
idxstats is simply used to quickly get the total number of reads in the BAM file. That is much faster than flagstat.
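As a rough illustration of the two points above, here is a small Python sketch. It assumes the scale factor is a simple per-million factor (1,000,000 divided by the total mapped reads); the exact formula from the earlier command is not shown in this thread, so treat that as an assumption. It parses text in the four-column format that samtools idxstats prints (reference name, sequence length, mapped reads, unmapped reads). The function names and the example numbers are made up.

```python
def total_mapped_reads(idxstats_text):
    """Sum the mapped-read column (third column) of samtools idxstats output."""
    total = 0
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        _name, _length, mapped, _unmapped = line.split("\t")
        total += int(mapped)
    return total


def per_million_scale_factor(total_reads, digits=8):
    """Assumed formula: 1e6 / total reads, rounded to `digits` decimals."""
    return round(1_000_000 / total_reads, digits)


# Made-up idxstats output for a two-chromosome BAM file.
example_idxstats = (
    "chr1\t248956422\t1500000\t0\n"
    "chr2\t242193529\t1500000\t0\n"
    "*\t0\t0\t12345\n"
)
total = total_mapped_reads(example_idxstats)
factor = per_million_scale_factor(total)
print(total, factor)
```

Lowering digits makes the rounding error visible: with 2 digits the same factor becomes 0.33 instead of 0.33333333, which is why the number of digits should not be set too low.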