Hi all!
I am working with ChiP-Seq data (Control and Treatment samples, with 2 replicates in each) and I have used DiffBind R package to finally obtain the differential binding sites.
First of all I tryed with the protocol "DiffBind: Differential binding analysis of ChIP-Seq peak data" to see how it works. But my results were not really goods: I only had 39 differential binding sites, with the default parameters.
So I took the "Reference Manual" and tryed with a lot of different parameters and options.
First I read the peaksets comparing two options: using FDR values, or p-values for the analysis.
ChIP_peaklist_FDR <- dba(sampleSheet="Sheet.csv", dir=system.file("extra", package="DiffBind"), config=data.frame(bUsePval=FALSE))
ChIP_peaklist_pval <- dba(sampleSheet="Sheet.csv", dir=system.file("extra", package="DiffBind"), config=data.frame(bUsePval=TRUE))
Both of them gave me the same result: 4 Samples, 11871 sites in matrix (19459 total)
So the 1ST QUESTION is, if both parameters could be used for DiffBind, I mean if one is more correct than the other one or if it is better to use one of them.
Then I counted the reads first with default parameters, then with summit=250 option and also with summit=500 option, as my ChIP sequences came from a histon modification with broad peaks.
ChIP_count_FDR <- dba.count(ChIP_peaklist_FDR)
ChIP_count_FDR_250 <- dba.count(ChIP_peaklist_FDR, summits=250)
ChIP_count_FDR_500 <- dba.count(ChIP_peaklist_FDR, summits=500)
ChIP_count_pval <- dba.count(ChIP_peaklist_pval)
ChIP_count_pval_250 <- dba.count(ChIP_peaklist_pval, summits=250)
ChIP_count_pval_500 <- dba.count(ChIP_peaklist_pval, summits=500)
All of them resulted in apparently same matrix values:
4 Samples, 11871 sites in matrix
But the FRiP values changed depending on used options. without summit specification it was between 0,07-0,08. In summits=250 it was 0.02 and in summits=500 it was 0.03
So my 2ND QUESTION is if you think they are very small values for the proportion of reads that overlap a peak in the consensus peakset.
Then I did the contrast of Contol samples vs Treatment samples and finally de DB step with method=DBA_ALL_METHODS to compare EDGER and DESEQ2 normalizations. And also I used bFullLibrarySize = TRUE or FALSE to compare the differential binding sites values taking into acount the full library size for scalling the normalization.
ChIP_DB_FDR_TRUE <- dba.analyze(ChIP_count_FDR_contrast, method=DBA_ALL_METHODS, bFullLibrarySize = TRUE)
ChIP_DB_FDR_FALSE <- dba.analyze(ChIP_count_FDR_contrast, method=DBA_ALL_METHODS, bFullLibrarySize = FALSE)
ChIP_DB_FDR_250_TRUE <- dba.analyze(ChIP_count_FDR_250_contrast, method=DBA_ALL_METHODS, bFullLibrarySize = TRUE)
ChIP_DB_FDR_250_FALSE <- dba.analyze(ChIP_count_FDR_250_contrast, method=DBA_ALL_METHODS, bFullLibrarySize = FALSE)
ChIP_DB_FDR_500_TRUE <- dba.analyze(ChIP_count_FDR_500_contrast, method=DBA_ALL_METHODS, bFullLibrarySize = TRUE)
ChIP_DB_FDR_500_FALSE <- dba.analyze(ChIP_count_FDR_500_contrast, method=DBA_ALL_METHODS, bFullLibrarySize = FALSE)
ChIP_DB_pval_TRUE <- dba.analyze(ChIP_count_pval_contrast, method=DBA_ALL_METHODS, bFullLibrarySize = TRUE)
ChIP_DB_pval_FALSE <- dba.analyze(ChIP_count_pval_contrast, method=DBA_ALL_METHODS, bFullLibrarySize = FALSE)
ChIP_DB_pval_250_TRUE <- dba.analyze(ChIP_count_pval_250_contrast,method=DBA_ALL_METHODS, bFullLibrarySize = TRUE)
ChIP_DB_pval_250_FALSE <- dba.analyze(ChIP_count_pval_250_contrast,method=DBA_ALL_METHODS, bFullLibrarySize = FALSE)
ChIP_DB_pval_500_TRUE <- dba.analyze(ChIP_count_pval_500_contrast,method=DBA_ALL_METHODS, bFullLibrarySize = TRUE)
ChIP_DB_pval_500_FALSE <- dba.analyze(ChIP_count_pval_500_contrast,method=DBA_ALL_METHODS, bFullLibrarySize = FALSE)
3RD QUESTION: Which is the best way of carrying out the differential analysis, taking the bFullLibrarySize in TRUE or in FALSE when comparing 2 replicates of Control and Treatment sample. And if someone could explain me better the difference between the two options as I did not understand them very well.
Finally after doing the dba.analyze part, I could compare all the used options in DiffBind site term, and surprisingly I found really different results, as others have commented before in the post https://www.biostars.org/p/278191/:
FDR FDR_250 FDR_500 pval pval_250 pval_500
EDGER.TRUE 204 159 184 1207 1255 1204
EDGER.FALSE 205 158 184 1217 1270 1201
DESEQ2.TRUE 39 25 35 353 261 337
DESEQ2.FALSE 72 36 47 448 287 379
In the mentioned post the experts commented that the differences were because of the normalization method, but looking at this results, I don't know what option to chose as the better one. I don't know if the only criterion must be to select the one with more DB sites.
4TH QUESTION: What things do I need to take into account to decide what results believe and use for downstream steps?
Furthermore, when looking to the MAplot I can observe that I have the same plots when using bNORMALIZE=FALSE and method=DBA_DESEQ2. 5TH QUESTION, does it mean that in my case the DESEQ2 normalization doesn't work?
Sorry for this long post, and I hope someone could answer and help me with these issues (Rory Stark if you are there please help me!). I have been trying a lot of different parameters, finding answers in biostars and other places, in google, but still have some difficulties to understand things and take some final decisions, so I would be very grateful if someone can advise me which results to believe.
If more information about comands and plots is needed please tell me.
Thanks in advance,
Iraia