Hi!
When using macs2 peak caller to ChIP-seq data, I pick up some pvalues from --cutoff-analysis and test the pvalues until I got a reasonable balance between accurate peak call and false positive calls of background noises. But I found the best pvalues differs between samples. For example, given p= 0.0001 was the best for CTCF.rep1, performing peak call -pvalue 0.0001 to CTCF.rep2 ended up losing most of peaks, and instead, p=0.01 turn out the best for CTCF.rep2. This is maybe due to the difference of coverage of each sample.
I have four samples of CTCF peaks. Control.rep1, control.rep2, treatment.rep1 and treatment.rep2. The impression by my eye suggested pvalues of 0.01, 0.001, 0.0001 and 0,00001 are the best, respectively. When peak calling by p=0.01, I got a lot of background noise in treatment.rep2 sample, so I want to use p=0.01 to control.rep1 and p=0.00001 to treatment.rep2, if possible.
I think there are some options.
- Using a single p value to all of these samples.
- p=0.01 for the control samples and p=0.0001 for the treatment samples. (the same p to each treatment group)
- Or, maybe use the same p to each replicate.
- Using four independent p values to each samples.
Is it OK to use different p values, or I should use a single cutoff p value to all of my samples?
Thanks in advance, Taisuke
First of all it would be nice to know about your replicates, have you consider a batch effect? Also, have you done any trimming in your fastq files? and were all the samples mapped using the same parameters? Saying this, in most of the case the reproducibility of peaks depends on sequencing depth : do you have a similar coverage in both replicates to detect the peaks? If this is the case, I would call the peaks using the same p value.
Thank you for replying!
I did the CUT&RUN on the same day and submitted the samples to the same company, so the batch effect is considerably low. I trimmed the fastq file using a program in cutruntools and mapped them with the same parameters.
The mapping summary is as following;
As you can see, the coverage is very different in one sample. If using different p values are not acceptable way, down-sampling of the other samples may help get a reasonable result. What do you think?
Looking at your depth, it seems that there is something wrong with rep 2 SH, because rest of the samples have nearly two times more... could you figure what is wrong with it? An error with the sequencing? missing something during trimming? mapping? Saying that, if you thing everything is good with rep2, I will try to sequence it again (if possible) but if not, I would not include it in the analysis.
The file size of SH.rep2_R1_001.fastq.gz and R2_001.fastq.gz are much smaller than the others. So, simply the seq data I got was fewer in this sample. Otherwise, FASTQ qualities are similar, alignment rates are similar... Also, the duplication rate is high in both of these rep2 samples around 30%. Though CUT&RUN tends to create duplicated reads, but still this seems high. Maybe, the library prep material was too low and saturated in these samples and got fewer data in SH.rep2??? Anyway, something seems to go wrong with these. Excluding these rep2 samples may be safer. Thanks a lot.
seems to be the problem