I have several ChIP-seq datasets that include replicates.
First, I aligned the FASTQ files to the reference genome separately, then merged the resulting BAM files into one larger BAM file and called peaks with MACS. However, many papers do not merge the BAM files; instead, they call peaks on each replicate separately and then merge the peaks produced by MACS. Does anyone know which method is better? And how do I merge the peaks generated by MACS?
Written 7.6 years ago by Ben; updated 19 months ago by Ram.
Check the column 11 values (the quality tag). If the replicates have values close to each other, you can merge those samples and do a single round of peak calling. Otherwise, call peaks separately and then merge the peaks, or take the common peaks, from the separate peak calls.
COL11: QualityTag: Quality tag based on thresholded RSC (codes: -2:veryLow, -1:Low, 0:Medium, 1:High, 2:veryHigh)
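As a rough illustration of the column 11 check, here is a minimal Python sketch. It assumes the cross-correlation results are tab-separated lines with the quality tag in the 11th column, as described above. The file names, the numbers in `EXAMPLE_LINES`, and the `max_diff` cutoff are all made up for illustration and are not recommendations.

```python
# Hypothetical result lines: tab-separated, 11 columns, QualityTag last.
# All values below are invented purely to show the parsing.
EXAMPLE_LINES = [
    "rep1.bam\t10000000\t180\t0.25\t50\t0.20\t1500\t0.15\t1.67\t2.00\t2",
    "rep2.bam\t12000000\t175\t0.24\t50\t0.20\t1500\t0.16\t1.50\t1.25\t1",
]


def quality_tag(line):
    """Return (filename, QualityTag) from one tab-separated result line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated columns")
    return fields[0], int(fields[10])


def tags_are_close(lines, max_diff=1):
    """True if all QualityTag values differ by at most max_diff.

    max_diff is an assumed, arbitrary cutoff for "close to each other".
    """
    tags = [quality_tag(line)[1] for line in lines]
    return max(tags) - min(tags) <= max_diff


if __name__ == "__main__":
    for name, tag in map(quality_tag, EXAMPLE_LINES):
        print(name, tag)
    if tags_are_close(EXAMPLE_LINES):
        print("tags are close: pooling the replicates may be reasonable")
    else:
        print("tags differ: call peaks separately")
```

What counts as "close" is a judgment call; the cutoff here only makes the comparison explicit.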
The best approach is to call peaks separately on each replicate (make sure to use input) and then assess the quality of each replicate with either:
phantompeakqualtools, if you have single-end read data (reference: https://sites.google.com/site/anshulkundaje/projects/idr),
OR
ChiLin (https://www.ncbi.nlm.nih.gov/pubmed/27716038), if you have paired-end data.
Please remember that SPP can only be used for single-end read data, so you are better off using the MACS2 peak caller.
In recent papers, calculating Pearson's correlation of read density over overlapping peaks between replicates is regarded as a better approach than IDR, so you should give it a try as well.
Then select only those replicates that overlap significantly.
Afterwards, you can merge the peaks from each replicate. It is best to perform downstream analysis only on the peaks that overlap.
Use BEDTools to merge the peaks.
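To make "merging peaks" concrete, here is a minimal pure-Python sketch of an interval merge, roughly the operation a tool like BEDTools performs when merging. It assumes peaks are `(chrom, start, end)` tuples with half-open coordinates. The function name `merge_peaks` and the example coordinates are hypothetical.

```python
def merge_peaks(peaks):
    """Merge overlapping or book-ended intervals, chromosome by chromosome.

    peaks: iterable of (chrom, start, end) tuples with start < end.
    Returns a sorted list of merged (chrom, start, end) tuples.
    """
    merged = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or touches) the previous interval: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Made-up peaks from two replicates, pooled and then merged.
rep1 = [("chr1", 100, 200), ("chr1", 500, 600)]
rep2 = [("chr1", 150, 250), ("chr2", 10, 50)]
union = merge_peaks(rep1 + rep2)
```

Note that this gives the union of all peaks, including peaks found in only one replicate. Restricting to peaks seen in both replicates is a different operation.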
Could you comment on the differences between IDR and Pearson? I understand what each approach is doing, but if Pearson's correlation for the read counts at the peak summits is, let's say, >= 0.9, is it still possible that IDR would mark these two replicates as unacceptable? So essentially, is a good linear correlation sufficient to assess the reproducibility of a replicate?
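For reference, here is a minimal sketch of Pearson's correlation on per-peak read counts, with made-up counts. It only illustrates one general property of the coefficient: a single very strong peak can dominate it. It is not an implementation of IDR, which works on the ranks of peak scores rather than on the raw values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented read counts at five shared peaks in two replicates.
# The four weak peaks are ordered oppositely in the two replicates,
# but the single strong peak pulls the coefficient close to 1.
rep1_counts = [10, 12, 11, 13, 500]
rep2_counts = [13, 11, 12, 10, 480]

r_all = pearson(rep1_counts, rep2_counts)            # close to 1
r_weak = pearson(rep1_counts[:4], rep2_counts[:4])   # -1 for these values
```

So a high linear correlation on its own does not show that the peaks are ranked consistently between replicates.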
If these are biological replicates, follow the ENCODE IDR analysis. Assess signal-to-noise quality with SPP using cross-correlation analysis, as EagleEye suggested. Also run CHANCE in parallel to understand the quality of the signal. Finally, call peaks with MACS2 (I hope you are using the latest version). MACS2 can also call peaks on multiple samples at once, given one input and all the BAM files for your samples.
Check the link
Please be reminded that the SPP or IDR protocol can only be used for single-end read data. So it is better to use the MACS2 peak caller, which can handle both single-end and paired-end data. Please check my reply for more details.
OP did not mention whether it's SE or PE.
Thanks for your suggestions! But I have another question: you said that I should merge the common peaks from multiple peak calls. However, what are the common peaks? In fact, I do not know how to merge peaks from multiple files.
Use BEDtools intersect.
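To illustrate what "common peaks" means here, a minimal pure-Python sketch: keep each peak from one replicate that overlaps at least one peak in the other by 1 bp or more. It assumes peaks are `(chrom, start, end)` tuples with half-open coordinates. The names `overlaps` and `common_peaks` and the example coordinates are hypothetical. An interval tool does the same job far more efficiently on real files.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def common_peaks(rep1, rep2):
    """Peaks from rep1 that overlap at least one peak in rep2.

    Simple quadratic scan: fine for a sketch, slow for genome-wide peak sets.
    """
    return [p for p in rep1 if any(overlaps(p, q) for q in rep2)]


# Made-up peaks from two replicates.
rep1 = [("chr1", 100, 200), ("chr1", 500, 600)]
rep2 = [("chr1", 150, 250), ("chr2", 10, 50)]
shared = common_peaks(rep1, rep2)
```

The result reports the rep1 coordinates of the shared peaks. Whether to require a minimum overlap fraction, or to report only the overlapping portion, is a separate choice.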