I want to identifying consensus peaks from several biological replicates of a ChIP-seq experiment using IDR (Irreproducibility Discovery Rate) and for that I need to create 2 pseudoreplicates from my merged bam file. So: merging the bam files from all replicates, shuffle the merged file, randomly create two replicates out of that (following the approach described here: https://hbctraining.github.io/Intro-to-ChIPseq/lessons/07_handling-replicates-idr.html). The problem is the shuffling part is so memory intensive that the script crashes. Do you have an alternative approach for this purpose? Here is my code:
Merging bam files:
samtools merge -f -@ 12 -o $tmp_dir/${prefix}_merged.bam "${bams[@]/#/$input_dir/}"
Sort the merged file:
samtools sort -@ 12 -o $tmp_dir/${prefix}_merged_sorted.bam $tmp_dir/${prefix}_merged.bam
Save the header:
samtools view -H $tmp_dir/${prefix}_merged_sorted.bam > $tmp_dir/${prefix}_header.sam
Find the number of reads per pseudoreplicate:
nlines=$(samtools view -c $tmp_dir/${prefix}_merged_sorted.bam)
half=$((nlines / 2))
Shuffle and split:
samtools view $tmp_dir/${prefix}_merged_sorted.bam | shuf | split -d -l $half - $tmp_dir/${prefix}_pseudo
Add the header for two pseudorepliactes:
cat $tmp_dir/${prefix}_header.sam $tmp_dir/${prefix}_pseudo00 | samtools view -bS - > $output_dir/${prefix}_pseudo1.bam
cat $tmp_dir/${prefix}_header.sam $tmp_dir/${prefix}_pseudo01 | samtools view -bS - > $output_dir/${prefix}_pseudo2.bam
I do not follow. If you have biological replicates, why not use these rather than created artificial replication?
Pseudoreplicates are required by IDR tool to identify consensus peaks across biological replicates as shown in this tutorial: https://hbctraining.github.io/Intro-to-ChIPseq/lessons/07_handling-replicates-idr.html Or have I misunderstood something here? I have done the peak calling separately for each biological replicate and now want to find the consensus peaks across them. There are many ways to do that and this time I wanted to use IDR approach.
I do not now this particular tutorial but the way I know and use IDR is to use the real replicates. Essentially, you call peaks in a lenient fashion per replicate, for example using a p-value (not FDR) cutoff of 0.05, and then apply IDR. The lenient cutoff is necessary because the IDR model needs some noise in the data to find the cutoff between consistent signal and background noise. Downside is that IDR can only take n=2 natively, so you could use your best two replicates, or do all pairwise combinations and keep peaks found by IDR in at least X%. It's always a bit cumbersome with this tool.
Thanks so much for the clarification! I checked the ENCODE pipeline again and apparently the pseudoreplicates are suitable for settings without replicates or when assessing individual replicates. In my case, we have high quality 4 biological replicates and I think I will follow what you suggested.