Pseudoreplicate creation from bam files
0
0
Entering edit mode
4 days ago
sarahmanderni ▴ 120

I want to identifying consensus peaks from several biological replicates of a ChIP-seq experiment using IDR (Irreproducibility Discovery Rate) and for that I need to create 2 pseudoreplicates from my merged bam file. So: merging the bam files from all replicates, shuffle the merged file, randomly create two replicates out of that (following the approach described here: https://hbctraining.github.io/Intro-to-ChIPseq/lessons/07_handling-replicates-idr.html). The problem is the shuffling part is so memory intensive that the script crashes. Do you have an alternative approach for this purpose? Here is my code:

Merging bam files:

samtools merge -f -@ 12 -o $tmp_dir/${prefix}_merged.bam "${bams[@]/#/$input_dir/}"

Sort the merged file:

samtools sort -@ 12 -o $tmp_dir/${prefix}_merged_sorted.bam $tmp_dir/${prefix}_merged.bam

Save the header:

samtools view -H $tmp_dir/${prefix}_merged_sorted.bam > $tmp_dir/${prefix}_header.sam

Find the number of reads per pseudoreplicate:

nlines=$(samtools view -c $tmp_dir/${prefix}_merged_sorted.bam)
half=$((nlines / 2))

Shuffle and split:

samtools view $tmp_dir/${prefix}_merged_sorted.bam | shuf | split -d -l $half - $tmp_dir/${prefix}_pseudo

Add the header for two pseudorepliactes:

cat $tmp_dir/${prefix}_header.sam $tmp_dir/${prefix}_pseudo00 | samtools view -bS - > $output_dir/${prefix}_pseudo1.bam
cat $tmp_dir/${prefix}_header.sam $tmp_dir/${prefix}_pseudo01 | samtools view -bS - > $output_dir/${prefix}_pseudo2.bam
chipseq IDR shuffle_bam • 369 views
ADD COMMENT
1
Entering edit mode

I do not follow. If you have biological replicates, why not use these rather than created artificial replication?

ADD REPLY
0
Entering edit mode

Pseudoreplicates are required by IDR tool to identify consensus peaks across biological replicates as shown in this tutorial: https://hbctraining.github.io/Intro-to-ChIPseq/lessons/07_handling-replicates-idr.html Or have I misunderstood something here? I have done the peak calling separately for each biological replicate and now want to find the consensus peaks across them. There are many ways to do that and this time I wanted to use IDR approach.

ADD REPLY
1
Entering edit mode

I do not now this particular tutorial but the way I know and use IDR is to use the real replicates. Essentially, you call peaks in a lenient fashion per replicate, for example using a p-value (not FDR) cutoff of 0.05, and then apply IDR. The lenient cutoff is necessary because the IDR model needs some noise in the data to find the cutoff between consistent signal and background noise. Downside is that IDR can only take n=2 natively, so you could use your best two replicates, or do all pairwise combinations and keep peaks found by IDR in at least X%. It's always a bit cumbersome with this tool.

ADD REPLY
0
Entering edit mode

Thanks so much for the clarification! I checked the ENCODE pipeline again and apparently the pseudoreplicates are suitable for settings without replicates or when assessing individual replicates. In my case, we have high quality 4 biological replicates and I think I will follow what you suggested.

ADD REPLY

Login before adding your answer.

Traffic: 1717 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6