Hello,
I am new to ChIP-seq data mining (by myself) and I am having a hard time comparing two ChIP-seq data sets. Could you tell me if my pipeline is correct and if I should add a normalization step somewhere? I am simply going through this with each one of my IP-Input files and looking at the genome browser. The problem is the IP of sample B is sequenced at a bigger depth than sample A and it seems the peaks are not all being called. When I compare to previously published data it seems I am missing some peaks on sample B. I thought about changing some thresholds on sample B but then the pipeline would be different from sample A... So, what should I do to ensure all my peaks are present in sample B but the data is still comparable to sample A? Thanks!!!
Samples: H3K4me3 and Input data on sample A and H3K4me3 and Input data on sample B.
Original file: BAM (from the sequences).
My pipeline:
- Sort pair-ended BAM files by read name:
samtools sort -n c3h.input.bam c3h.input.sorted
- Convert BAM files to FASTQ format:
bam2fastx -q -Q -A -o c3h.input.fastq -P -N c3h.input.sorted.bam
Map reads to the genome:
bowtie2 -p 8 -x ~/Data/Genomes/mm10/bt2idx/bt2idx -1 c3h.input.1.fastq -2 c3h.input.2.fastq -S c3h.input.sam&
Bowtie2 options:
End-to-end alignment (default) vs. local alignment (
--local
)Paired-end alignment:
-1 mate1.fq -2 mate2.fq
Concordant + discordant alignment (default) vs. concordant only (
--no-discordant
)Mixed mode: paired where possible; otherwise unpaired (default) vs. paired-only mode (
--no-mixed
)Overlapping with the other mate is concordant (default) vs. (
--no-overlap
)Containing the other mate is concordant (default) vs. (
--no-contain
)Dovetailing the other mate is discordant (default) vs. (
--dovetail
)Bowtie2 output:
Mapping quality: MAPQ
Peak analysis using MACS2:
macs2 callpeak -t ~/Data/BCCRC/chipseq/mapping/tt2.k4m3.sam -c ~/Data/BCCRC/chipseq/mapping/tt2.input.sam -f SAM -g mm -n tt2.k4m3 -B -q 0.01 --outdir ~/Data/BCCRC/chipseq/tt2.k4m3.out
MACS2 options:
-g
(genome size, mm=1.87e9)-B
/-bdg
(bedGraph format for pile-up files)-q
(The qvalue (minimum FDR) cutoff to call significant regions)
Thanks for the reply!
After trying what you suggested, MACS2 failed with an error message "Too few paired peaks (0) so I can not build the model! Broader your MFOLD range parameter may erase this error. If it still can't build the model, we suggest to use
--nomodel
and--extsize 147
or other fixed number instead." I checked the default MFOLD range and it is 5 - 50, which should be good enough. Any suggestions?The mfold range still relies upon the ChIP sample being adequately enriched, if not using
--nomodel --extsize N
will allow you to proceed. I was unclear whether you had paired reads or not. If you do then mfold is not required. If you are working with single-end reads cross-correlation can fail (be mislead) if too many reads fall into blacklist regions (https://sites.google.com/site/anshulkundaje/projects/blacklists). So removing such reads prior to peak calling can help.