Hi,
I have some ChIP-seq data for transcription factor binding in Arabidopsis thaliana (the model plant). The data is paired-end, with two replicates and a control (total input). I have trimmed and aligned the data and now have sorted, indexed BAM files (or BED files). Reads are 100 bp each, with average DNA fragment sizes between 300 and 450 bp depending on the sample.
Viewing the reads in IGV, I can see some regions (for two genes that we think are targets of the TF) that are highly enriched across the whole gene (rather than just the promoter region), as well as various bits of noise where both the ChIP and the Input control have large peaks.
When I try using MACS, I get a huge list of peaks that include those two genes. But when I look at these other "peaks" in IGV, the plots are almost exactly the same shape between the ChIP and Input. They are sometimes different sizes (presumably due to read count), but on a visual inspection they look almost identical. My call to MACS is something like:
macs -t TF_3ul_P_sorted.bam -c TF_Input_P_sorted.bam -f BAM -g 111755668 -n TF_3ul -B -s 100 -S --bw=350
I've been looking for peak-calling algorithms that are designed for paired-end reads, and I'm struggling. Many of the options only accept paired-end data in ELAND format, whatever that is, or I can't manage to install them. I'm using a Windows 7 machine with a VirtualBox VM running Ubuntu, and my Linux skills are fairly basic, which causes problems when installing some of the tools I find. Others only work on human/mouse data, not Arabidopsis, which makes them useless to me.
Can anyone suggest a peak-calling algorithm that takes paired-end data and successfully removes peaks that are the same shape in the Input control sample?
Thanks!
What do you mean by shape? Enrichment relates to coverage, not shape. If the coverage is substantially higher in the ChIP than in the Input, then the peaks are valid.
The coverage in these peaks is often lower in the ChIP sample than the Input, prior to total read normalisation at least. The image that I've (hopefully) attached in this comment shows an example of such a 'peak', with the ChIP sample as the first row and the Input control as the second row. The scales go up to 28,985 and 26,471 respectively (so slightly higher in the ChIP).
Well, the data for this region look identical. This is no longer an issue of peak detection: there is no differential coverage over this area, so no peaks should be called here.
If you think there should be differential enrichment here, then it might be sample mislabeling or some other error.
I agree with you, this looks identical and therefore not a peak. My problem is that I'm getting features such as this being called as a peak.
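One way to weed out features like this after peak calling is to compare ChIP and Input coverage per peak once both have been scaled to library size, and drop anything whose ratio is close to 1. The Python below is only a rough sketch of that idea, not a replacement for a peak caller. The function names, the pseudocount, the 2-fold cutoff, and all the read counts are made up for illustration; the per-peak counts would have to come from whatever counting tool you already use.

```python
def scaled_fold_enrichment(chip_count, input_count, chip_total, input_total,
                           pseudocount=1.0):
    """Return the ChIP/Input ratio after scaling both to reads per million.

    The pseudocount (an assumption, not a standard value) avoids division
    by zero when the Input has no reads in the region.
    """
    chip_rpm = (chip_count + pseudocount) * 1e6 / chip_total
    input_rpm = (input_count + pseudocount) * 1e6 / input_total
    return chip_rpm / input_rpm


def filter_peaks(peaks, chip_total, input_total, min_fold=2.0):
    """Keep peaks whose library-size-scaled fold enrichment is >= min_fold.

    `peaks` is a list of (name, chip_count, input_count) tuples.
    The 2-fold default is an arbitrary illustrative cutoff.
    """
    kept = []
    for name, chip_count, input_count in peaks:
        fold = scaled_fold_enrichment(chip_count, input_count,
                                      chip_total, input_total)
        if fold >= min_fold:
            kept.append((name, fold))
    return kept


# Hypothetical numbers: one region enriched in the ChIP, and one region
# that is equally tall in both ChIP and Input.
example_peaks = [
    ("candidate_target", 5000, 500),
    ("shared_artifact", 29000, 26500),
]
kept = filter_peaks(example_peaks, chip_total=10_000_000,
                    input_total=10_000_000)
for name, fold in kept:
    print(name, round(fold, 2))
```

With these invented counts only the first region passes the cutoff, because the second is about as tall in the Input as in the ChIP once both are scaled.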