I am about to do a bunch of experiments using ChIP-seq data. I picked a data set from GEO as an exercise, and for a given histone modification mark there are two files, replicate 1 and replicate 2.
Some peak-calling algorithms (at least the ones I found) ask for a control data set. Are the aligned tags in replicate 2 what they mean by control data?
Control data is just one sample that skipped the IP part of the experiment. It is sequenced in order to be able to correct for sequencing bias, chromosomal duplications, etc. I.e. you want to know if a 'peak' is real or if it's just the product of non-uniform sampling. The 'best' way to do it is to be able to compare to a sample that is exactly the same, but IP was not performed.
If you haven't performed your experiment, I'd recommend a few things in order to get better / decent results downstream:
Try to use paired-end reads (if possible at least 70bp). It helps in estimating the fragment size.
Have one 'control' per experiment (i.e. one sample that skips the IP part but is still sequenced).
If you don't blindly trust your antibody, perform biological replicates (yes, I know it costs twice as much).
Filter out reads with low mapping quality (e.g. MAPQ < 20).
Make sure your algorithm treats multiple mapping reads properly, or filter them out.
If you use MACS, make sure you run it multiple times using different min_fold parameters (and sometimes you must tune max_fold as well).
Perform saturation plots to see if you need more sequencing.
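To make the MAPQ filtering step above concrete, here is a minimal sketch that works on plain SAM text lines. The function name `filter_sam_by_mapq` and the toy reads are made up for illustration; it only relies on the SAM column layout (FLAG is column 2, MAPQ is column 5). In practice `samtools view -q 20` should give you the same kind of filtering on real BAM files.

```python
def filter_sam_by_mapq(sam_lines, min_mapq=20):
    """Yield header lines plus mapped alignments with MAPQ >= min_mapq.

    Hypothetical helper for illustration; expects tab-separated SAM records.
    """
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # keep header lines untouched
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        mapq = int(fields[4])   # column 5: mapping quality
        if flag & 0x4:          # 0x4 = read is unmapped
            continue
        if mapq >= min_mapq:
            yield line


# Toy records (SEQ and QUAL given as "*", which SAM allows).
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t42\t36M\t*\t0\t0\t*\t*",   # well mapped, kept
    "r2\t16\tchr1\t200\t3\t36M\t*\t0\t0\t*\t*",   # MAPQ 3, dropped
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",           # unmapped, dropped
]
kept = list(filter_sam_by_mapq(example))
```

Whether a MAPQ cut-off also removes multi-mapping reads depends on how your aligner assigns MAPQ, so check its documentation before relying on this for the multi-mapper point.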
Could you elaborate on your first and last points? On the first point, that paired-end reads are preferred: does this hold true for TFs, where you would expect peaks in the range of several hundred base pairs? Wouldn't the fragment size be determined by the size of the band that is cut out of the gel?
The size you cut from the gel is one parameter. But there are biases in other steps of the process (e.g. when fragments are amplified for sequencing). As a result you have a 'convolution' of several probability distribution functions, and the true average fragment length may be different from the size you selected.
If single-end reads are provided, the algorithm tries to estimate the fragment size from the offset between reads mapped to the positive and negative strands. If you have paired-end reads, the fragment was sequenced from both ends, so you know the fragment size directly.
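A rough sketch of both cases, to show the idea rather than what any particular peak caller does internally. For single-end data it scans shifts and keeps the one where positive-strand 5' ends best line up with negative-strand 5' ends; for paired-end data it just reads the template length (TLEN, column 9) from SAM records. The function names and the toy coordinates are invented for this example.

```python
from collections import Counter


def estimate_shift(plus_5p, minus_5p, max_shift=500):
    """Single-end case: return the shift d (in bp) that maximizes the number
    of positive-strand 5' ends matching a negative-strand 5' end at +d."""
    plus_counts = Counter(plus_5p)
    minus_counts = Counter(minus_5p)
    best_shift, best_score = 0, -1
    for d in range(max_shift + 1):
        score = sum(n * minus_counts.get(p + d, 0)
                    for p, n in plus_counts.items())
        if score > best_score:  # first best shift wins on ties
            best_shift, best_score = d, score
    return best_shift


def paired_fragment_sizes(sam_lines):
    """Paired-end case: collect TLEN once per properly paired fragment."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag, tlen = int(fields[1]), int(fields[8])
        # 0x2 = properly paired; TLEN is positive on the leftmost mate only,
        # so this counts each pair a single time.
        if flag & 0x2 and tlen > 0:
            sizes.append(tlen)
    return sizes


# Toy single-end data: strand 5' ends offset by 200 bp.
plus = [100, 150, 300, 420]
minus = [300, 350, 500, 620]
shift = estimate_shift(plus, minus)

# Toy paired-end data: one pair spanning 200 bp.
pairs = [
    "p1\t99\tchr1\t100\t60\t50M\t=\t250\t200\t*\t*",
    "p1\t147\tchr1\t250\t60\t50M\t=\t100\t-200\t*\t*",
]
sizes = paired_fragment_sizes(pairs)
```

Real data is much noisier than this, so the single-end estimate is only ever an estimate, which is the reason paired-end reads help.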
On the last point: once you have tuned MACS, you can sub-sample your data and re-run MACS on each subset (e.g. sub-sample 60%, 65%, 70%, ..., 95% of the reads). If the number of peaks keeps increasing as you add reads, this is a hint that you need more sequencing. On the other hand, if you see a 'plateau', you may have found "all the peaks".
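The sub-sampling loop could be sketched like this. Note that `call_peaks` is a placeholder for whatever you actually run (for example a wrapper that writes the subset to a file and launches MACS); `toy_call_peaks` below is a deliberately crude stand-in that counts reads per 200 bp bin, not a real peak caller, and the read positions are synthetic.

```python
import random
from collections import Counter


def saturation_curve(reads, call_peaks,
                     fractions=(0.6, 0.7, 0.8, 0.9, 1.0), seed=0):
    """Return [(fraction, number_of_peaks), ...] for random read subsets."""
    rng = random.Random(seed)  # fixed seed so the curve is reproducible
    curve = []
    for frac in fractions:
        k = int(round(frac * len(reads)))
        subset = rng.sample(reads, k)  # sampling without replacement
        curve.append((frac, len(call_peaks(subset))))
    return curve


def toy_call_peaks(positions, bin_size=200, min_reads=10):
    """Stand-in peak caller (assumption): bins holding >= min_reads reads."""
    counts = Counter(pos // bin_size for pos in positions)
    return [b for b, n in counts.items() if n >= min_reads]


# Synthetic reads: three clusters of 20 reads plus sparse background.
reads = [center + i for center in (1000, 5000, 9000) for i in range(20)]
reads += list(range(0, 10000, 500))

curve = saturation_curve(reads, toy_call_peaks)
for frac, n_peaks in curve:
    print(f"{frac:.0%} of reads -> {n_peaks} peaks")
```

Plot the number of peaks against the fraction of reads: a curve that is still climbing at 100% suggests the library is not saturated. Repeating each fraction with several seeds gives a feel for the sampling noise.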
Usually control data just means sonicated DNA from the same sample that hasn't been "chipped" with the antibody, commonly called "Input". Some people call this Sono-seq.
If you are working with a cancer cell line, the input file is important for figuring out where the chromosomal duplications are. Assuming you align to the reference genome (non-cancer human), you might see enrichment in certain regions that is simply due to duplication of those regions. As for the replicates: some groups merge the two replicates into one big file, other groups run the peak caller on both and just use whichever one looks better, and another approach I've seen is to take the intersection of the peaks from the two replicates. The way that I've been looking into doing things is to use an RNA-seq method to estimate the variance between the replicates and then call peaks.
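As a small illustration of the "intersection of two replicates" option, here is a sketch that keeps only the regions covered by a peak in both replicates. Peaks are assumed to be `(chrom, start, end)` tuples with half-open coordinates, as in BED files; the function name and the example peaks are made up. On real peak files a tool such as `bedtools intersect` is the usual way to do this.

```python
def intersect_peaks(rep1, rep2):
    """Return the regions shared by peaks of both replicates.

    Peaks are (chrom, start, end) tuples, half-open like BED intervals.
    Simple all-against-all comparison per chromosome; fine for a sketch,
    slow for very large peak lists.
    """
    by_chrom = {}
    for chrom, start, end in rep2:
        by_chrom.setdefault(chrom, []).append((start, end))

    shared = []
    for chrom, start, end in rep1:
        for start2, end2 in by_chrom.get(chrom, []):
            lo, hi = max(start, start2), min(end, end2)
            if lo < hi:  # the two intervals overlap by at least 1 bp
                shared.append((chrom, lo, hi))
    return sorted(shared)


rep1_peaks = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 150)]
rep2_peaks = [("chr1", 250, 400), ("chr1", 2000, 2100), ("chr2", 100, 200)]

shared_peaks = intersect_peaks(rep1_peaks, rep2_peaks)
print(shared_peaks)
```

Intersecting is conservative: a real peak that only just misses the threshold in one replicate is lost, which is one reason people also consider merging or variance-based approaches.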
And what if we don't have such data? Do we proceed with peak calling without a control, and what consequence does that have on the final results? That raises another question: what do we do with rep2? Are these technical replicates?
There is more than just a copy number problem: the release of chromatin is not equal in all genomic regions, e.g. you will get more reads in active regions. Hence, without an input control you cannot tell whether your peaks are IP-specific or just a systematic error caused by uneven chromatin release and/or copy number variation.
-> you need a control!
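A toy calculation shows why. The sketch below scales the input to the ChIP sequencing depth and computes a per-bin fold enrichment; this is a simplified illustration with made-up read counts, not the statistical model MACS or any other peak caller actually uses, and `fold_enrichment` and the pseudocount are assumptions of this example.

```python
def fold_enrichment(chip_counts, input_counts, pseudocount=1.0):
    """Per-bin ChIP / input ratio after scaling input to the ChIP depth."""
    chip_total = sum(chip_counts)
    input_total = sum(input_counts)
    if input_total == 0:
        raise ValueError("input control has no reads")
    scale = chip_total / input_total  # bring input to the ChIP library size
    return [
        (c + pseudocount) / (i * scale + pseudocount)  # pseudocount avoids /0
        for c, i in zip(chip_counts, input_counts)
    ]


# Made-up read counts in four bins. Bin 3 sits in an amplified or very
# accessible region: it is high in the ChIP sample AND in the input.
chip = [10, 80, 10, 100]
inp = [10, 10, 10, 70]

fe = fold_enrichment(chip, inp)
for i, value in enumerate(fe):
    print(f"bin {i}: ChIP={chip[i]:3d} input={inp[i]:3d} fold={value:.2f}")
```

Looking at the ChIP counts alone, bin 3 is the strongest "peak". Relative to the input it is not enriched at all, while bin 1 is. Without the input you have no way of telling these two cases apart.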
By the way, if you want to download MACS, the text in the puzzle says "username: macs, password: chipseq" (it's a Caesar-ciphered text).