Question

ChIP-seq best practice data analysis

3

5.4 years ago

elb ▴ 260

Hi guys, is there a complete gold-standard/best-practice online tutorial (with tools and commands) that I could follow as a template for computationally analysing methylation/acetylation/PolII/CTCF ChIP-seq data from a NextSeq 500? I mean from alignment to the human genome through to downstream analysis.

Thank you in advance

ChIP-Seq • 3.7k views

ADD COMMENT • link updated 5.4 years ago by ATpoint 87k • written 5.4 years ago by elb ▴ 260

0

There is rarely a gold standard (other than for diagnostic NGS protocols) for NGS analyses. Every dataset is different and you understand your own dataset/experiment better than anyone else. Here are previous threads with links to get you started: "ChIP-seq. analysis Tutorial for Dummies" and "Visualization for ChIP-seq analysis".

ADD REPLY • link 5.4 years ago by GenoMax 150k

0

Could you be a bit more specific? What is the analysis goal, and where do you get stuck?

ADD REPLY • link 5.4 years ago by ATpoint 87k

0

Hi. I have to analyze ChIP-exo data for nascent RNAs, but I'm totally new to ChIP analysis (I normally do RNA-seq analysis), so my idea was to practice a little bit first.

ADD REPLY • link 5.4 years ago by elb ▴ 260

score 3 · Answer 1 · 2019-10-27

Well, it comes down to the normal quality control workflow:

1) Alignment: check for good alignment rates. In human and mouse I typically get > 90% when using standard read lengths (> 50 bp). Anything below that is at least suspicious; in that case, check for contamination or sequencing errors (FastQC, BLAST the unmapped reads).

2) Check the sequence duplication rate. A high duplication rate might indicate low library complexity and potentially poor IP efficiency. I always pipe the alignment through samblaster to mark duplicates without removing them at this point; samtools flagstat can then count the number of duplicates.

3) Call peaks and calculate the FRiP (fraction of reads in peaks, i.e. how many reads overlap with peaks and how many do not). This is a measure of the signal-to-noise ratio. I am personally sceptical if it is below 5% (or 1% for some antibodies like H3K27ac). This is highly antibody-dependent, but from what I've seen, below 1% is critical: you will see it on the genome browser (see point 4), as the sample is basically one noisy track without clear separation of peaks and background. Still, this is my experience and cannot be generalized. Also, be critical about the number of peaks. If you ChIP an important transcription factor or a histone modification but only get around 500 peaks, something is wrong. Tens of thousands of peaks are probably expected.

4) Check the data manually (= by eye) on a genome browser. Make sure that you can visually see a good separation between peaks and background noise. Also check some positive control regions, if you have any that you know from experience or the literature must show a strong, high-quality IP enrichment.

5) Perform PCA based on the log2-normalized read counts. Check whether the replicates cluster together and that there are no outliers that might indicate issues with IP efficiency or chromatin integrity. The Pearson correlation of the read counts in peaks between samples might also be informative.

6) Compare your data with published datasets in terms of quality. If you are unsure whether poor quality is the product of a poor antibody, see if there are published data generated with the same antibody (and maybe even in a similar cell type). There are antibodies that will typically produce suboptimal ChIPs (in my experience e.g. CEBPA or H3K27ac for human/mouse), while others (like H3K4me1) tend to give really good enrichments. If antibody quality is an issue, you will probably need more replicates than with a good antibody to ensure the results are reliable.
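
To make the numbers in steps 1 to 3 concrete, here is a minimal Python sketch of how the mapping rate, duplication rate and FRiP could be computed once you have the counts. It is only an illustration, not part of any particular pipeline: the function names (`fraction`, `frip`, `qc_flags`) are hypothetical, the toy reads and peaks are made up, treating each read as a single position on one chromosome is a simplifying assumption, and defining the duplication rate as duplicates over mapped reads is one possible convention. The cutoffs are the personal rules of thumb quoted above, not general standards.

```python
from bisect import bisect_right


def fraction(part, total):
    """Return part / total, or 0.0 when total is zero."""
    return part / total if total else 0.0


def frip(read_positions, peaks):
    """Fraction of reads in peaks (FRiP) for a single chromosome.

    read_positions: one position per read, e.g. the alignment midpoint
        (a simplifying assumption for this sketch).
    peaks: non-overlapping half-open (start, end) intervals.
    """
    peaks = sorted(peaks)
    starts = [start for start, _ in peaks]
    total = 0
    in_peaks = 0
    for pos in read_positions:
        total += 1
        # Index of the last peak that starts at or before this position.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < peaks[i][1]:
            in_peaks += 1
    return fraction(in_peaks, total)


def qc_flags(mapping_rate, frip_value):
    """Apply the rule-of-thumb cutoffs from the answer above."""
    flags = []
    if mapping_rate < 0.90:
        flags.append("low mapping rate")
    if frip_value < 0.01:
        flags.append("FRiP below 1%")
    elif frip_value < 0.05:
        flags.append("FRiP below 5%")
    return flags


# Toy numbers, made up for illustration only.
peaks = [(100, 200), (500, 600)]
reads = [10, 50, 120, 150, 199, 200, 550, 599, 700, 900]
total_reads, mapped, duplicates = 10000, 9200, 1380

mapping_rate = fraction(mapped, total_reads)
duplication_rate = fraction(duplicates, mapped)
frip_value = frip(reads, peaks)
flags = qc_flags(mapping_rate, frip_value)
```

In practice the read, mapped and duplicate counts would come from something like the samtools flagstat report mentioned in step 2, and the reads-in-peaks count from intersecting the alignments with the peak file; the sketch only shows the arithmetic behind the ratios.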