I am new to Bioinformatics, with minimal experience in Terminal or scripting languages. There is a large chance that I will not be able to convey what I want to accomplish, please ask questions so that I may clarify.
Statement of Problem: I have run ChIP-Seq experiments on several enhancers and protein markers (I will refer to all of these as marks). I have already gone ahead and aligned the data using Bowtie, and then called peaks using MACS2. As a follow-up to a previous paper published by another lab member I want to find the co-occupancy of three of these marks in relation to the transcription start site (TSS) to determine how many of these peaks overlap and then create a venn diagram to show these overlaps / intersections (I am not sure on the terminology). The problem is I am not exactly sure how to do this and after almost 8+ hours of attempts I have gotten absolutely no-where, though I will post my attempts.
For example, from the Pol II peaks I have identified at TSS or intergenic regions, now I would like to define overlays with other factors/marks. MY PEAK FILES ARE IN .BED FORMAT.
I want to look at overlay between Pol II peaks and H3K4me1/H3K27Ac at intergenic regions.
How do I accomplish my goals?
Previous Study Information: As mentioned earlier an ex-lab member wrote a paper and was able to achieve this. He left behind a python script which I will copy down in a bit, but I can't seem to get it to work, nor do I really want to. I don't want to blindly trust the output of the script without knowing how and what it is doing. The image below is from the ex-lab member's paper and is basically what I would like to accomplish:
I was able to run the Python script the ex-lab member used to generate these venn diagrams for Polymerase II only, and from what I was able to understand the script basically outputs a "total # of intersections" line that he then plugged into somewhere in order to generate these diagrams.
I have uploaded the script to github so that you may view it here: https://github.com/Vov-Bio/findcooccupancy.py/blob/master/Script
What I Have Done: I want to provide a bit of what I have done so far, not only to show that I have attempted to solve the problem, but so that someone can steer me in the right direction.
First thing I did was research and see if I could find any information on how to overlap my data. The only solid topic I could find was here: How to determine genome wide Co-occupancy between two transcription factors?
In this post someone someone says "you can make venn diagram to show common and unique peaks of two transcription factors and could also make a line plot around TSS showing average occupancy of both" using https://usegalaxy.org/
This defines what I want to do almost perfectly.
Other searches brought up two interesting options:
- BEDtools
- Homer
I attempted to use BEDtools multiintersectBED function but could not find the peak intersection counts that I wanted.
I then attempted to use Homer mergePeaks function to find peak intersection counts and got 'something', but I am almost positive it is incorrect.
Essentially I ran the following function:
mergePeaks -d 250 -matrix -ghist ~/data/TSSforAllAnnotatedGenes.txt ~/data/Pol-II-Chip.MACS2_summits.bed
and received
/User/data/TSSforAllAnnotatedGenes.txt 0 13062
/User/data/Pol-II-Chip.MACS2_summits.bed 11608 0
I assumed that 11,608 would be my Pol-II number for my venn diagram, but after attempting to run the same command for other peak files I am not so sure.
Any help is appreciated.
See if this online course on Bioconductor from Kasper Daniel Hansen can give you some idea on how to do the intersection. In particular check the AnnotationHub lecture for retrieving TSS and the GenomicRanges for the intersections. The video 12 and 13 of week 1 show a similar case than what you describe, at least as a start.