I need a script or way to search for pol II summits that are greater than -5k +5k from the TSS start site all the way until the beginning of the next gene.
so basically something like this
-------(-5k)-------[|tss|==GENE1===]----(-5k)----(+5k)----[=====GENE2=====]-------(+5k)----------------------(-5k)-------------[===GENE3===]
So I need to find pol II summits > than -/+5k away from the TSS and then have it search for summits till it reaches the -5k region of the next gene.
So if you look at the figure I want it to search for Pol II summits from (-5k of GENE 1) to (+5k of GENE 2). Does that make sense?
Then using those summits I want to search for H3K4me3 marks -250 250 from the Pol II summit.
I have a current script that does something similar but it simply searches for summits in a certain integer region from the TSS (for instance -250 downstream, +1000 upstream) and then searches for all peaks of interested mark in a 250 window from that summit, but I need to modify it to accomplish this new task and I simply have no idea how.
This is a section of the script that I would need to update to search for summits from TSS.
cntdict = {}
cntdict["PolII"] = 0
for mark in marks.split(","):
cntdict[mark] = 0
counter = 1
totalhits = 0
intersectlist = []
for locus in loci:
print "Working on line# " + str(counter)
counter += 1
direction = str(locus[5])
if direction == "+":
tss = int(locus[1])
cur.execute("select summit from CHIP_PEAK where mark = '%s' and chr = '%s' and ((summit > %i and summit < %i))" % ("PolII-Chip",str(locus[0]),tss-2500,tss+2500))
print "select summit from CHIP_PEAK where mark = '%s' and chr = '%s' and ((summit > %i and summit < %i))" % ("PolII-Chip",str(locus[0]),tss-2500,tss+2500)
else:
tss = int(locus[2])
cur.execute("select summit from CHIP_PEAK where mark = '%s' and chr = '%s' and ((summit > %i and summit < %i))" % ("PolII-Chip",str(locus[0]),tss-2500,tss+2500))
print "select summit from CHIP_PEAK where mark = '%s' and chr = '%s' and ((summit > %i and summit < %i))" % ("PolII-Chip",str(locus[0]),tss-2500,tss+2500)
EDIT: Sorry for the lack of code formatting, I'm trying to figure it out now.
Hmm, I see how this might work. I'm unsure what number would be appropriate for the "number of base pairs" to add to the start and end coordinates since all the regions between the gene lengths vary. Would I use a rather large number and let bedtools clip the coordinates as it does by default?
Add 5000, since that's what you specified in your question. You then exclude all intersecting summits and have your list of PolII summits to proceed with.
bedtools doesn't take into account that some intergenic regions between genes might be less than 5kb from the TSE does it? In that case it wouldn't give me intergenic pol II summits, but summits inside that gene as well
I'd want to exclude those summits that get pulled if they fall into that category
Just flank both ends of the transcript bounds. Using a refseq BED file from UCSC would be convenient for this.
Now if I want to overlap those Pol II summits I managed to pull up with say chip seq H3K4me1 marks in a -250, +250 window from the pol II summit to identify possible enhancer sites, can i use intersect for this too?
Yup, just flank either the H3K4me1 or PolII peaks by 250 on each side and then run the intersect.
This is what I actually did yesterday, so it's good to hear that that was the way to go about it.
I'm running into a problem where I'm picking up false positives (pol summits inside intragenic sites, or inside genes) ... I tried to weed these out by taking
-v
of my pol summit files + my flanked tss files and then intersecting (using a-v
command so I only keep intergenic sites) it with a pol summit file + tss file (since this should give the intragenic summits, no-v
option here) ... It made sense when I drew it down on paper, but these false positives obviously mean that I'm doing something wrong or my data is incorrect.I'm assuming there is multiple tss and tse within a gene which means it's impossible to weed out all the pol II summits inside genes (right now I'd say 20% of data is false positives). What do you suggest or think?
If you intersect with "transcript" or "gene" entries from a GTF (or appropriate BED) file with the -v option you will never get intragenic peaks.
Also when I flank the Pol II summits, it gives me some that have a end site than the start site (ex: chr1: 1000-999) so it messes up the H3K1 and Pol intersect.