Python Script Help for Cooccupancy
1
0
Entering edit mode
9.2 years ago
dally ▴ 210

I need a script or way to search for pol II summits that are greater than -5k +5k from the TSS start site all the way until the beginning of the next gene.

so basically something like this

-------(-5k)-------[|tss|==GENE1===]----(-5k)----(+5k)----[=====GENE2=====]-------(+5k)----------------------(-5k)-------------[===GENE3===]

So I need to find pol II summits > than -/+5k away from the TSS and then have it search for summits till it reaches the -5k region of the next gene.

So if you look at the figure I want it to search for Pol II summits from (-5k of GENE 1) to (+5k of GENE 2). Does that make sense?

Then using those summits I want to search for H3K4me3 marks -250 250 from the Pol II summit.

I have a current script that does something similar but it simply searches for summits in a certain integer region from the TSS (for instance -250 downstream, +1000 upstream) and then searches for all peaks of interested mark in a 250 window from that summit, but I need to modify it to accomplish this new task and I simply have no idea how.

This is a section of the script that I would need to update to search for summits from TSS.

cntdict = {}
cntdict["PolII"] = 0

for mark in marks.split(","):
    cntdict[mark] = 0

counter = 1
totalhits = 0
intersectlist = []

for locus in loci:
    print "Working on line# " + str(counter)
    counter += 1
    direction = str(locus[5])

    if direction == "+":
        tss = int(locus[1])
        cur.execute("select summit from CHIP_PEAK where mark = '%s' and chr = '%s' and ((summit > %i and summit < %i))" % ("PolII-Chip",str(locus[0]),tss-2500,tss+2500))
        print "select summit from CHIP_PEAK where mark = '%s' and chr = '%s' and ((summit > %i and summit < %i))" % ("PolII-Chip",str(locus[0]),tss-2500,tss+2500)
    else:
        tss = int(locus[2])
        cur.execute("select summit from CHIP_PEAK where mark = '%s' and chr = '%s' and ((summit > %i and summit < %i))" % ("PolII-Chip",str(locus[0]),tss-2500,tss+2500))
        print "select summit from CHIP_PEAK where mark = '%s' and chr = '%s' and ((summit > %i and summit < %i))" % ("PolII-Chip",str(locus[0]),tss-2500,tss+2500)

EDIT: Sorry for the lack of code formatting, I'm trying to figure it out now.

python cooccupancy • 2.6k views
ADD COMMENT
2
Entering edit mode
9.2 years ago

Much of bioinformatics is knowing when not to reinvent the wheel. In this case, you're looking for bedtools flank and bedtools intersect -v.

ADD COMMENT
0
Entering edit mode

Hmm, I see how this might work. I'm unsure what number would be appropriate for the "number of base pairs" to add to the start and end coordinates since all the regions between the gene lengths vary. Would I use a rather large number and let bedtools clip the coordinates as it does by default?

ADD REPLY
0
Entering edit mode

Add 5000, since that's what you specified in your question. You then exclude all intersecting summits and have your list of PolII summits to proceed with.

ADD REPLY
0
Entering edit mode

bedtools doesn't take into account that some intergenic regions between genes might be less than 5kb from the TSE does it? In that case it wouldn't give me intergenic pol II summits, but summits inside that gene as well

I'd want to exclude those summits that get pulled if they fall into that category

ADD REPLY
0
Entering edit mode

Just flank both ends of the transcript bounds. Using a refseq BED file from UCSC would be convenient for this.

ADD REPLY
0
Entering edit mode

Now if I want to overlap those Pol II summits I managed to pull up with say chip seq H3K4me1 marks in a -250, +250 window from the pol II summit to identify possible enhancer sites, can i use intersect for this too?

ADD REPLY
0
Entering edit mode

Yup, just flank either the H3K4me1 or PolII peaks by 250 on each side and then run the intersect.

ADD REPLY
0
Entering edit mode

This is what I actually did yesterday, so it's good to hear that that was the way to go about it.

I'm running into a problem where I'm picking up false positives (pol summits inside intragenic sites, or inside genes) ... I tried to weed these out by taking -v of my pol summit files + my flanked tss files and then intersecting (using a -v command so I only keep intergenic sites) it with a pol summit file + tss file (since this should give the intragenic summits, no -v option here) ... It made sense when I drew it down on paper, but these false positives obviously mean that I'm doing something wrong or my data is incorrect.

I'm assuming there is multiple tss and tse within a gene which means it's impossible to weed out all the pol II summits inside genes (right now I'd say 20% of data is false positives). What do you suggest or think?

ADD REPLY
0
Entering edit mode

If you intersect with "transcript" or "gene" entries from a GTF (or appropriate BED) file with the -v option you will never get intragenic peaks.

ADD REPLY
0
Entering edit mode

Also when I flank the Pol II summits, it gives me some that have a end site than the start site (ex: chr1: 1000-999) so it messes up the H3K1 and Pol intersect.

ADD REPLY

Login before adding your answer.

Traffic: 1937 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6