You could convert your TFBS data to BED:
#!/usr/bin/env perl
use strict;
use warnings;
my @line = map { [] } 1..6;
my $idx = 0;
while(<STDIN>) {
chomp;
if ($_ =~ '^>') {
my @elems = split(/[>:\-\\(\\)]/, $_);
if (!defined $elems[4]) {
$elems[4] = "-";
}
$line[0] = $elems[1];
$line[1] = $elems[2];
$line[2] = $elems[3];
$line[3] = "id-".$idx++;
$line[4] = 0;
$line[5] = $elems[4];
print join("\t", @line)."\n";
}
}
Usage:
$ ./tfbs2bed.pl < tfbs.txt | sort-bed - > tfbs.bed
Then pad this BED file by whatever window you like, say a roughly 1000 base window that is centered on the binding site - in other words, 500 bases that are up- and downstream of the edges of each binding site:
$ bedops --range 500 --everything tfbs.bed > tfbs_1kpad.bed
Then do set operations to retrieve the genes associated with these windows.
First, depending on the genome you are working with, you might download a set of human (hg19) genes, using gtf2bed to make a BED file, e.g.:
$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gtf.gz \
| gunzip -c - \
| gtf2bed - \
| grep -w gene - \
> gencode.v21.genes.bed
Second, you could then look for any of these genes that overlap your TFBS windows:
$ bedops --element-of 1 gencode.v21.genes.bed tfbs_1kpad.bed > genes_overlapping_tfbs_1kpad.bed
As a rough TSS, you might grab the start position of the forward-stranded gene subset and then use bedops to pad the upstream range, say by 1000 bases (you can adjust padding to match the size of the promoter region you are interested in):
$ awk '$6=="+"' genes_overlapping_tfbs_1kpad.bed \
| awk '{$3=$2+1; print $0}' - \
| bedops --range 1000:0 --everything - \
> genes_overlapping_tfbs_1kpad.for.tss.upstream1k.bed
Then repeat this for the reverse strand, using the end position as the transcriptional start site of the gene, padding "downstream" (which is really upstream of the TSS, relative to strand orientation):
$ awk '$6=="-"' genes_overlapping_tfbs_1kpad.bed \
| awk '{$2=$3-1; print $0}' - \
| bedops --range 0:1000 --everything - \
> genes_overlapping_tfbs_1kpad.rev.tss.upstream1k.bed
Then union the two files into one BED file:
$ bedops --everything genes_overlapping_tfbs_1kpad.*.tss.upstream1k.bed > genes_overlapping_tfbs_1kpad.tss.upstream1k.bed
This final file contains 1k regions upstream of the TSS of any genes that overlap an approximately 1k window centered on each MEF2 binding site.