Convert gene annotations from GFF to 3k-padded windows in BED via convert2bed:
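$ awk '$3 == "gene"' annotations.gff \
| convert2bed -i gff - \
| grep -wFf genes.txt - \
| awk -v window=3000 -v OFS="\t" '
    # forward strand: window upstream of the start, clamped at 0 to avoid negative coordinates
    ($6 == "+") { s = $2 - window; if (s < 0) s = 0; print $1, s, $2, $4, ".", $6, $7, $8, $9, $10 }
    # reverse strand: window just past the end coordinate
    ($6 == "-") { print $1, $3, ($3 + window), $4, ".", $6, $7, $8, $9, $10 }' \
> promoters.bed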
The file genes.txt contains your list of genes of interest, one name per line. grep uses it to keep only the annotations that match those genes.
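To see what the filtering and windowing steps do, here is a minimal sketch that runs them on a made-up three-gene BED input instead of real convert2bed output. The gene names, coordinates, and the function names toy_bed and promoter_windows are hypothetical, and the output is cut down to six columns. The window logic is the strand-aware one from the pipeline, with the start clamped at zero as a safeguard.

```shell
#!/bin/sh
# Toy stand-in for convert2bed output (hypothetical genes and coordinates).
# Columns: chrom, start, end, id, score, strand
toy_bed() {
    printf 'chr1\t5000\t6000\tGENE_A\t.\t+\n'
    printf 'chr1\t9000\t9500\tGENE_B\t.\t-\n'
    printf 'chr2\t7000\t8000\tGENE_C\t.\t+\n'
}

# Strand-aware 3 kb window, reduced to six output columns.
promoter_windows() {
    awk -v window=3000 -v OFS="\t" '
        # forward strand: window ends at the gene start, clamped at 0
        ($6 == "+") { s = $2 - window; if (s < 0) s = 0; print $1, s, $2, $4, ".", $6 }
        # reverse strand: window begins at the gene end coordinate
        ($6 == "-") { print $1, $3, ($3 + window), $4, ".", $6 }'
}

# Genes of interest, one name per line (hypothetical list).
genes=$(mktemp)
printf 'GENE_A\nGENE_B\n' > "$genes"

# -w: whole-word match, -F: fixed strings, -f: read patterns from a file.
toy_bed | grep -wFf "$genes" - | promoter_windows
rm -f "$genes"
```

With this input, GENE_A (forward strand) becomes chr1 2000 5000, GENE_B (reverse strand) becomes chr1 9500 12500, and GENE_C is dropped by grep because it is not in the list.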
Convert promoters.bed to promoters.fa via samtools faidx, a set of reference genome FASTA files, and a helper script like bed2fastaidx.pl, e.g.:
$ /path/to/bed2fastaidx.pl --fastaIsUncompressed --fastaDir=/path/to/genome/fasta < promoters.bed > promoters.fa
This script is available on Github Gist: https://bit.ly/2nCvej2
(Note to other mods: I'm using a URL shortener because a Gist link would otherwise be expanded and pasted directly into the answer.)
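If the helper script is not at hand, one possible alternative is to hand the windows to samtools faidx yourself. The sketch below is not what bed2fastaidx.pl does internally; it only shows the coordinate conversion that step needs, since BED intervals are 0-based and half-open while samtools region strings are 1-based and inclusive. The function name bed_to_regions and the paths genome.fa and promoters.regions are hypothetical, and the samtools call is left as a comment because it assumes a reasonably recent samtools and a single genome FASTA. Note that it does not reverse-complement reverse-strand windows.

```shell
#!/bin/sh
# Convert BED intervals (0-based, half-open) to samtools region strings
# (1-based, inclusive): chrom:(start + 1)-end
bed_to_regions() {
    awk '{ printf "%s:%d-%d\n", $1, $2 + 1, $3 }'
}

# Demo on one made-up interval.
printf 'chr1\t2000\t5000\tGENE_A\t.\t+\n' | bed_to_regions

# With samtools installed and a real promoters.bed, the regions could then
# be passed to faidx (genome.fa is a hypothetical path):
#   bed_to_regions < promoters.bed > promoters.regions
#   samtools faidx genome.fa -r promoters.regions > promoters.fa
```

The demo line prints chr1:2001-5000, which covers the same 3,000 bases as the BED interval 2000 to 5000.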
Once you have this FASTA file, you can scan it with a tool like FIMO, using a TF model database such as JASPAR, UniPROBE, or TRANSFAC, to find binding sites. Alternatively, you can use MEME to discover novel motif models and TOMTOM to compare them against existing, published TF model databases. HOMER is another tool people use for novel motif discovery and comparison against published models.
I've found several earlier posts on this topic; some of them look useful:
Transcription Factor Binding Site Prediction
Advantages of Biobase's TRANSFAC over and above what is freely available?
Transcription factor binding site prediction program/algorithm
Convert binding site formats from different databases and compare the motifs - any tools available?
Best tool to find potential TF binding sites within a specific DNA sequence?