Question

How to search for TF binding sites upstream of specific genes?

6.9 years ago

a.rex ▴ 350

I have a reference genome and set of annotations in a gff file (which of course I can easily convert to a fasta/bed etc).

I have a list of candidate genes from which I want to extract the 3 kb upstream sequence.

I then want to run this upstream region through some TF prediction software to look for putative TF binding sites (i.e. potential enhancers).

Does anyone have a recommendation for which software to use for this? Also, how can I extract the upstream region?

sequence • 2.2k views

updated 6.9 years ago by Alex Reynolds 36k • written 6.9 years ago by a.rex ▴ 350

I’ve found several posts about it. Some of them look useful.

Transcription Factor Binding Site Prediction

Advantages of Biobase's TRANSFAC over and above what is freely available?

Transcription factor binding site prediction program/algorithm

Convert binding site formats from different databases and compare the motifs - any tools available?

Best tool to find potential TF binding sites within a specific DNA sequence?

6.9 years ago by natasha.sernova ★ 4.0k

score 5 · Answer 1 · 2018-08-13

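Convert the gene annotations from GFF to BED with convert2bed, keep your genes of interest, and turn each gene into a strand-aware 3 kb upstream window:

$ awk '$3 == "gene"' annotations.gff \
    | convert2bed -i gff - \
    | grep -wFf genes.txt - \
    | awk -v window=3000 -v OFS="\t" '
        # + strand: window ends at the gene start; clamp the window start at 0
        ($6 == "+") { s = $2 - window; if (s < 0) s = 0; print $1, s, $2, $4, ".", $6, $7, $8, $9, $10 }
        # - strand: upstream lies past the gene end, so the window begins there
        ($6 == "-") { print $1, $3, ($3 + window), $4, ".", $6, $7, $8, $9, $10 }' \
    > promoters.bed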

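The file genes.txt contains the identifiers of your genes of interest, one per line. grep -wFf keeps only the annotation lines that contain one of those identifiers as a whole word, so check that the identifiers match the names used in your GFF attributes.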
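
To see what the filter and the window arithmetic do, here is a small self-contained sketch on made-up records. The gene names, coordinates, and temporary file names are invented for illustration; in the real pipeline the BED lines come from convert2bed. The sketch clamps the window start at 0 so that a gene near the start of a chromosome does not produce a negative coordinate.

```shell
# Toy BED records: chrom, start, end, id, score, strand (all values are made up)
workdir=$(mktemp -d)
printf 'chr1\t1000\t2000\tgeneA\t.\t+\n'  >  "$workdir/genes.bed"
printf 'chr1\t5000\t6000\tgeneAB\t.\t+\n' >> "$workdir/genes.bed"
printf 'chr2\t8000\t9000\tgeneC\t.\t-\n'  >> "$workdir/genes.bed"

# Gene IDs of interest, one per line
printf 'geneA\ngeneC\n' > "$workdir/genes.txt"

# -F: fixed strings, -w: whole-word matches only (geneA does not match geneAB),
# -f: read the patterns from a file
windows=$(grep -wFf "$workdir/genes.txt" "$workdir/genes.bed" \
    | awk -v window=3000 -v OFS="\t" '
        ($6 == "+") { s = $2 - window; if (s < 0) s = 0; print $1, s, $2, $4, ".", $6 }
        ($6 == "-") { print $1, $3, $3 + window, $4, ".", $6 }')
printf '%s\n' "$windows"
rm -rf "$workdir"
```

Note that grep matches anywhere on the line, so an identifier that also appears inside another gene's attribute column would pull in that line as well. The minus-strand window is not clamped here, so it can run past the end of a short chromosome or scaffold.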

Convert promoters.bed to promoters.fa using samtools faidx, your reference genome FASTA files, and a helper script such as bed2fastaidx.pl, e.g.:

$ /path/to/bed2fastaidx.pl --fastaIsUncompressed --fastaDir=/path/to/genome/fasta < promoters.bed > promoters.fa

This script is available on Github Gist: https://bit.ly/2nCvej2

(Note to other mods: I'm using a URL shortener, as Gist code is otherwise pasted directly into the answer.)
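
If you would rather call samtools faidx yourself, the main thing to watch is the coordinate convention: BED intervals are 0-based and half-open, while samtools region strings are 1-based and inclusive. The sketch below converts made-up BED windows into region strings, split by strand. The commented-out samtools calls are an assumption about how you might use them (genome.fa is a hypothetical indexed reference, and the -i reverse-complement option only exists in recent samtools releases, so check the help for your version).

```shell
# Toy promoter windows in BED (made-up coordinates): chrom, start, end, id, score, strand
bed=$(printf 'chr1\t0\t1000\tgeneA\t.\t+\nchr2\t9000\t12000\tgeneC\t.\t-\n')

# BED start is 0-based and the end is exclusive; a region string is 1-based and
# inclusive, so only the start shifts by one.
plus_regions=$(printf '%s\n' "$bed"  | awk '$6 == "+" { print $1 ":" ($2 + 1) "-" $3 }')
minus_regions=$(printf '%s\n' "$bed" | awk '$6 == "-" { print $1 ":" ($2 + 1) "-" $3 }')
printf 'plus:  %s\nminus: %s\n' "$plus_regions" "$minus_regions"

# Assumed follow-up, not run here (needs samtools and an indexed genome.fa):
#   samtools faidx genome.fa $plus_regions     >  promoters.fa
#   samtools faidx -i genome.fa $minus_regions >> promoters.fa   # -i: reverse complement
```

Splitting by strand matters because minus-strand promoters need to be reverse-complemented if you want every sequence to read 5' to 3' relative to its gene.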

Once you have this FASTA file, you can scan it with a tool like FIMO, using a TF motif database such as JASPAR, UniPROBE, or TRANSFAC, to find matches to known binding-site models. Alternatively, you can use MEME to discover novel motifs in your sequences and TOMTOM to compare them against existing, published TF motif databases. HOMER is another tool people use for novel motif discovery and comparison against published models.
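
As a rough sketch of the FIMO route: the commented-out command shows one plausible invocation, where motifs.meme is a hypothetical motif file in MEME format and fimo_out is an arbitrary output directory. The runnable part filters a FIMO-style TSV by q-value. The table is a made-up stand-in with only some of the columns FIMO writes, and the 0.05 cutoff is an arbitrary choice for illustration, not a recommendation.

```shell
# Assumed invocation, not run here (needs the MEME Suite installed):
#   fimo --oc fimo_out motifs.meme promoters.fa

# Toy stand-in for fimo_out/fimo.tsv; every value below is made up.
tsv=$(printf 'motif_id\tsequence_name\tstart\tstop\tstrand\tscore\tp-value\tq-value\n'
      printf 'MOTIF1\tgeneA\t120\t131\t+\t15.2\t1e-06\t0.01\n'
      printf 'MOTIF2\tgeneA\t400\t409\t-\t9.8\t8e-05\t0.30\n'
      printf 'MOTIF1\tgeneC\t77\t88\t+\t14.1\t3e-06\t0.04\n')

# Locate the q-value column by its header name rather than a fixed position,
# skip comment and blank lines, and keep hits at or below the cutoff.
hits=$(printf '%s\n' "$tsv" | awk -F'\t' -v cutoff=0.05 '
    /^#/ || NF == 0 { next }
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "q-value") q = i; print; next }
    q && ($q + 0) <= cutoff')
printf '%s\n' "$hits"
```

Keep in mind that a 3 kb window will contain many chance matches to short motifs, so predicted sites are candidates to follow up rather than evidence of binding.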