Question

Over-Represented Transcription Factor Binding Sites

13.1 years ago

Diana ▴ 930

Is there a way to identify statistically overrepresented Transcription Factor Binding Sites (TFBS) in a set of sequences compared against a control set using R?

transcription binding • 6.1k views

ADD COMMENT • link updated 3.1 years ago by Ram 44k • written 13.1 years ago by Diana ▴ 930

Have you already identified and classified the TFBS in your set of sequences and the control sequences? If not, there are a lot of HMM solutions for detecting TFBS. After you've identified and classified the TFBS, you can use a hypergeometric test to calculate the probability of overrepresentation.

ADD REPLY • link 13.1 years ago by Damian Kao 16k
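A minimal sketch of the hypergeometric test mentioned above, written in plain Python so the arithmetic is visible. The function name and the counts are hypothetical, chosen only for illustration. It assumes each sequence has already been classified as containing a given TFBS or not. In R the same upper-tail probability should be available as phyper(k - 1, K, N - K, n, lower.tail = FALSE).

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Return P(X >= k) for a hypergeometric draw.

    N: total sequences (foreground + control)
    K: sequences in the total that contain the motif
    n: number of foreground sequences
    k: foreground sequences that contain the motif
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(K, n)):
        raise ValueError("counts are inconsistent")
    total = comb(N, n)  # all ways to choose the foreground set
    # Sum the probability mass for k, k+1, ... motif-bearing sequences.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


# Hypothetical counts: 40 foreground sequences, 20 with the motif;
# 200 sequences overall, 50 with the motif.
p = hypergeom_enrichment_p(k=20, n=40, K=50, N=200)
print(f"p = {p:.3g}")
```

If many motifs are tested this way, the resulting p-values would normally need a multiple-testing correction before being interpreted.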

You should take a good look at BioStar questions 2949, 7150 and 11704 as they all pertain to this topic. The answer to your question is "yes," and the details that are relevant to your project are likely found attached to the above questions.

ADD REPLY • link 13.1 years ago by Larry_Parnell 16k

I haven't identified TFBS in my sequences yet. I have downloaded matrices from JASPAR, and I have the sequences in which I need to identify TFBS. One option for identifying the TFBS is Patser, but once I have the results, how do I use the hypergeometric test in R? I have no clue. Thank you DK and Larry for your answers :)

ADD REPLY • link 13.1 years ago by Diana ▴ 30

Ram · Answer 1 · 2011-11-23

Hi Diana

I presume that since you want to use R you're happy at the command line. If so, why not try Clover for detecting over-represented TFBS? You already have JASPAR matrices, so here's a quick guide:

Download Clover from http://zlab.bu.edu/clover and save it to the directory of your choice. If you're on Linux, make it executable (e.g. chmod +x clover.amd64). Download the JASPAR core motifs either with a browser or with wget:

wget -r --no-parent http://jaspar.genereg.net/html/DOWNLOAD/jaspar_CORE/non_redundant/by_tax_group/vertebrates/FlatFileDir

Clean up any html files that might be in this folder but leave the matrix_list.txt file. Put the file somewhere sensible. On the clover website there is a link to a perl script which will convert these matrices into a format suitable for clover.

Download this file (http://zlab.bu.edu/~mfrith/downloads/jaspar2fasta.pl) and make sure it is executable (and that you have Perl installed). To run the conversion, point the Perl script at the directory of matrices and redirect the output to a new file.

perl jaspar2fasta.pl FlatFileDir > jasparVertCoreMotifs.txt

Here FlatFileDir is the directory that contains the JASPAR matrices (.pfm files) and I write the conversion to jasparVertCoreMotifs.txt.
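For readers without Perl, the conversion can be sketched in Python. This is not the jaspar2fasta.pl script, only an illustration built on two assumptions: that each JASPAR .pfm file holds four whitespace-separated rows of counts in A, C, G, T order, and that Clover accepts a FASTA-like motif file with a ">" title line followed by one row of A C G T counts per motif position. The function name and the toy matrix are hypothetical.

```python
def pfm_to_clover(name, pfm_text):
    """Transpose a 4-row count matrix into one 'A C G T' row per position."""
    rows = [line.split() for line in pfm_text.splitlines() if line.strip()]
    if len(rows) != 4:
        raise ValueError("expected 4 rows (A, C, G, T)")
    if len({len(row) for row in rows}) != 1:
        raise ValueError("rows have different lengths")
    # zip(*rows) turns the four base rows into per-position columns.
    lines = [">" + name] + [" ".join(column) for column in zip(*rows)]
    return "\n".join(lines) + "\n"


# Toy two-position matrix, not a real JASPAR entry.
print(pfm_to_clover("demoMotif", "1 2\n3 4\n5 6\n7 8\n"), end="")
```

Check the result against the example motif files on the Clover website before relying on it.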

To use Clover you need a set of sequences to compare your sequences of interest to. The clover website has some available. Basically they're just fasta files of e.g. the 2000bp upstream of human genes (from UCSC). You could make your own easily.

Now you have your genes-of-interest file (yourGenes.fa), your background file (yourBG.fa) and your TFBS matrices (in Clover format).

We run Clover with:

./clover.amd64 -t 0.05 jasparVertCoreMotifs.txt yourGenes.fa yourBG.fa

The -t switch sets the p-value threshold for printing results. jasparVertCoreMotifs.txt is the motif file (in this case the JASPAR core vertebrate TFs), yourGenes.fa contains the sequences for the genes of interest, and yourBG.fa is the background file. We use only one background file here, although you could supply more than one.

We also need Clover to be in the directory we run it from, so I normally copy my fasta files and the Clover executable into one directory, cd into that directory and run Clover there.
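If you would rather drive the run from a script, the same call can be assembled in Python. This is only a sketch: the helper names are hypothetical, and the argument order simply mirrors the command shown above (threshold, motif file, sequences of interest, then one or more background files).

```python
import subprocess


def clover_command(motifs, sequences, backgrounds,
                   p_threshold=0.05, exe="./clover.amd64"):
    """Build the argument list: exe -t P motifs sequences bg1 [bg2 ...]."""
    return [exe, "-t", str(p_threshold), motifs, sequences, *backgrounds]


def run_clover(motifs, sequences, backgrounds, out_path, **kwargs):
    """Run Clover and write its standard output to out_path."""
    cmd = clover_command(motifs, sequences, backgrounds, **kwargs)
    with open(out_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


print(" ".join(clover_command("jasparVertCoreMotifs.txt",
                              "yourGenes.fa", ["yourBG.fa"])))
```

run_clover only works where the Clover executable is present, so it is defined here but not called.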

Finally, the output is not very friendly for import into spreadsheets etc. I don't know enough Perl to clean it up, so what I normally do is copy the significant-results part (you'll see it in the output file) into a new text file and then use awk to clean that up:

awk -v OFS="\t" 'NF { $1 = $1; print }' cloverOutput.txt > cloverOutputTabDelim.txt

Run the line above and the file cloverOutputTabDelim.txt will be tab delimited. It will still need some cleaning up though.
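Where awk is not available (e.g. on Windows), the same whitespace-to-tab cleanup can be sketched in Python. The function name and the sample line are hypothetical. The sketch drops blank lines and assumes that no field in the copied results contains a space.

```python
def to_tab_delimited(text):
    """Collapse runs of spaces/tabs into single tabs, skipping blank lines."""
    cleaned = []
    for line in text.splitlines():
        fields = line.split()  # splits on any run of whitespace
        if fields:
            cleaned.append("\t".join(fields))
    return "\n".join(cleaned) + ("\n" if cleaned else "")


# Made-up line standing in for a row copied from the Clover output.
print(to_tab_delimited("motifA    12.3   0.001\n"), end="")
```

To convert a file, read cloverOutput.txt into a string, pass it through the function and write the result to cloverOutputTabDelim.txt.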

The paper describing the clover algorithm is:

Martin C. Frith et al., “Detection of functional DNA motifs via statistical over-representation,” Nucleic Acids Research 32, no. 4 (March 15, 2004): 1372-1381.

Edits and additions based on comments:

You have to run clover at the command line. Clicking on the executable won't work. I don't use Windows much but recently ran this type of analysis on a Windows machine.

First I put all my relevant files into one directory (i.e. the Clover executable, the promoters for my genes of interest in a fasta file, my background promoters in a fasta file and the JASPAR matrices prepared as above). You might want to create a directory on your Desktop for this, as I did. Then I opened a terminal and changed into the relevant directory (from the C: directory):

cd /Users/yourAccount/Desktop/cloverTest

and ran clover with the following on the command line:

clover -t 0.05 jasparVertCoreMotifs2.txt fastaFileYourGenes.fa fastaFileYourBGGenes.fa > cloverResults.txt

This placed the output from clover into the cloverResults.txt file in the same directory clover was run from.

Windows doesn't come with a Perl installation (unlike many Unix-type operating systems), so to get your JASPAR matrices into the correct format you have a few choices. You could download the relevant file from the Clover website (but I think it's a little out of date). You could install a Windows version of Perl (I believe ActiveState Perl is popular) and use that, again from the command line. You could ask someone who runs Linux or another Unix (e.g. Mac OS X) to do the conversion for you, or use their machine to do it yourself following the directions above. Finally, you could use a Linux live CD to run Linux temporarily on your machine and do the matrix file conversion there; a live CD leaves your Windows files untouched. Google is your friend for using these!

HTH

duff

score 0 · Answer 2 · 2011-11-23

13.1 years ago

Diana ▴ 30

I haven't identified TFBS in my sequences yet. I have downloaded matrices from JASPAR, and I have the sequences in which I need to identify TFBS. One option for identifying the TFBS is Patser, but once I have the results, how do I use the hypergeometric test in R? I have no clue. Thank you DK and Larry for your answers :)

ADD COMMENT • link 13.1 years ago by Diana ▴ 30

Can you move this to a comment on the original question rather than an 'answer'?

ADD REPLY • link 13.1 years ago by Niallhaslam 2.3k

I mistakenly posted it as an answer and I can't find a way to delete it, but it's in the comments as well.

ADD REPLY • link 13.1 years ago by Diana ▴ 30

score 0 · Answer 3 · 2016-10-20

8.2 years ago

jin ▴ 80

PlantRegMap provides a tool to find the enriched TFs in provided sequences for plants.

ADD COMMENT • link 8.2 years ago by jin ▴ 80