Is there a way to identify statistically overrepresented Transcription Factor Binding Sites (TFBS) in a set of sequences compared against a control set using R?
Hi Diana
I presume that since you want to use R you're happy at the command line. If so, why not try Clover for detecting over-represented TFBS? You have JASPAR matrices, so here's a quick guide:
Download Clover from http://zlab.bu.edu/clover and save it to the directory of your choice. If you're on Linux, make it executable (e.g. chmod +x clover.amd64). Download the JASPAR core motifs either with a browser or with wget:
wget -r --no-parent http://jaspar.genereg.net/html/DOWNLOAD/jaspar_CORE/non_redundant/by_tax_group/vertebrates/FlatFileDir
Clean up any HTML files that might be in this folder but leave the matrix_list.txt file. Put the folder somewhere sensible. On the Clover website there is a link to a Perl script which will convert these matrices into a format suitable for Clover.
Download this file (http://zlab.bu.edu/~mfrith/downloads/jaspar2fasta.pl) and make sure it is executable (and that you have Perl installed). To run the conversion, all you need to do is point the Perl script at the directory of matrices and redirect the output to a new file.
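The clean-up step can be scripted. Here is a minimal sketch, assuming the wget run left the usual index.html pages (including ones with query strings in the name) next to the .pfm files. The function name clean_jaspar_dir is made up for illustration; it is not part of Clover or JASPAR.

```shell
# clean_jaspar_dir DIR (hypothetical helper)
# Deletes HTML index pages left behind by a recursive wget.
# The .pfm matrices and matrix_list.txt are not touched.
clean_jaspar_dir() {
  find "$1" -type f \( -name '*.html' -o -name 'index.html*' \) -exec rm -f {} +
}
```

You would call it as clean_jaspar_dir FlatFileDir, pointing it at wherever wget actually saved the matrices.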
perl jaspar2fasta.pl FlatFileDir > jasparVertCoreMotifs.txt
Here FlatFileDir is the directory that contains the JASPAR matrices (.pfm files) and I write the conversion to jasparVertCoreMotifs.txt.
To use Clover you need a set of background sequences to compare your sequences of interest against. The Clover website has some available. Basically they're just FASTA files of e.g. the 2000 bp upstream of human genes (from UCSC). You could easily make your own.
Now you have your genes of interest file (yourGenes.fa), your background file (yourBG.fa) and your TFBS matrices (in Clover format).
We run Clover with:
./clover.amd64 -t 0.05 jasparVertCoreMotifs.txt yourGenes.fa yourBG.fa
The -t switch sets the p-value threshold for printing results. jasparVertCoreMotifs.txt is the motif file (in this case the JASPAR core vertebrate TFs), yourGenes.fa contains the sequences for the genes of interest, and yourBG.fa is the background file. We use only one background file here, although you could supply more than one.
We also need Clover to be in the directory we run it from, so I normally copy my FASTA files and the Clover executable into one directory, cd into that directory and run Clover there.
Finally, the output is not very friendly for import into spreadsheets etc. I don't know enough Perl to clean it up, but what I normally do is copy the significant results part (you'll see it in the output file) into a new text file and then use awk to clean that up:
awk -v OFS="\t" 'NF { $1 = $1; print }' cloverOutput.txt > cloverOutputTabDelim.txt
Run the line above and the file cloverOutputTabDelim.txt will be tab delimited. It will still need some cleaning up though.
The paper describing the clover algorithm is:
Martin C. Frith et al., “Detection of functional DNA motifs via statistical over‐representation,” Nucleic Acids Research 32, no. 4 (March 15, 2004): 1372-1381.
Edits and additions based on comments:
You have to run Clover at the command line. Clicking on the executable won't work. I don't use Windows much but recently ran this type of analysis on a Windows machine.
First I put all my relevant files into one directory (i.e. the Clover executable, the promoters for my genes of interest in a FASTA file, my background promoters in a FASTA file and the JASPAR matrices prepared as above). You might want to create a directory on your Desktop for this, as I did. Then I opened a terminal and changed into the relevant directory (from the C: directory):
cd /Users/yourAccount/Desktop/cloverTest
and ran Clover with the following on the command line:
clover -t 0.05 jasparVertCoreMotifs2.txt fastaFileYourGenes.fa fastaFileYourBGGenes.fa > cloverResults.txt
This placed the output from Clover into the cloverResults.txt file in the same directory Clover was run from.
Windows doesn't come with a Perl installation (unlike many Unix-type OSes), so to get your JASPAR matrices into the correct format you have a few choices. You could download the relevant file from the Clover website (but I think it's a little out of date). You could install a Windows version of Perl (I believe ActiveState Perl is popular) and use that, again from the command line. You could ask someone who runs Linux or another Unix (e.g. Mac OS X) to do the conversion for you, or use their machine to do it yourself following the directions above. Finally, you could use a Linux live CD to temporarily run Linux on your machine and do the matrix file conversion there; a live CD will leave your Windows files untouched. Google is your friend for using these!
HTH
duff
Thank you duff. I'm not very good at the command line but I'll try. I'm using Windows. I tried downloading the Windows executable version of Clover but it doesn't do anything when I run the executable file. Should I download the source code and follow the steps that you've mentioned?
Hi, I realise this was quite a long time ago but I'm at the end of my tether!
I have done everything you have described above; however, Clover is saying it doesn't recognise my matrix file. The file is in the following format:
>MA0001.1 AGL3
0 94 1 2
3 75 0 19
79 4 3 11
40 3 4 50
66 1 1 29
48 2 0 47
65 5 5 22
11 2 3 81
65 3 28 1
0 3 88 6
which, to my knowledge, should be recognisable. Do you have any advice?
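One thing that may be worth ruling out is a formatting problem that isn't visible in a text editor, such as Windows line endings or a row with the wrong number of columns. Below is a rough sketch of a checker, assuming the layout shown above (a ">" header line, then one row of four numbers per motif position). The function name check_clover_motifs is hypothetical, and this only tests the file against that assumed layout; it says nothing definitive about what Clover itself accepts.

```shell
# check_clover_motifs FILE (hypothetical helper)
# Reports lines that end in a carriage return, or that are neither a
# ">name" header nor a row of exactly four non-negative numbers.
# Blank lines are skipped. Exit status is 1 if any problem was found.
check_clover_motifs() {
  awk '
    /\r$/   { printf "line %d: Windows (CRLF) line ending\n", NR; bad = 1; next }
    /^>/    { next }
    NF == 0 { next }
    NF != 4 { printf "line %d: expected 4 columns, found %d\n", NR, NF; bad = 1; next }
    {
      for (i = 1; i <= 4; i++)
        if ($i !~ /^[0-9]+(\.[0-9]+)?$/) {
          printf "line %d: non-numeric value \"%s\"\n", NR, $i
          bad = 1
          break
        }
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

Running check_clover_motifs jasparVertCoreMotifs.txt prints nothing if every line fits the assumed layout.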
I haven't identified TFBS in my sequences as yet. I have downloaded matrices from JASPAR and I have the sequences in which I have to identify TFBS. One of the options for identifying the TFBS is Patser, but once I have the results, how do I use the hypergeometric test in R? I have no clue. Thank you DK and Larry for your answers :)
PlantRegMap provides a tool for finding the TFs enriched in a provided set of sequences, for plants.
Have you already identified and classified the TFBS in your set of sequences and the control sequences? If not, there are a lot of HMM solutions for detecting TFBS. After you've identified and classified the TFBS, you can use a hypergeometric test to calculate the probability of overrepresentation.
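To make the hypergeometric step concrete, here is a minimal sketch of the upper-tail probability, written for the command line with awk. The setup is an assumption on my part: each sequence is counted once as either containing the motif or not, N is the total number of sequences (target plus control), K is how many of those contain the motif, n is the size of the target set and k is how many target sequences contain the motif. The function name hyper_upper_tail is made up for illustration. In R the same quantity is phyper(k - 1, K, N - K, n, lower.tail = FALSE).

```shell
# hyper_upper_tail N K n k (hypothetical helper)
# Prints P(X >= k) for a hypergeometric draw of n sequences from N,
# of which K contain the motif. Uses logs to avoid huge binomials.
hyper_upper_tail() {
  LC_ALL=C awk -v N="$1" -v K="$2" -v n="$3" -v k="$4" '
    # log of "a choose b", valid for 0 <= b <= a
    function lchoose(a, b,    i, s) {
      s = 0
      for (i = 1; i <= b; i++) s += log(a - b + i) - log(i)
      return s
    }
    BEGIN {
      lo = k
      if (lo < n - (N - K)) lo = n - (N - K)   # smallest feasible count
      if (lo < 0) lo = 0
      hi = (K < n) ? K : n                     # largest feasible count
      denom = lchoose(N, n)
      p = 0
      for (i = lo; i <= hi; i++)
        p += exp(lchoose(K, i) + lchoose(N - K, n - i) - denom)
      printf "%.6g\n", p
    }'
}
```

For example, hyper_upper_tail 10 5 5 4 asks how likely it is to see 4 or more motif-containing sequences in a target set of 5, when 5 of the 10 sequences overall contain the motif. If you test many motifs this way, the p-values will need a multiple-testing correction.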
You should take a good look at BioStar questions 2949, 7150 and 11704 as they all pertain to this topic. The answer to your question is "yes," and the details that are relevant to your project are likely found attached to the above questions.