I was wondering if there is a package available in R for identifying statistically overrepresented Transcription Factor Binding Sites (TFBS) in a set of sequences compared against a control set. Something like F-match.
In my search so far I have not found anything useful.
Actually, your question is not clear. Have you found the TFBS already and counted them? Or do you want to do pattern discovery in the sequences first? Please clarify.
For every type of TFBS, do you want to compare the number of binding sites in one set with the number in another set? So essentially one number against another? If that is the case, there is no way you can evaluate this statistically. You need multiple sets to compare. Or you might be able to ask questions about sets of TFBS (say, all TFBS from one GO class) compared to other sets.
I have to disagree with you on that not being possible. As much as I agree with the requirement of replication, there actually is replication in discrete count data, so a test is perfectly feasible.
Michael, I do not understand. If I have two counts, say 425 and 240, what would the errors be? You probably mean I can calculate the errors while counting? How would I do that? I will happily delete this answer if it is wrong, but I would first like to understand.
Reading your own answer, I think what you are trying to do is determine the statistics for TF binding motifs across multiple genes, counting every TFBS separately and then giving the significance for the motif, not the site. That is of course possible, but I am not sure whether that was the question.
True, you can determine that the two samples are not (likely to be) the same. Although in reality you would not know whether that difference was caused by the treatment. If the results were completely random you would also find a difference between the two samples. Of course a simple test assumes that the only difference between the groups would be treatment. A model test would include a random factor which you would not be able to estimate. But more to the point: I thought he wanted to look for statistically overrepresented transcription factors not sample differences?
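To make the 425 versus 240 example concrete, here is a minimal sketch of the kind of test being discussed. It assumes (and this is an assumption, not something stated in the thread) that the two counts come from independent sites with equal opportunity in both sets, for example the same total sequence length. Conditional on the total, the first count is then Binomial(n, 0.5) under the null. The function name is made up for illustration, and Python's standard library is used only to keep the sketch self-contained.

```python
from math import comb, sqrt

def two_count_binomial_test(a, b):
    """Exact two-sided test that counts a and b arise from equal rates.

    Assumption: conditional on n = a + b, the count a follows
    Binomial(n, 0.5) under the null hypothesis of equal rates.
    Returns (p_value, z), where z is the normal-approximation score.
    """
    n = a + b
    k = max(a, b)
    # Upper tail P(X >= k) for X ~ Binomial(n, 0.5), using exact integers.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # The null distribution is symmetric, so double the tail (capped at 1).
    p_value = min(1.0, 2 * upper_tail)
    # z-score: (observed - expected) / standard deviation under the null.
    z = (a - n / 2) / sqrt(n / 4)
    return p_value, z

p_value, z = two_count_binomial_test(425, 240)
```

As noted above, a small p-value here only says the two counts are unlikely to share one rate. It says nothing about what caused the difference.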
In that case you could maybe use a one-against-all Fisher test for each TF class, where you tabulate TF1 against the sum of all other TFs (i.e. not TF1). However, I fear I don't really understand what Lisanne really wants.
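A minimal sketch of that one-against-all idea, under the assumption that you already have per-TF site counts for a test set and a control set. The counts and function names below are hypothetical, purely for illustration. The one-sided Fisher p-value is computed directly as a hypergeometric upper tail; in R the same 2x2 test is available as `fisher.test(..., alternative = "greater")`.

```python
from math import comb

def fisher_overrep_p(test_tf, test_other, ctrl_tf, ctrl_other):
    """One-sided Fisher exact p-value for over-representation in the test set.

    2x2 table:            this TF   all other TFs
        test set          test_tf   test_other
        control set       ctrl_tf   ctrl_other
    Returns P(X >= test_tf) with all margins fixed (hypergeometric tail).
    """
    n_test = test_tf + test_other          # row total: sites in the test set
    n_tf = test_tf + ctrl_tf               # column total: sites of this TF
    total = n_test + ctrl_tf + ctrl_other  # grand total
    upper = min(n_test, n_tf)
    tail = sum(
        comb(n_tf, k) * comb(total - n_tf, n_test - k)
        for k in range(test_tf, upper + 1)
    )
    return tail / comb(total, n_test)

def one_vs_all(test_counts, ctrl_counts):
    """Test each TF against the sum of all other TFs ("not this TF")."""
    test_total = sum(test_counts.values())
    ctrl_total = sum(ctrl_counts.values())
    results = {}
    for tf in sorted(set(test_counts) | set(ctrl_counts)):
        a = test_counts.get(tf, 0)
        c = ctrl_counts.get(tf, 0)
        results[tf] = fisher_overrep_p(a, test_total - a, c, ctrl_total - c)
    return results

# Hypothetical counts, for illustration only.
test_counts = {"TF1": 30, "TF2": 10, "TF3": 10}
ctrl_counts = {"TF1": 20, "TF2": 40, "TF3": 40}
p_values = one_vs_all(test_counts, ctrl_counts)
```

With many TF classes you would also want to correct the resulting p-values for multiple testing (in R, `p.adjust`).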
I don't believe that there is very much available in R for this task. Bioconductor has a few packages for experimental TFBS data such as:
rMAT - identification of regions enriched for transcription factor binding sites in ChIP-chip experiments on Affymetrix tiling arrays
iChip - identify enriched genomic regions in ChIP-chip data from multiple platforms (e.g. Affymetrix, Agilent, and NimbleGen)
BCRANK - predicting binding site consensus from ranked DNA sequences
But none of those immediately lend themselves to your problem. The best solution is probably to read up on the statistical methods used (which are rarely very complex) and implement them yourself (or ask a statistician friend for help).
The PAINT promoter analysis tool (here) will do this sort of analysis. It can take either a gene list or a FASTA-formatted promoter file. It will then compare the occurrence counts within your gene list to a background set.
Yes, this problem is perfectly solvable using basic tests in R, as it can be reduced to a problem of independence of discrete counts in a contingency table.
This assumes you have labeled all your TFBS in both groups consistently by transcription factor and have assigned counts for each binding site in the control group (counts on the whole genome) and in the test group. You can then tabulate the counts and then
I think you test groups of sites (motifs) here, not individual sites. See the comments with my own answer. Maybe your answer really is the intended one, but I think that in that case the question should be reformulated.
I have used DAVID for the enrichment analysis of TFs. Once you have uploaded your gene list, you have the option to perform enrichment calculations using a variety of annotations. DAVID provides a category called "UCSC_TFBS"; in the default settings this option is unchecked. Click and expand "Protein_Interactions", then tick "UCSC_TFBS".
PS. I am not sure about the source of this data set. So far I have only found this link to the source data on the DAVID forum. It would be great if someone could tell me more about the data source.
Does this question here help?
This should be a comment, not an answer, I guess!
Cross-posted from http://stackoverflow.com/questions/5537666/overrepresented-transcription-factor-binding-sites-tfbs. We don't mind but we like to know :-)