Question

Overrepresented Transcription Factor Binding Sites (TFBS)

2

13.7 years ago

Lisanne ▴ 30

Hi All,

I was wondering if there is a package available in R for identifying statistically overrepresented Transcription Factor Binding Sites (TFBS) in a set of sequences compared against a control set. Something like F-match.

In my own search I have not found anything useful.

Thanks for all answers

r enrichment transcription binding • 9.2k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 13.7 years ago by Lisanne ▴ 30

1

Does this question here help?

ADD REPLY • link updated 5.3 years ago by Ram 44k • written 13.7 years ago by Dave Bridges ★ 1.4k

0

This should be a comment, not an answer, I guess!

ADD REPLY • link 13.7 years ago by Thaman ★ 3.3k

0

Cross-posted from http://stackoverflow.com/questions/5537666/overrepresented-transcription-factor-binding-sites-tfbs. We don't mind but we like to know :-)

ADD REPLY • link updated 5.3 years ago by Ram 44k • written 13.7 years ago by Neilfws 49k

0

Actually, your question is not clear. Have you already found the TFBS and counted them, or do you want to do pattern discovery in the sequences first? Please clarify.

ADD REPLY • link 13.7 years ago by Michael 55k

Ram · Answer 1 · 2011-04-04

2

13.7 years ago

Chris Evelo 10k

For every type of TFBS, you want to compare the number of binding sites in one set with the number in another set? So essentially one number against one other number? If that is the case, there is no way you can evaluate this statistically. You need multiple sets to compare. Or you might be able to ask questions about sets of TFBS (say, all TFBS from one GO class) compared to other sets.

ADD COMMENT • link 13.7 years ago by Chris Evelo 10k

0

I have to disagree with you on that not being possible. As much as I agree with the requirement of replication, there actually is replication in discrete count data, so this is perfectly feasible.

ADD REPLY • link 13.7 years ago by Michael 55k

0

Michael, I do not understand. If I have two counts, say 425 and 240, what would the errors be? You probably mean I can calculate the errors while counting? How would I do that? I will happily delete this answer if it is wrong, but I would first like to understand. Reading your own answer, I think what you are trying to do is determine the statistics for TF binding motifs in multiple genes, counting every TFBS separately and then giving the significance for the motif, not the site. That is of course possible, but I am not sure whether that was the question.

ADD REPLY • link 13.7 years ago by Chris Evelo 10k

0

If you have at least 2 different TFs, you have 4 counts, not 2. Then you can make a 2x2 contingency table, pretty much like the example in the Wikipedia article: http://en.wikipedia.org/wiki/Fisher%27s_exact_test#Example

To make it clearer, replace 'dieting' and 'not dieting' with 'TF1' and 'TF2', and 'men' and 'women' with 'control' and 'test set'.

That is all it takes to test the null hypothesis that the counts are independent of the group.
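
As a rough illustration of that 2x2 test, here is a minimal sketch in plain Python rather than R. It computes a two-sided Fisher's exact p-value from the hypergeometric distribution with fixed margins. The function name is hypothetical, and of the counts only 425 and 240 come from the comment above; the TF2 counts are invented for the example. In R, calling fisher.test on the same matrix would be the direct route.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)   # smallest feasible top-left count
    highest = min(row1, col1)      # largest feasible top-left count
    p_observed = prob(a)
    # Sum over every table that is at most as probable as the observed one
    # (small relative tolerance guards against floating-point ties)
    total = sum(p for p in (prob(x) for x in range(lowest, highest + 1))
                if p <= p_observed * (1 + 1e-7))
    return min(1.0, total)

# Rows: TF1, TF2; columns: control, test set.
# 425 and 240 are from the thread; 300 and 310 are made-up TF2 counts.
print(fisher_exact_two_sided(425, 240, 300, 310))
```

A small p-value would suggest that the proportion of TF1 versus TF2 sites differs between the control and the test set.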

ADD REPLY • link updated 5.3 years ago by Ram 44k • written 13.7 years ago by Michael 55k

0

True, you can determine that the two samples are not (likely to be) the same. Although in reality you would not know whether that difference was caused by the treatment. If the results were completely random you would also find a difference between the two samples. Of course a simple test assumes that the only difference between the groups would be treatment. A model test would include a random factor which you would not be able to estimate. But more to the point: I thought he wanted to look for statistically overrepresented transcription factors not sample differences?

ADD REPLY • link 13.7 years ago by Chris Evelo 10k

0

In that case you could maybe use a one-against-all Fisher test for each TF class, where you tabulate TF1 against the sum of all other TFs (i.e. not TF1). However, I fear I don't really understand what Lisanne wants.
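
A one-against-all scheme could be sketched roughly as follows, again in plain Python as an illustration only. For each TF it builds the table "this TF vs. all other TFs" by "test set vs. control" and computes a one-sided (overrepresentation) Fisher p-value as a hypergeometric upper tail. The function names and all counts are hypothetical. In R, the equivalent would presumably be fisher.test with alternative = "greater" on each table.

```python
from math import comb

def upper_tail(k, draws, successes, population):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    denom = comb(population, draws)
    top = min(draws, successes)
    numer = sum(comb(successes, x) * comb(population - successes, draws - x)
                for x in range(k, top + 1))
    return numer / denom

def one_vs_rest(test_counts, control_counts):
    """One-sided Fisher p-value per TF: is it overrepresented in the test set?"""
    test_total = sum(test_counts.values())
    control_total = sum(control_counts.values())
    population = test_total + control_total
    pvals = {}
    for tf, k in test_counts.items():
        # Sites for this TF across both sets form the "successes"
        successes = k + control_counts.get(tf, 0)
        pvals[tf] = upper_tail(k, test_total, successes, population)
    return pvals

# Hypothetical site counts per transcription factor
test_counts = {"TF1": 30, "TF2": 10, "TF3": 10}
control_counts = {"TF1": 100, "TF2": 200, "TF3": 200}
pvals = one_vs_rest(test_counts, control_counts)
for tf, p in sorted(pvals.items()):
    print(tf, p)
```

With these made-up numbers TF1 makes up a much larger share of the test set than of the control, so it should come out with the smallest p-value.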

ADD REPLY • link 13.7 years ago by Michael 55k

score 2 · Answer 2 · 2011-04-04

2

13.7 years ago

Neilfws 49k

I don't believe that there is very much available in R for this task. Bioconductor has a few packages for experimental TFBS data such as:

rMAT - identification of regions enriched for transcription factor binding sites in ChIP-chip experiments on Affymetrix tiling arrays
iChip - identify enriched genomic regions in ChIP-chip data from multiple platforms (e.g. Affymetrix, Agilent, and NimbleGen)
BCRANK - predicting binding site consensus from ranked DNA sequences

But none of those immediately lend themselves to your problem. The best solution is probably to read up on the statistical methods used (which are rarely very complex) and implement them yourself (or ask a statistician friend for help).

ADD COMMENT • link 13.7 years ago by Neilfws 49k

0

http://bioconductor.org/packages/release/bioc/html/goseq.html provides gene set enrichment-style analysis tailored to count based data like that from sequencing experiments; perhaps that can be adapted?

ADD REPLY • link 13.7 years ago by Martin Morgan ★ 1.6k

0

There is no need to implement this yourself, it's all there.

ADD REPLY • link 13.7 years ago by Michael 55k

0

What's all where? :-)

ADD REPLY • link 13.7 years ago by Neilfws 49k

score 0 · Answer 3 · 2011-04-04

0

13.7 years ago

Will 4.6k

The PAINT promoter analysis tool (here) will do this sort of analysis. It can take either a gene list or fasta formatted promoter file. Then it will compare the occurrence counts within your gene-list to a background set.

ADD COMMENT • link 13.7 years ago by Will 4.6k

Ram · Answer 4 · 2011-04-04

0

13.7 years ago

Michael 55k

Yes, this problem is perfectly solvable using basic tests in R, as it can be reduced to a problem of independence of discrete counts in a contingency table.
Assuming you have labeled all your TFBS in both groups consistently by transcription factor, assign counts for the control group (counts on the whole genome) and for the test group to each type of binding site. You can then tabulate the counts, and then these two tests come to my mind:

Fisher's exact test
Pearson's chi-squared test

Both are implemented as basic R functions: fisher.test and chisq.test.
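
To make the tabulation step concrete, here is a minimal sketch in plain Python (not R) of one possible data layout: one (transcription factor, group) pair per labeled site, tabulated into a TF-by-group contingency table, followed by the Pearson chi-squared statistic. The function names, labels, and counts are all hypothetical. The statistic is computed without continuity correction, so for a 2x2 table it would correspond to chisq.test(..., correct = FALSE) in R; the p-value then comes from the chi-squared distribution with the returned degrees of freedom.

```python
from collections import Counter

def contingency_table(sites):
    """Tabulate (tf, group) pairs into rows = TFs, columns = groups."""
    counts = Counter(sites)
    tfs = sorted({tf for tf, _ in sites})
    groups = sorted({group for _, group in sites})
    table = [[counts[(tf, group)] for group in groups] for tf in tfs]
    return tfs, groups, table

def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom (no correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of TF and group
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical labeled sites: one entry per binding site found
sites = ([("TF1", "test")] * 30 + [("TF2", "test")] * 20
         + [("TF1", "control")] * 100 + [("TF2", "control")] * 200)
tfs, groups, table = contingency_table(sites)
print(tfs, groups, table)
print(chi_squared(table))
```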

ADD COMMENT • link 13.7 years ago by Michael 55k

0

Probably I misunderstood the intention of the OP due to the poor formulation of the question; we'll see.

ADD REPLY • link 13.7 years ago by Michael 55k

0

I think you are testing groups of sites (motifs) here, not individual sites. See the comments on my own answer. Maybe your answer really is the intended one, but I think that in that case the question should be reformulated.

ADD REPLY • link 13.7 years ago by Chris Evelo 10k

0

Yes, I would do tests for counts of sites that bind the same TF; I thought that was the only way the question could make sense.

ADD REPLY • link 13.7 years ago by Michael 55k

0

Hi Michael Dundrup,

I know this is a few years too late, but I was wondering what my data table should look like. As in, what information would I need?

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by bionovice • 0

score 0 · Answer 5 · 2011-04-06

Here is a non-R option.

I have used DAVID for enrichment analysis of TFs. Once you have uploaded your gene list, you have the option to perform enrichment calculations using a variety of annotations. DAVID provides a category called "UCSC_TFBS"; in the default settings this option is unchecked. Click and expand "Protein_Interactions", then click on "UCSC_TFBS".

PS. I am not sure about the source of this data set; so far I have only found this link to the source data on the DAVID forum. It would be great if someone could tell us more about the data source.