I am trying to locate binding sites for a single specific transcription factor in sequences of over 100 kb for ~1000 genes. However good the binding matrix is, and however much I minimize the false positive rate, every matrix has some error rate, so in sequences this long a binding site will be found in every gene. I therefore want to find the genes whose regulatory sequence is enriched for that binding site.
Which test should I use for such an enrichment analysis, and how should I apply it?
I can calculate the number of hits per gene in the test genes, and I know the approximate error rate of the binding matrix per kb for a given similarity cut-off (given in the TRANSFAC database).
Thanks for help.
If I understand correctly, you are looking at ~1000 sequences of 100 kb each. If so, this is probably inadvisable, as sequences of this length are likely to cover the regulatory regions of other genes. Apologies if I have misunderstood! The main problem is that a lot of background noise will come from genes other than the gene of interest.
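That said, if you do go ahead with the per-gene hit counts and the per-kb error rate mentioned in the question, one possible approach is a one-sided Poisson test per gene, followed by a multiple-testing correction across the ~1000 genes. The sketch below is only an illustration under strong assumptions: false-positive hits are treated as independent and occurring at a uniform rate along the sequence, which real genomic sequence may well violate. The function names and all numbers in the usage example are hypothetical, not taken from your data or from TRANSFAC.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summing the upper tail in log space."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    total = 0.0
    i = k
    while True:
        term = math.exp(i * math.log(lam) - lam - math.lgamma(i + 1))
        total += term
        # Past the mode the terms only shrink, so stop once they are negligible.
        if i > lam and term <= 1e-16 * total:
            break
        i += 1
    return min(total, 1.0)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda j: pvalues[j])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        j = order[rank - 1]
        running_min = min(running_min, pvalues[j] * n / rank)
        adjusted[j] = running_min
    return adjusted

def enrichment_table(hits, rate_per_kb, length_kb):
    """Map each gene to (p-value, BH-adjusted p-value) for 'more hits than noise'.

    hits: dict of gene name -> observed number of matrix hits
    rate_per_kb: assumed false-positive rate of the matrix at the chosen cut-off
    length_kb: length of the scanned sequence per gene (assumed equal for all genes)
    """
    expected = rate_per_kb * length_kb  # expected number of chance hits per gene
    genes = list(hits)
    pvals = [poisson_upper_tail(hits[g], expected) for g in genes]
    qvals = benjamini_hochberg(pvals)
    return {g: (p, q) for g, p, q in zip(genes, pvals, qvals)}

if __name__ == "__main__":
    # Hypothetical counts: 0.5 chance hits per kb over 100 kb gives 50 expected hits.
    example_hits = {"geneA": 90, "geneB": 50, "geneC": 61}
    for gene, (p, q) in enrichment_table(example_hits, 0.5, 100).items():
        print(gene, p, q)
```

Genes with a small adjusted p-value would be candidates for enrichment. Shorter windows around each gene reduce the expected number of chance hits, which is one way to act on the concern about noise from neighbouring genes.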