As a general problem, I frequently have a set of genomic locations, intervals or probes that have been selected in some way from the remainder. For example, in ChIP-seq I might have a number of intervals where peaks occurred. If I hypothesize that they all share a common motif, a pattern of some sort, is there an existing framework that would test this for me? Specifically, I would expect a counter and a statistical test of the pattern's frequency in hits versus misses.
With exact matches this is not difficult to implement, and I have done it, although speed might be an issue. The question is whether one could allow mismatches, indels or regex-like structures, to make the search for sequence motifs truly comprehensive.
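To illustrate the mismatch case only, here is a minimal sketch that checks whether a sequence contains a motif within a given number of substitutions (Hamming distance). The function names `hamming` and `contains_with_mismatches` are hypothetical, it does not handle indels, and it is a naive scan rather than anything tuned for speed.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def contains_with_mismatches(seq: str, motif: str, max_mismatches: int = 1) -> bool:
    """True if any window of seq matches motif with at most max_mismatches substitutions."""
    k = len(motif)
    # Slide a window of length k over seq; sequences shorter than the motif never match.
    return any(
        hamming(seq[i:i + k], motif) <= max_mismatches
        for i in range(len(seq) - k + 1)
    )
```

Indels would need an edit-distance or alignment-based comparison instead of the fixed-width window used here.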
So: provide two groups of sequences, count subsequences in both, and optionally perform statistical testing, ideally allowing complex matching. Does this exist as a package or tool?
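To make the request concrete, this is roughly the kind of thing I mean, as a minimal sketch rather than an existing package. It counts how many sequences in each group contain a regex match and computes a one-sided Fisher exact p-value for enrichment in the first group. The names `count_hits` and `fisher_enrichment_p` are hypothetical, and the choice of a one-sided Fisher test on per-sequence presence/absence is an assumption; other tests may be more appropriate.

```python
import re
from math import comb


def count_hits(seqs, pattern):
    """Number of sequences containing at least one match of the regex pattern."""
    rx = re.compile(pattern)
    return sum(1 for s in seqs if rx.search(s))


def fisher_enrichment_p(hits_a, n_a, hits_b, n_b):
    """One-sided Fisher exact p-value for enrichment in group A.

    Probability, under the hypergeometric null, of seeing at least hits_a
    matching sequences in group A given the totals in both groups.
    """
    total_hits = hits_a + hits_b
    total = n_a + n_b
    upper = min(total_hits, n_a)
    tail = sum(
        comb(total_hits, x) * comb(total - total_hits, n_a - x)
        for x in range(hits_a, upper + 1)
    )
    return tail / comb(total, n_a)


# Toy input, made up purely for illustration.
selected = ["ACGTGACC", "TTGTGACA", "GGGTGACT", "ACACACAC"]
background = ["ACACACAC", "TTTTTTTT", "GGGGCCCC", "ATATATAT"]
pattern = "GTGAC[ACT]"

hits_sel = count_hits(selected, pattern)
hits_bg = count_hits(background, pattern)
p_value = fisher_enrichment_p(hits_sel, len(selected), hits_bg, len(background))
print(hits_sel, hits_bg, p_value)
```

Testing many candidate patterns this way would also need a multiple-testing correction, which this sketch leaves out.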
Thanks!
Essentially it is, and I didn't know that was a category or that those tools existed. Thank you sir, please post that as an answer.