I have a set of disease-causing mutations and their flanking neighborhoods (4 bp on each side). I want to determine whether these flanking neighborhoods differ from those of non-disease mutations.
E.g.:
ATTG M TTGA (M=Mutation, disease-causing)
TTAG M GAGG (M=Mutation, non-disease-causing)
How do I estimate the background distribution of all such neighborhoods (4 bp on each side) across the entire genome? Can you help me formulate a statistical test to differentiate between the two?
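One possible way to get at the background part of the question is simply to tabulate the flanks around every position of the reference sequence. The snippet below is only a minimal sketch under assumptions: the function names `flank_counts` and `to_frequencies` and the toy sequence are made up for illustration, only one strand is scanned, and windows containing anything other than A/C/G/T are skipped. For a real analysis you would probably also want to restrict the background to positions whose reference base matches the mutation type being studied.

```python
from collections import Counter

def flank_counts(seq, k=4):
    """Count (left, right) k-bp flanks around every position of seq.

    Hypothetical helper: scans a single strand and skips any window
    (flanks plus centre base) that contains a non-ACGT character.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(k, len(seq) - k):
        window = seq[i - k:i + k + 1]
        if set(window) <= set("ACGT"):
            left, right = seq[i - k:i], seq[i + 1:i + 1 + k]
            counts[(left, right)] += 1
    return counts

def to_frequencies(counts):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {flank: n / total for flank, n in counts.items()}

# Toy sequence standing in for a chromosome (not real data).
toy = "ATTGCTTGAACGTTAGGGAGGNATTGCTTGA"
background = flank_counts(toy)
background_freq = to_frequencies(background)
```

On a real genome the same loop would be run per chromosome and the counters summed; the resulting frequencies give the expected share of each neighborhood if mutations fell uniformly at random.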
Thanks for your reply. I have tried something very similar to your first suggestion: I looked for motifs that are over-represented around disease-causing mutations compared with non-disease mutations, and I now have a list of such motifs with their respective significance values. The problem is that I am unable to attach any biological significance to them. Say AATG is over-represented around a particular mutation type (C>G): how do I go back and check whether there is any biological basis for this? Any help is greatly appreciated!
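For reference, the kind of per-motif comparison described above could be sketched as a one-sided Fisher's exact test on a 2x2 table (motif present or absent, disease or non-disease), followed by a Benjamini-Hochberg adjustment because many motifs are tested at once. This is an assumption about how such a test might be set up, not necessarily what was actually run; the function names and the counts in `motif_tables` are invented for illustration.

```python
from math import comb

def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact p-value for motif enrichment.

    a: disease mutations with the motif,     b: disease without it
    c: non-disease mutations with the motif, d: non-disease without it
    Returns P(X >= a) under the hypergeometric null.
    """
    n_disease, n_motif, n = a + b, a + c, a + b + c + d
    upper = min(n_disease, n_motif)
    tail = sum(
        comb(n_motif, x) * comb(n - n_motif, n_disease - x)
        for x in range(a, upper + 1)
    )
    return tail / comb(n, n_disease)

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up counts per motif: (a, b, c, d) as defined above.
motif_tables = {"AATG": (30, 70, 10, 90), "TTGA": (12, 88, 11, 89)}
motifs = list(motif_tables)
raw_p = [fisher_enrichment_p(*motif_tables[m]) for m in motifs]
adj_p = dict(zip(motifs, benjamini_hochberg(raw_p)))
```

The adjusted values only say which motifs are statistically enriched; they do not by themselves say anything about mechanism.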
That needs more information and biological context, such as whether the variants are coding, non-coding, splice-site variants, etc.