5.8 years ago
Gene_MMP8
I have a set of di-, tri- and tetra-nucleotide motifs from the coding region of the genome that are over-represented. Is there any way to establish the biological significance of this over-representation? Right now it is only statistically significant with respect to a null model (of random sequences). TRANSFAC records the significance of short motifs in regulatory regions; is there an equivalent database for coding regions?
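For readers following along, here is a minimal sketch of what "over-represented with respect to a null model" can mean in practice. It assumes the simplest possible null, independent bases drawn at the sequence's own mononucleotide frequencies, which is not necessarily the null used in the question. The function names `kmer_counts` and `observed_expected` are made up for illustration.

```python
from collections import Counter

BASES = "ACGT"

def kmer_counts(seq, k):
    """Count every overlapping k-mer made up only of A, C, G, T."""
    seq = seq.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(w for w in windows if set(w) <= set(BASES))

def observed_expected(seq, k):
    """Observed/expected ratio per k-mer under an independent-base null.

    Assumption: the expected count is the product of the single-base
    frequencies times the number of valid windows. This ignores codon
    structure, so it will also pick up ordinary codon bias.
    """
    seq = seq.upper()
    base_counts = Counter(b for b in seq if b in BASES)
    total = sum(base_counts.values())
    freqs = {b: base_counts[b] / total for b in BASES}

    counts = kmer_counts(seq, k)
    n_windows = sum(counts.values())
    ratios = {}
    for kmer, observed in counts.items():
        p = 1.0
        for b in kmer:
            p *= freqs[b]
        ratios[kmer] = observed / (p * n_windows)
    return ratios
```

A ratio well above 1 only says the motif is more frequent than base composition alone predicts. In coding sequence, a null that preserves codon usage is probably a fairer baseline.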
What are you trying to show? It seems to me that all you’ve found so far is the beginnings of codon bias, which is already a well-known phenomenon.
I have a list of mutations and the motifs are the bases flanking the mutation. I want to check whether over-representation of certain motifs influences the type of mutation.
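One way to frame this check is as a test of independence between flanking motif and mutation type. The sketch below builds a motif-by-type contingency table and computes a Pearson chi-square statistic. It assumes each mutation is given as a hypothetical `(motif, mutation_type)` pair, and the function names are illustrative only.

```python
from collections import Counter

def chi_square(records):
    """Pearson chi-square statistic for motif vs. mutation type.

    records: iterable of (motif, mutation_type) pairs, one per mutation.
    Returns (statistic, degrees_of_freedom). Turning this into a p-value
    requires a chi-square distribution table or a statistics library.
    """
    records = list(records)
    n = len(records)
    table = Counter(records)                  # joint counts
    row = Counter(m for m, _ in records)      # counts per motif
    col = Counter(t for _, t in records)      # counts per mutation type

    stat = 0.0
    for motif in row:
        for mut_type in col:
            expected = row[motif] * col[mut_type] / n
            stat += (table[(motif, mut_type)] - expected) ** 2 / expected

    dof = (len(row) - 1) * (len(col) - 1)
    return stat, dof
```

With many motifs and few mutations per cell, the chi-square approximation becomes unreliable, so an exact or permutation-based test may be a better fit. Testing many motifs also calls for a multiple-testing correction.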
You should really have added the information about the mutations to your post.
The first thing you could consider is running a sequence query on the motifs to see whether they map to insertion elements and/or inverted repeats. These are common mutation signatures, so you may first want to remove them from your dataset (or at least annotate them as such).
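The inverted-repeat part of that check can be sketched directly on the sequences. The code below flags motifs that equal their own reverse complement and scans a flanking sequence for an arm followed closely by its reverse complement. The function names and the `arm` and `max_gap` parameters are assumptions for illustration. Matching against insertion elements would need an alignment against a reference collection, which is not shown here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def is_palindromic(motif):
    """True if the motif equals its own reverse complement."""
    return motif.upper() == reverse_complement(motif)

def find_inverted_repeats(seq, arm, max_gap):
    """Find (left_start, right_start) pairs where the arm-length segment
    at left_start is followed, within max_gap bases, by its reverse
    complement starting at right_start."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * arm + 1):
        target = reverse_complement(seq[i:i + arm])
        for gap in range(max_gap + 1):
            j = i + arm + gap
            if j + arm > len(seq):
                break
            if seq[j:j + arm] == target:
                hits.append((i, j))
    return hits
```

Note that an odd-length motif such as a trinucleotide can never be its own reverse complement, so `is_palindromic` is only informative for the di- and tetra-nucleotide motifs.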