Hi all,
I have been trying to use kmer analysis (using k=3) to identify phenotypes of behavioral sequences.
For example, if each letter is a behavior within a courtship display, I could have the following:
Species A: R R R R R S H E
Species B: P P P P P P A S H E
Hybrid 1: P P P P R R E
Hybrid 2: R R R R R P E
The idea is for the kmer analysis to separate all individuals into Species A, Species B, and various hybrid phenotypes based on the sequences they perform. It has actually done a very good job separating the parent species and intermediate hybrids, but apparent backcrossed hybrids (i.e., individuals that act like Species A but perform a single behavior that Species B does) are often placed incorrectly with Species A.
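To make the misplacement concrete, here is a minimal sketch of overlapping 3-mer counting on the four example sequences above. It assumes plain sliding-window counts with no weighting. The function name `kmer_counts` is hypothetical and not taken from any particular package.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in a list of behavior codes."""
    return Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))

# Example sequences from the post, one letter per behavior.
sequences = {
    "Species A": "R R R R R S H E".split(),
    "Species B": "P P P P P P A S H E".split(),
    "Hybrid 1": "P P P P R R E".split(),
    "Hybrid 2": "R R R R R P E".split(),
}

profiles = {name: kmer_counts(seq) for name, seq in sequences.items()}

for name, profile in profiles.items():
    print(name, dict(profile))
```

In this sketch, the repeated 3-mer R R R accounts for 3 of the 6 3-mers in Species A and 3 of the 5 in Hybrid 2, so a count-based comparison is dominated by the shared repeat and not by the single P that marks the hybrid.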
I've tried to find ways to weight characters or eliminate repetitive 3mers so that they don't bias the analysis, but I haven't been able to do so.
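For reference, two simple ways of down-weighting repeats could be sketched as follows: scoring each 3-mer by presence/absence instead of by count, and collapsing runs of the same behavior before extracting 3-mers. This is only an illustration under those assumptions. The helper names (`collapse_runs`, `kmer_set`, `jaccard_distance`) are hypothetical, and Jaccard distance is just one possible choice of dissimilarity.

```python
from itertools import groupby

def collapse_runs(seq):
    """Replace each run of identical behaviors with a single copy."""
    return [key for key, _ in groupby(seq)]

def kmer_set(seq, k=3):
    """Presence/absence profile: the set of distinct overlapping k-mers."""
    return {tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)}

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

species_a = "R R R R R S H E".split()
hybrid_2 = "R R R R R P E".split()

# Presence/absence only: repeated R R R counts once per individual.
raw = jaccard_distance(kmer_set(species_a), kmer_set(hybrid_2))

# Presence/absence after collapsing runs: R S H E versus R P E.
collapsed = jaccard_distance(
    kmer_set(collapse_runs(species_a)),
    kmer_set(collapse_runs(hybrid_2)),
)

print(f"raw: {raw:.3f}, collapsed: {collapsed:.3f}")
```

Both options discard run length, which may itself be informative (Species B repeats P six times in the example), so they are a trade-off and not a guaranteed fix.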
Is anyone familiar with kmer analysis? If so, do you have any suggestions?
Looks like a problem for a hidden Markov model.