I have a set of k-mers, all of equal length (eg. 10bp). They currently only contain the letters A, G, T and C. I am looking for a tool or script that can find the smallest set of k-mers that encapsulates all diversity in the input set using degenerate bases.
I am aware that there are tools such as DegePrime for identifying degenerate primers; however, I do not believe they cleanly fit this purpose as they expect multisequence alignments of long regions for amplification.
Example:
Input:
AAAAAAAAAA
TAAAAAAAAA
GCGAAAAAAA
Output:
WAAAAAAAAA
GCGAAAAAAA
Requirements:
- Must scale to a large number of k-mers, eg. over 10,000
- Can use any IUPAC ambiguous base to collapse a single position
- Cannot add any new degeneracy (ie the sequences cannot expand to a sequence not in the input)
- Must optimize for smallest set size
Thanks!
Edit: As there seems to be some confusion around the requirements, I've created some additional example cases (note: they may not be perfect, eg. I may have missed a solution, so feel free to correct any of them).
# Example 1
AAAAAAAAAA
TAAAAAAAAA
# Example 2: add in an additional seq that cannot be collapsed
AAAAAAAAAA
TAAAAAAAAA
GCGAAAAAAA
# Example 3: single seq
AAAAAAAAAA
# Example 4: very similar to example 1 but uses an N to capture all 4 options for the first base
AAAAAAAAAA
TAAAAAAAAA
CAAAAAAAAA
GAAAAAAAAA
# Example 5: all permutations are OK
AAAAAAAAAA
TAAAAAAAAA
AAAAAAAAAA
ATAAAAAAAA
# Example 6: All permutations not OK so keep some seqs separate, minimizing for set size (similar to example 4)
AAAAAAAAAA
TAAAAAAAAA
CAAAAAAAAA
GAAAAAAAAA
TACAGATACA
AACAGAAAAA
# Example 7: Multiple optimal solutions
AAAAAAAAAA
TAAAAAAAAA
ACAAAAAAAA
AGAAAAAAAA
# Output 1
WAAAAAAAAA
# Output 2
WAAAAAAAAA
GCGAAAAAAA
# Output 3
AAAAAAAAAA
# Output 4
NAAAAAAAAA
# Output 5
WWAAAAAAAA
# Output 6
NAAAAAAAAA
TACAGATACA
AACAGAAAAA
# Output 7
## Option 1
WAAAAAAAAA
ASAAAAAAAA
## Option 2
AVAAAAAAAA
TAAAAAAAAA
The smallest set is: NNNNNNNNNN
may be N{10}
These answers do not meet the requirement that the resulting set cannot add additional/new degeneracy. NNNNNNNNNN could be expanded to CCCCCCCCCC which is not in the original set, therefore this is not the correct solution.
What is the difference between this:
and this:
I'm unsure if there is a one-to-one mapping between your starting sequences and an encoded output.
KMRAAAAAAA
could be rewritten as{G/T}{A/C}{A/G}AAAAAAA
. Expanding all permutations of this reveals:Not all of these k-mers are contained in the input set and therefore this is not an acceptable solution. Reversing this optimization must not add any new sequences.
It is possible that there will be multiple optimal solutions (ie minimal sets of equal size), but they can only be expanded to the input set with no additional sequences.