Question

Create smallest set of k-mers using degenerate bases

0

Entering edit mode

3.0 years ago

albert • 0

I have a set of k-mers, all of equal length (eg. 10bp). They currently only contain the letters A, G, T and C. I am looking for a tool or script that can find the smallest set of k-mers that encapsulates all diversity in the input set using degenerate bases.

I am aware that there are tools such as DegePrime for identifying degenerate primers; however, I do not believe they cleanly fit this purpose as they expect multisequence alignments of long regions for amplification.

Example:

Input:

AAAAAAAAAA
TAAAAAAAAA
GCGAAAAAAA

Output:

WAAAAAAAAA
GCGAAAAAAA

Requirements:

Must scale to a large number of k-mers, eg. over 10,000
Can use any IUPAC ambiguous base to collapse a single position
Cannot add any new degeneracy (ie the sequences cannot expand to a sequence not in the input)
Must optimize for smallest set size

Thanks!

Edit: As there seems to be some confusion around the requirements, I've created some additional example cases (note: they may not be perfect, eg. I may have missed a solution, so feel free to correct any of them).

# Example 1
AAAAAAAAAA
TAAAAAAAAA

# Example 2: add in an additional seq that cannot be collapsed
AAAAAAAAAA
TAAAAAAAAA
GCGAAAAAAA

# Example 3: single seq
AAAAAAAAAA

# Example 4: very similar to example 1 but uses an N to capture all 4 options for the first base
AAAAAAAAAA
TAAAAAAAAA
CAAAAAAAAA
GAAAAAAAAA

# Example 5: all permutations are OK
AAAAAAAAAA
TAAAAAAAAA
AAAAAAAAAA
ATAAAAAAAA

# Example 6: All permutations not OK so keep some seqs separate, minimizing for set size (similar to example 4)
AAAAAAAAAA
TAAAAAAAAA
CAAAAAAAAA
GAAAAAAAAA
TACAGATACA
AACAGAAAAA

# Example 7: Multiple optimal solutions
AAAAAAAAAA
TAAAAAAAAA
ACAAAAAAAA
AGAAAAAAAA

# Output 1
WAAAAAAAAA

# Output 2
WAAAAAAAAA
GCGAAAAAAA

# Output 3
AAAAAAAAAA

# Output 4
NAAAAAAAAA

# Output 5
WWAAAAAAAA

# Output 6
NAAAAAAAAA
TACAGATACA
AACAGAAAAA

# Output 7
## Option 1
WAAAAAAAAA
ASAAAAAAAA
## Option 2
AVAAAAAAAA
TAAAAAAAAA

kmer degenerate • 1.8k views

ADD COMMENT • link updated 3.0 years ago by Alex Reynolds 36k • written 3.0 years ago by albert • 0

2

Entering edit mode

The smallest set is: NNNNNNNNNN

ADD REPLY • link 3.0 years ago by 5heikki 11k

0

Entering edit mode

may be N{10}

ADD REPLY • link 3.0 years ago by cpad0112 21k

0

Entering edit mode

These answers do not meet the requirement that the resulting set cannot add additional/new degeneracy. NNNNNNNNNN could be expanded to CCCCCCCCCC which is not in the original set, therefore this is not the correct solution.

ADD REPLY • link 3.0 years ago by albert • 0

0

Entering edit mode

What is the difference between this:

WAAAAAAAAA
GCGAAAAAAA

and this:

AAAAAAAAAA
KMRAAAAAAA

I'm unsure if there is a one-to-one mapping between your starting sequences and an encoded output.

ADD REPLY • link 3.0 years ago by Alex Reynolds 36k

0

Entering edit mode

KMRAAAAAAA could be rewritten as {G/T}{A/C}{A/G}AAAAAAA. Expanding all permutations of this reveals:

GAAAAAAAAA
GCAAAAAAAA
GCGAAAAAAA
GAGAAAAAAA
TAAAAAAAAA
TCAAAAAAAA
etc.

Not all of these k-mers are contained in the input set and therefore this is not an acceptable solution. Reversing this optimization must not add any new sequences.

It is possible that there will be multiple optimal solutions (ie minimal sets of equal size), but they can only be expanded to the input set with no additional sequences.

ADD REPLY • link 3.0 years ago by albert • 0

score 1 · Answer 1 · 2021-11-23

1

Entering edit mode

3.0 years ago

Alex Reynolds 36k

Interesting problem and I took a run at it here:

https://github.com/alexpreynolds/kmer-collapse

I would be curious to know if it scales to the size of your input set. Hope this is useful to you, in any case.

ADD COMMENT • link 3.0 years ago by Alex Reynolds 36k