Question

Find All K-Mers In One Set Of Sequences Not In The Other Set

2


12.9 years ago

David M ▴ 580

I'd like to identify all kmers in one set of transcriptomic data that are not in the other set. I am dealing with large amounts of data, but it seems to me that sequence assembly regularly performs this task, and that it could be easily accomplished with suffix trees. The k I am thinking of using is 32.

Are there any programs which can accomplish this for me? I'd rather not re-invent the wheel. I'd even settle for a program which can give me a list of k-mers present in a single data set.

Cheers!

assembly comparison • 3.3k views

updated 6.1 years ago by Biostar 20 • written 12.9 years ago by David M ▴ 580

score 4 · Answer 1 · 2012-02-15

4


12.9 years ago

Damian Kao 16k

I like using jellyfish for counting k-mers. It's pretty fast.
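For small inputs, the underlying comparison can be sketched in plain Python. This is only an illustration of the idea, not how jellyfish works internally, and it will not scale to large transcriptomic data sets. The function names (`canonical`, `kmers`, `kmer_set`, `unique_kmers`) are made up for this example, and treating a k-mer and its reverse complement as the same k-mer is an assumption that may or may not suit stranded data.

```python
from typing import Iterable, Iterator, Set

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield canonical k-mers from one sequence, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield canonical(kmer)


def kmer_set(seqs: Iterable[str], k: int) -> Set[str]:
    """Collect every canonical k-mer seen in a collection of sequences."""
    return {kmer for seq in seqs for kmer in kmers(seq, k)}


def unique_kmers(set_a: Iterable[str], set_b: Iterable[str], k: int = 32) -> Set[str]:
    """Return the k-mers present in set_a but absent from set_b."""
    return kmer_set(set_a, k) - kmer_set(set_b, k)
```

Calling `unique_kmers(sequences_a, sequences_b, k=32)` on two lists of sequence strings returns the k-mers found only in the first list. Holding every 32-mer in a Python set is memory-hungry, which is the main reason to reach for a dedicated counter on real data.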


0


This worked perfectly, and pretty fast as well. Thanks!

12.8 years ago by David M ▴ 580

score 1 · Answer 2 · 2012-02-24

I would start by looking at Tallymer, which is part of genometools. Tallymer lets you create a persistent index from one set and compare the occurrence ratio of k-mers from another set. I do a lot of k-mer work, and this is a great program that is well-documented.
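The index-then-query idea can be illustrated with a toy Python sketch. It does not mirror Tallymer's own on-disk index or its command-line interface; `build_index` and `occurrence_profile` are hypothetical names for this example, and the index here is just an in-memory counter that ignores reverse complements.

```python
from collections import Counter
from typing import Iterable, List


def build_index(seqs: Iterable[str], k: int) -> Counter:
    """Count how often each k-mer occurs across the indexed sequences."""
    index: Counter = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]] += 1
    return index


def occurrence_profile(query: str, index: Counter, k: int) -> List[int]:
    """For each k-mer position in the query, look up its count in the index."""
    query = query.upper()
    return [index.get(query[i:i + k], 0) for i in range(len(query) - k + 1)]
```

Positions where the profile is zero mark k-mers of the query that never occur in the indexed set, which is the information the original question is after.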

Vmatch is a really powerful and versatile tool for any sequence comparison task. While I have never used it, the Unwords software seems like it might be the right tool for this job. These two software packages are related (same author), though according to the Unwords publication, the data structures used by Unwords are completely different from those in Vmatch.

To add one more, I occasionally use meryl, which is really fast. It has almost no documentation, and I always have to read the C code to figure out the invocation, which is why I listed it last (not to say it's inferior, but you'll get going a lot faster with the other tools).