How to find number of unique n-mers in genome?
1
0
Entering edit mode
8.6 years ago
endrebak ▴ 970

I want to find the number of unique n-mers in the genome, for arbitrary ns (say 30). Is there software that can do this?

If not, how could I do it without requiring excessive ram? I guess a suffix-tree would be the best option, but all the implementations I have found are either inefficient or 10 years old.

alignment sequence • 2.2k views
ADD COMMENT
2
Entering edit mode

Jellyfish or kmc2? Not sure if they work with chromosome-long sequences.

ADD REPLY
0
Entering edit mode

I want to thank all responders and will upvote on monday when I get to my computer. I'll start with jellyfish since 2.0 accepts full genomes according to the docs ( Section 1.1.2 in the user guide: http://www.genome.umd.edu/docs/JellyfishUserGuide.pdf )

ADD REPLY
0
Entering edit mode

kmercountexact.sh from BBMap may be worth a try. Some prior discussion: How to find the shortest k-mer length that is unique in a large genome.

ADD REPLY
1
Entering edit mode
8.6 years ago
kloetzl ★ 1.1k

A suffix array might help. Compute SA and LCP. Then do a linear scan of the LCP array for each entry i with LCP[i] < n && LCP[i+1] < n you have found a unique n-mer (given strlen(S+SA[i]) >= n). For SA construction you might want to use https://github.com/y-256/libdivsufsort

Alternatively you could apply hashing to your kmers. There are a lot of tools out there for that job. See if one fits your needs.

ADD COMMENT

Login before adding your answer.

Traffic: 2438 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6