I want to find the number of unique n-mers in a genome, for arbitrary n (say, 30). Is there software that can do this?
If not, how could I do it without requiring excessive RAM? I guess a suffix tree would be the best option, but all the implementations I have found are either inefficient or 10 years old.
Jellyfish or KMC 2? I'm not sure whether they work with chromosome-long sequences.
I want to thank all responders and will upvote on Monday when I get to my computer. I'll start with Jellyfish, since version 2.0 accepts full genomes according to the docs (Section 1.1.2 in the user guide: http://www.genome.umd.edu/docs/JellyfishUserGuide.pdf).
kmercountexact.sh from BBMap may be worth a try. Some prior discussion: "How to find the shortest k-mer length that is unique in a large genome".
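For checking what a counter reports on a small test sequence, here is a minimal Python sketch of the quantity being asked about. It is not how Jellyfish, KMC or BBMap work internally. The function names (`canonical`, `kmer_stats`) and the `both_strands` parameter are made up for illustration. It also assumes that windows containing non-ACGT characters (e.g. N) should be skipped, and it reports both readings of "unique": distinct k-mers, and k-mers that occur exactly once. Memory grows with the number of distinct k-mers, so this is only suitable for small inputs, not a whole genome.

```python
from collections import Counter

# Translation table for complementing a DNA string (assumes upper-case ACGT).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_stats(seq, k, both_strands=True):
    """Return (distinct, occurring_once) k-mer counts for one sequence.

    Hypothetical helper: windows containing anything other than A, C, G, T
    are skipped, which is an assumption about how N runs should be handled.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # skip windows with ambiguous bases
        counts[canonical(kmer) if both_strands else kmer] += 1
    distinct = len(counts)
    occurring_once = sum(1 for c in counts.values() if c == 1)
    return distinct, occurring_once


if __name__ == "__main__":
    print(kmer_stats("ACGTACGT", 4, both_strands=False))
    print(kmer_stats("ACGTACGT", 4, both_strands=True))
```

Collapsing each k-mer with its reverse complement (`both_strands=True`) treats the two strands as equivalent; whether that is the right choice depends on what "unique" should mean for your analysis.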