Is There A Fast Hashing Function For Nucleotide K-Mers (Q-Grams)?
4
17
Entering edit mode
13.5 years ago
Ketil 4.1k

I would like to index all k-mers in a set of nucleotide sequences. I could use a generic string based hash function, but my experiments indicate that a function leveraging the fact that the k-mers are overlapping might be faster. Also, I'd like a k-mer and its reverse complement to hash to the same value.

To be precise, I am not looking for hash table implementations, or applications, just a fast function from k-mer to a hash value, so that the value for a k-mer and its reverse complement is the same.

I can invent my own function for this (I have, as a matter of fact), but surely, something like this exists already?

index • 17k views
ADD COMMENT
0
Entering edit mode

have you any solution for this? what's the hashing functions for DNA k-mers?? Thanks

ADD REPLY
10
0
Entering edit mode

Fun to see bloom filters gaining some use! This is in fact what I want the hashing function for (although it will be rehashed by some other function later).

ADD REPLY
0
Entering edit mode

khmer uses darcs as its RCS. Yay!

ADD REPLY
0
Entering edit mode

FYI: The correct url for the khmer project is https://github.com/ged-lab/khmer

ADD REPLY
5
Entering edit mode
13.5 years ago
Micans ▴ 270

I have used the simple map from A,C,G,T onto {00, 01, 10, 11} (in bits), allowing four bases encoded in a byte. You could then use hash(word) + hash(rc(word)), or any other symmetric function. This approach has the advantage that you can use the current hash to find the next hash; as you shift bases in and out of your k-mer, you shift bits in and out of your hash or hash constituents. This is the approach taken in sylamer, and it's pretty fast.

ADD COMMENT
1
Entering edit mode

I guess this is a variation of rolling hash? Can you give more specific description on implementation? Thanks!

ADD REPLY
0
Entering edit mode

Still here? :-) Yes, I maintain the current hash in forward and reverse complement, shift left/right and chop off the end, add next base, and store the numerically smallest hash. Code at https://github.com/ketil-malde/kmx.

ADD REPLY
0
Entering edit mode

I used this approach recently in my final year project and my lecturer was left speechless at the speed/accuracy of my hash function. Not sure why this isn't a standard!

ADD REPLY
0
Entering edit mode

If you want identical hashes for reverse-complements, couldn't you just use 0 for A/T and 1 for G/C, instead of using two bits per nucleotide?

ADD REPLY
0
Entering edit mode

Yes, I've used this too. The main problems with this is that you risk overflowing a fixed size data type (so k must not be greater than bits/2). For large k, you need to use an arbitrary precision integer, which slows things down. Also, the incremental calculation of next hash involves two divisions and some other arithmetic. In short - I think it's possible to do better.

ADD REPLY
0
Entering edit mode

Yes, those limitations are evident. In the context of Sylamer we do not support K-mers of length > 16 currently, but on 64-bit machines you can support words of lenght 32 natively. For large k I do not use all bits and resort to true hashing (the size of the hash array defined by the amount of sequence to be analysed). This solution obviously fits some applications and not others, but in terms of speed I will be well impressed if you can do substantially better than 'hid = ((hid << 2) | b) & LOMEGA(K)' (no divisions).

ADD REPLY
0
Entering edit mode

No, that's pretty good. I though that to support simultaneous calculation of the reverse complement hash you'd need a division. But now I think you can do that too with shifts and bitwise ops. Good catch!

ADD REPLY
0
Entering edit mode

I did an implementation of this in Haskell - normally, I'd expect a high level language to be less efficient for this, but it turns out it is fast enough (meaning that I haven't found an associative data structure that won't be dramatically slower than the hashing). I can hash about 40MB/s on my laptop, doing both forward and reverse complement, and avoiding Ns. Almost twice as fast without those restrictions.

Oh, and I map A: 00, C: 11, G: 01, T:10 - this makes it easy (bitwise operations) to complement and count GC content, and differentiate pyrimidines and purines.

ADD REPLY
2
Entering edit mode
13.5 years ago
Drio ▴ 920

Take a look to these benchmark results. Particularly the last benchmark, dict. It compares different hashing libraries in different programming languages.

If you want speed, use khash. That's the hash table implementation used in the c entry for that benchmark.

Also check this post. It has a very nice review of C/C++ hash table libraries.

ADD COMMENT
0
Entering edit mode

Thanks - but I don't really want a hash table implementation, I just what a hashing function.

ADD REPLY
0
Entering edit mode
13.5 years ago
Jts ★ 1.4k

I have used one of Bob Jenkin's hash functions (lookup3/8) for k-mer hashing with good results:

http://burtleburtle.net/bob/hash/

I have also heard good things about murmer hash but I have not tried it myself:

http://code.google.com/p/smhasher/

As for dedicated k-mer hashes see this paper by Guillaume Marçais and Carl Kingsford:

http://bioinformatics.oxfordjournals.org/content/27/6/764

In section 3.4 they discuss their design of a k-mer specific hashing function.

ADD COMMENT
0
Entering edit mode

What hash compression technique do you use for the hash values returned by lookup3/8? How do you determine the hashtable size?

Thanks in advance.

ADD REPLY

Login before adding your answer.

Traffic: 1679 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6