Question

hash function for DNA sequences

1

Entering edit mode

8.7 years ago

midox ▴ 290

hello, i need a good hash function for DNA sequences. i cant found in web. can you propose me a good and efficient hash function??

Thank you

hash hashing DNA • 13k views

ADD COMMENT • link 8.7 years ago by midox ▴ 290

4

Entering edit mode

What for, what is your type of application? What's relevant for efficiency for you? Computational complexity, time, memory, cryptographic quality, expected number of collisions? What is your key size, input size? Specific programming languages?

ADD REPLY • link 8.7 years ago by Michael 55k

2

Entering edit mode

I want to store in a hash table a k-mers. I want a good hash function in time and memory. I do not know yet the number of colisions. I program in C. thanks.

ADD REPLY • link 8.7 years ago by midox ▴ 290

2

Entering edit mode

If you just want to store kmers in a table, don't hash them. Just encode them to a number A → 0, C → 1, … T → 3 and weight each digit accordingly.

ADD REPLY • link 8.7 years ago by kloetzl ★ 1.1k

0

Entering edit mode

no, i need to hash them because i want to compare sequences using the hash table. for this, i need a good hash function.

ADD REPLY • link 8.7 years ago by midox ▴ 290

3

Entering edit mode

The trivial hash is also a perfect one, it won't have collisions, so you can compare sequences. Also it is normally also cheapest to compute. If you only want to encode ACGT (no ambiguity), each base can be encoded as 2-bit. That means that 32 bits could address all 16 mers, and a 64-bit integer k = 32

ADD REPLY • link 8.7 years ago by Michael 55k

0

Entering edit mode

maybe i choose my k=8. can you explain me what's a trivial hash?? thanks

ADD REPLY • link 8.7 years ago by midox ▴ 290

3

Entering edit mode

Take for instance the kmer K=GTCGAATC (size 8)

If you take by convention A=0, C=1, G=2, T=3, you can encode K as a polynomial in base 4 like this:

K = G.4^0 + T.4^1 + C.4^2 + G.4^3 + A.4^4 + A.4^5 + T.4^6 + C.4^7
K = 2*4^0 + 3*4^1 + 1*4^2 + 2*4^3 + 0*4^4 + 0*4^5 + 3*4^6 + 1*4^7
K = 28830

You can thus encode any 8-mer as a unique integer value.

ADD REPLY • link 8.7 years ago by edrezen ▴ 730

0

Entering edit mode

ok thank you for your explanations. if i have collisions, what do I have in this case? because in my sequences i have ~15% of errors. Thanks

ADD REPLY • link 8.7 years ago by midox ▴ 290

2

Entering edit mode

With this technique the hash is bijective and thus no collisions occur for fixed k.

ADD REPLY • link 8.7 years ago by kloetzl ★ 1.1k

0

Entering edit mode

ok.if i have collisions? what do I have in this case? because error cause collisions!

ADD REPLY • link 8.7 years ago by midox ▴ 290

2

Entering edit mode

I think this is a misunderstanding. Sequencing errors do not cause hash collisions. A collision in the sense of the topic would be like h(s): S -> H , there exists s, v in S, s != v AND h(s) = h(v)

A seq. error causes a sequence to contain a 'wrong' k-mer (ACgT) while the 'true' sequence is (ACAT). You have no chance to differentiate the two by means of a hashing function.

ADD REPLY • link 8.7 years ago by Michael 55k

0

Entering edit mode

have you some papers for hash that use nucleotides ??

ADD REPLY • link 8.7 years ago by midox ▴ 290

0

Entering edit mode

Look for the khmer citation.

ADD REPLY • link 8.7 years ago by Michael 55k

0

Entering edit mode

i see the citation but i not found!! have you another?? Thanks

ADD REPLY • link 8.7 years ago by midox ▴ 290

0

Entering edit mode

what do u mean by 'i not found'? the paper is there, if u need more try the other thread or any of the papers in the references.

ADD REPLY • link 8.7 years ago by Michael 55k

1

Entering edit mode

If you truly need a hash then I've found murmur3 to be quick and easy to use in C.

ADD REPLY • link 8.7 years ago by Devon Ryan 104k

0

Entering edit mode

i dont undestand the fonctionnality of Murmurhash. can you simplify me ?? THanks

ADD REPLY • link 8.7 years ago by midox ▴ 290

1

Entering edit mode

It's a generic hash function, so you can support arbitrarily long kmers (or anything else, for that matter).

ADD REPLY • link 8.7 years ago by Devon Ryan 104k

0

Entering edit mode

ok thanks. do i need to undestand the algorithm? because in https://en.wikipedia.org/wiki/MurmurHash I did not understand the algorithm or the code! can you explain me if you can ?

ADD REPLY • link 8.7 years ago by midox ▴ 290

2

Entering edit mode

Don't bother trying to understand the hash function, just use it if you need it. This has good performance and is fairly widely used. I've used it in conjunction with the 2-bit encoding that Michael and the others have mentioned (i.e., encoding an A as 0, C as 1, and so on) and been happy with the results. Certainly if you can get away with just using the 2-bit encoded value then do it, you're unlikely to get better performance than that.

ADD REPLY • link 8.7 years ago by Devon Ryan 104k

0

Entering edit mode

OK thankyou. Just a few questions: what is the length of k-mers that can support MurMur ?? and can i use it on nucleotides ?? thanks for your responses.

ADD REPLY • link 8.7 years ago by midox ▴ 290

1

Entering edit mode

You can use any length of anything, it's a generic hashing function.

ADD REPLY • link 8.7 years ago by Devon Ryan 104k

0

Entering edit mode

in the focntion of MurMurHash there are (key, len, seed) the Key is the string , len is the lenght and what's seed? Thanks

ADD REPLY • link 8.7 years ago by midox ▴ 290

2

Entering edit mode

Seed is a random value that's used in the hashing. I just use uint32_t seed = 0xAAAAAAAA;.

ADD REPLY • link 8.7 years ago by Devon Ryan 104k

0

Entering edit mode

if i use machine 64 what's the random value can i use?

ADD REPLY • link 8.7 years ago by midox ▴ 290

2

Entering edit mode

There's no difference between 32/64 bit machines with this, you can use the same value (e.g., 0xAAAAAAAA).

ADD REPLY • link 8.7 years ago by Devon Ryan 104k

1

Entering edit mode

On a 64bit machine, the most random value is 0x8000000000000000, plus or minus 0x8000000000000000. My personal fav is 0x0123456789ABCDEF, but dont tell anyone because every time some uses it, it becomes less random.

ADD REPLY • link 8.7 years ago by John 13k

1

Entering edit mode

We need to add <sarcasm> tags to biostars.

ADD REPLY • link 8.7 years ago by Devon Ryan 104k

0

Entering edit mode

Hahaha, yes :) Or perhaps just a catch-all <bad advice> tag that I could encapsulate all my posts in :P

ADD REPLY • link 8.7 years ago by John 13k

0

Entering edit mode

it's logic to have a negative values by murmur?? because I have positive values and negative values !!!

ADD REPLY • link 8.6 years ago by midox ▴ 290

1

Entering edit mode

Presumably you need to recast to an unsigned integer. My code for this is:

uint64_t hashString(char *s) {
    int len = strlen(s);
    uint64_t hash_val[2];
    uint32_t seed = 0xAAAAAAAA;
#if UINTPTR_MAX == 0xffffffff
    MurmurHash3_x86_128((void *) s, len, seed, (void *) &hash_val);
#else
    MurmurHash3_x64_128((void *) s, len, seed, (void *) &hash_val);
#endif
    return hash_val[0];
}

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

0

Entering edit mode

no i use machine 32 bit. i use the code in wikipedia website

ADD REPLY • link 8.6 years ago by midox ▴ 290

0

Entering edit mode

Then use a uint32_t rather than an int32_t.

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

0

Entering edit mode

yes in my function i used uint32_t but i have negative value. I think these are a false results???

ADD REPLY • link 8.6 years ago by midox ▴ 290

1

Entering edit mode

I guess you are printing you results wrong, because the u in uint32_t means unsigned i.e. no negative values.

ADD REPLY • link 8.6 years ago by kloetzl ★ 1.1k

0

Entering edit mode

That's not even a theoretical possibility. Presumably you're converting that to a signed value at some point.

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

0

Entering edit mode

no, my value is like this in C code: uint32_t Value=0; Value=murmur('ACTGCAC',7,0xAAAAAAAA); printf("%d",Value); and here the result is negative value!!!

ADD REPLY • link 8.6 years ago by midox ▴ 290

1

Entering edit mode

printf("%"PRIu32, Value);

You're converting an unsigned 32-bit integer to a signed int in what you're doing...

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

1

Entering edit mode

I wish I had your patience. ☺

ADD REPLY • link 8.6 years ago by kloetzl ★ 1.1k

0

Entering edit mode

what's printf("%"PRIu32, Value); it doesn't work in C!

ADD REPLY • link 8.6 years ago by midox ▴ 290

0

Entering edit mode

See here and here.

ADD REPLY • link 8.6 years ago by kloetzl ★ 1.1k

0

Entering edit mode

I'm somewhat amazed that you got things compiled without importing inttypes.h...

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

0

Entering edit mode

no we juste use "%u" for unseigned. thanks

ADD REPLY • link 8.6 years ago by midox ▴ 290

0

Entering edit mode

Hello Devon.

Could you please share what the hash compression function do you use in order to assign and index for the hash value returned by murmur? What size of the hash table do you use?

Thanks in advance.

ADD REPLY • link 7.7 years ago by germelcar ▴ 20

1

Entering edit mode

So, fixed length input, and a hash table, I conclude you should optimize for lookup time/hash compute time, that would be the trivial hash algorithm, means you won't need any complex hashing function. Convert the k-mer into a 32-bit integer (if it fits) value and use the number to address an array...

ADD REPLY • link 8.7 years ago by Michael 55k

1

Entering edit mode

The up2bit format converts every sequence of DNA into a unique number, much like the 2bit format suggested by kloetzi, except the up2bit format costs 2 bits more, and allows you to store DNA of a variable length. It is stable over different struct sizes/types, (stuffing it into a signed or unsigned int wont matter, 'ATCG' has the same value in int16 as it does in int128, etc), and since its just bit shifting its faster than polynomial math.

However, its not a one-way hashing function since it can be reversed, so perhaps you were looking for another one of the many things Michael points.

code in python:

def dna_up2bit(DNA):
    up2bit = 1
    for twobit in [ ('A','C','T','G').index(char) for char in reversed(DNA) ]:
        up2bit = (up2bit << 2) + twobit
    return up2bit

ADD REPLY • link 8.7 years ago by John 13k

1

Entering edit mode

See MinHash

ADD REPLY • link 8.7 years ago by 5heikki 11k

1

Entering edit mode

See also: Is There A Fast Hashing Function For Nucleotide K-Mers (Q-Grams)?

ADD REPLY • link 8.7 years ago by Michael 55k

1

Entering edit mode

That thread points out that if you want strand-unspecific hashing, make A/T 0 and C/G 1. I think thats pretty awesome :D Dump twice as much DNA in each hash.

ADD REPLY • link 8.7 years ago by John 13k

0

Entering edit mode

Are the generic functions (e.g., murmur hash) not sufficing for you?

ADD REPLY • link 8.7 years ago by Devon Ryan 104k

0

Entering edit mode

Hello, I have a problem the hash function Murmur (). if I take k = 3 (lenght), the hash function normally gives values between 0 and 255. but I have values in millions and billions. Does this make sense?

ADD REPLY • link 8.6 years ago by midox ▴ 290

1

Entering edit mode

If you have more specific problems about implementation of this algorithm I would recommend to move over to stackexchange. However, this is a good hint as to why the use of a real hash is overkill, there are 64 unique 3-mers, why use the long integer to store them? The hash function is for arbitrary length input, so it doesn't take the k-mer length into account.

ADD REPLY • link 8.6 years ago by Michael 55k

0

Entering edit mode

Sure that makes sense, the values should be distributed over the full range of possible integers.

Of course if everything is of length 3 then just do the 2-bit trick that everyone else mentioned. That'll be faster. It's when the length is too long or you have variable sized inputs that you need a generic hash function.

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

1

Entering edit mode

The up2bit code I pasted above does variable-length inputs :( I think the only time you need a generic hash-function when the OP asks for a generic hash function :P

ADD REPLY • link 8.6 years ago by John 13k

0

Entering edit mode

So what's the solution for DNA k-mer if a i have a small k. do i change the hash function???

ADD REPLY • link 8.6 years ago by midox ▴ 290