Question

Genomic Dna Hash Tables And Ambiguous Bases

5

14.7 years ago

Gww ★ 2.7k

Hi,

I'm hoping to implement a genomic DNA hash table, and I am unsure how to handle N bases.

Should I skip the k-mers that contain them, or generate the possible sequences up to a limit of X N's per k-mer?

EDIT:

I'm hoping to use the hash table to find perfect k-mer matches within the human genome (I'm aiming for a k-mer size of 10-12 nucleotides). My query sequences won't have ambiguous bases, so I'm not too worried about dealing with those. I assume the best strategy is to just skip k-mers that contain N's.

Thanks

alignment genomics • 5.7k views

ADD COMMENT • link updated 14.7 years ago by Gingi ▴ 330 • written 14.7 years ago by Gww ★ 2.7k

score 4 · Answer 1 · 2010-11-14

4

14.7 years ago

Gingi ▴ 330

Yes, if you're looking for perfect matches, don't index k-mers that contain N's.

Are you coding the kmer index yourself? You might want to take a look at Tallymer, which creates an index similar to what you have in mind.
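
For anyone who wants to try this by hand, here is a minimal sketch of the "skip k-mers that contain N" approach. It is only an illustration, not how Tallymer or any other tool does it. The function names (`build_kmer_index`, `encode_kmer`, `lookup`) are made up for this example, and it assumes that soft-masked (lowercase) bases should be indexed like uppercase ones, that any non-ACGT character counts as ambiguous, and that queries are exactly k bases long. Each k-mer is packed into an integer with 2 bits per base, and the window is reset whenever an ambiguous base is seen.

```python
from collections import defaultdict

# 2-bit code per unambiguous base; anything else (N, other IUPAC codes)
# is treated as ambiguous. This mapping is an assumption of the sketch.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def build_kmer_index(seq, k):
    """Map each 2-bit-packed k-mer to the list of its 0-based start positions,
    skipping every k-mer that overlaps a non-ACGT base."""
    mask = (1 << (2 * k)) - 1      # keeps only the last k bases in the code
    index = defaultdict(list)
    code = 0
    valid = 0                      # consecutive unambiguous bases ending here
    for i, base in enumerate(seq.upper()):
        b = ENCODE.get(base)
        if b is None:              # ambiguous base: restart the window
            valid = 0
            code = 0
            continue
        code = ((code << 2) | b) & mask
        valid += 1
        if valid >= k:             # a full window of k clean bases
            index[code].append(i - k + 1)
    return index


def encode_kmer(kmer):
    """Pack an unambiguous k-mer into an integer (raises KeyError on N)."""
    code = 0
    for base in kmer.upper():
        code = (code << 2) | ENCODE[base]
    return code


def lookup(index, kmer):
    """Return the start positions of exact matches for a k-length query."""
    return index.get(encode_kmer(kmer), [])


if __name__ == "__main__":
    idx = build_kmer_index("ACGTNACGTA", 4)
    print(lookup(idx, "ACGT"))
    print(lookup(idx, "CGTA"))
```

With k in the 10-12 range, the packed code fits in 24 bits, so the dictionary could probably be swapped for a flat array of 4^k buckets if memory allows. Whether that is worth it for the whole human genome is something to measure rather than assume.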

ADD COMMENT • link 14.7 years ago by Gingi ▴ 330

1

Thanks for the advice, I want to code it myself mostly for the learning experience (I haven't written a hash table before). But that link will be really helpful.

ADD REPLY • link 14.7 years ago by Gww ★ 2.7k

Ram · Answer 2 · 2010-11-14

2

14.7 years ago

brentp 24k

You'll probably get more useful answers if you indicate your intended use of the hash table.

Meanwhile, check this thread to see how existing software handles the problem.

As another data point, bowtie just treats non-ACGT characters in a read as mismatches, but it's not using a hash.

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 14.7 years ago by brentp 24k

1

Strictly speaking, bowtie treats an ambiguous base as a random base in mapping. It corrects for that afterwards, but this is different from building the ambiguity in the index.

ADD REPLY • link 14.7 years ago by lh3 33k

0

Thanks for the answer. I updated my question with a bit more information regarding my goals. P.S. I really enjoyed your blog article on Bloom filters :).

ADD REPLY • link 14.7 years ago by Gww ★ 2.7k

0

@lh3, aye, but that's in the reference, at least according to the docs: "Ambiguous characters in the read mismatch all other characters."

ADD REPLY • link 14.7 years ago by brentp 24k