Genomic DNA Hash Tables and Ambiguous Bases
14.0 years ago
Gww ★ 2.7k

Hi,

I'm hoping to implement a genomic DNA hash table, and I'm unsure how to handle N bases.

Should I skip the k-mers that contain them, or generate all the possible sequences for k-mers with up to some limit of X N's?
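For illustration, the second option (expanding N's up to a limit) could look something like the sketch below. The function name `expand_kmer` and the `max_n` parameter are made up for this example, and it only handles N, not the other IUPAC ambiguity codes. Note that each N multiplies the number of generated k-mers by 4, which is why a cap is needed.

```python
from itertools import product

BASES = "ACGT"

def expand_kmer(kmer, max_n=2):
    """Return every concrete k-mer an N-containing k-mer could stand for.

    Returns an empty list when the k-mer has more than max_n N's
    (hypothetical cutoff, corresponding to the "X N's" limit).
    """
    kmer = kmer.upper()
    n_positions = [i for i, base in enumerate(kmer) if base == "N"]
    if len(n_positions) > max_n:
        return []
    expanded = []
    # product() with repeat=0 yields one empty tuple, so a k-mer
    # without N's comes back unchanged as a single-element list.
    for combo in product(BASES, repeat=len(n_positions)):
        chars = list(kmer)
        for pos, base in zip(n_positions, combo):
            chars[pos] = base
        expanded.append("".join(chars))
    return expanded

print(expand_kmer("ACNT"))  # ['ACAT', 'ACCT', 'ACGT', 'ACTT']
```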

EDIT:

I'm hoping to use the hash table to find perfect k-mer matches within the human genome (I'm aiming for a k-mer size of 10-12 nucleotides). My query sequences won't contain ambiguous bases, so I'm not too worried about handling them on the query side. I assume the best strategy is simply to skip k-mers that contain N's.

Thanks

alignment genomics • 5.2k views
14.0 years ago
Gingi ▴ 330

Yes, if you're looking for perfect matches, don't index kmers that contain Ns.
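A minimal sketch of that approach, assuming the reference is available as a plain string: it uses Python's built-in `dict` as the hash table (a hand-written table would take its place), and the name `build_kmer_index` is invented for this example. A counter tracks the run of unambiguous bases, so any window overlapping an N is skipped without rescanning the window.

```python
from collections import defaultdict

def build_kmer_index(sequence, k=11):
    """Map each N-free k-mer to the list of its 0-based start positions."""
    index = defaultdict(list)
    sequence = sequence.upper()
    valid = set("ACGT")
    run = 0  # length of the current run of unambiguous bases ending at i
    for i, base in enumerate(sequence):
        # Any non-ACGT character (N included) resets the run.
        run = run + 1 if base in valid else 0
        if run >= k:
            start = i - k + 1
            index[sequence[start:i + 1]].append(start)
    return index

index = build_kmer_index("ACGTNACGTAC", k=4)
print(dict(index))  # {'ACGT': [0, 5], 'CGTA': [6], 'GTAC': [7]}
```

Looking up a query k-mer is then just `index.get(kmer, [])`. Since the queries contain no N's, nothing is lost by leaving the N-containing windows out.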

Are you coding the kmer index yourself? You might want to take a look at Tallymer, which creates an index similar to what you have in mind.


Thanks for the advice. I want to code it myself, mostly for the learning experience (I haven't written a hash table before), but that link will be really helpful.

14.0 years ago
brentp 24k

You'll probably get more useful answers if you indicate your intended use of the hash-table.

Meanwhile, check this thread to see how existing software handles the problem.

As another data point, bowtie just treats non-ACGT characters in a read as mismatches, but it's not using a hash.


Strictly speaking, bowtie treats an ambiguous base as a random base in mapping. It corrects for that afterwards, but this is different from building the ambiguity in the index.


Thanks for the answer. I updated my question with a bit more information regarding my goals. P.S. I really enjoyed your blog article on Bloom filters :).


@lh3, aye, but that's in the reference. At least according to the docs: "Ambiguous characters in the read mismatch all other characters."
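To make the quoted rule concrete, here is a small illustrative sketch of mismatch counting in which any non-ACGT character in the read counts as a mismatch, whatever the reference has at that position. This is not bowtie's code, only a toy reading of the sentence quoted from the docs, and `count_mismatches` is a made-up name.

```python
def count_mismatches(read, ref_window):
    """Count mismatches between a read and an equal-length reference window.

    An ambiguous (non-ACGT) read character mismatches everything,
    including the same ambiguous character in the reference.
    """
    if len(read) != len(ref_window):
        raise ValueError("read and reference window must have equal length")
    valid = set("ACGT")
    return sum(
        1
        for r, g in zip(read.upper(), ref_window.upper())
        if r not in valid or r != g
    )

print(count_mismatches("ACNT", "ACGT"))  # 1
print(count_mismatches("ACNT", "ACNT"))  # 1
```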
