Question

GATB-core kmer counting

5.4 years ago

elebanjar ▴ 10

I recently started using the GATB-core library for counting kmers in reads. Similar to the example code given in "kmer9.cpp" in the Git-Repo, I'm using SortingCountAlgorithm for counting the kmers. Now my (very basic) question: given a specific kmer sequence, is there any way to directly look up its abundance computed by the algorithm (or do I need to iterate through the computed [kmer, abundance] pairs until I find the kmer in question)? Thanks in advance!

gatb gatb-core kmer-counting • 1.3k views

score 3 · Accepted Answer · 2019-07-03

5.4 years ago

Rayan Chikhi ★ 1.5k

Hi,

Yes, it's possible in GATB, but you'd need to build a de Bruijn graph first. See this example: https://github.com/GATB/gatb-core/blob/master/gatb-core/examples/debruijn/debruijn26.cpp

Note that this mechanism doesn't let you determine whether a k-mer is truly in the graph. GATB will return the correct abundance only if the k-mer was present in the sample the graph was constructed from.

best,

Rayan

Thank you for the quick reply, that helps already! In my setting, I don't know beforehand whether a specific kmer would be present in the reads (i.e. the graph), since I have a fixed set of kmers for which I want to know how often they occur in the reads. Using the approach you suggested, is there a way to check if a kmer sequence is present in the graph to make sure I only look up abundances for those that are actually in the graph?

5.4 years ago by elebanjar ▴ 10

If you can tolerate some wrong answers for query k-mers, you can use GATB as-is. It will usually return the right answer, but with a small probability (which can be tuned to be arbitrarily small) GATB will report that a k-mer is present in the graph when in fact it is not.
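To make the "present when in fact it is not" behavior concrete, here is a minimal sketch of a Bloom filter, a generic probabilistic membership structure with exactly this one-sided error. It is only an illustration: the class `BloomFilter`, its parameters, and the FNV-1a based hashing are assumptions chosen for the example, not GATB's code or API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative Bloom filter (hypothetical class, not part of GATB).
// A "no" answer is always correct; a "yes" answer can be a false positive.
// More bits per inserted k-mer make false positives rarer, at a memory cost.
class BloomFilter {
public:
    BloomFilter(std::size_t num_bits, std::size_t num_hashes)
        : bits_(num_bits, false), num_hashes_(num_hashes) {
        assert(num_bits > 0 && num_hashes > 0);
    }

    // Set one bit per hash function for this key.
    void insert(const std::string& key) {
        for (std::size_t i = 0; i < num_hashes_; ++i) {
            bits_[position(key, i)] = true;
        }
    }

    // True if every bit for this key is set. Bits set by other keys can
    // make this true for a key that was never inserted.
    bool possibly_contains(const std::string& key) const {
        for (std::size_t i = 0; i < num_hashes_; ++i) {
            if (!bits_[position(key, i)]) return false;
        }
        return true;
    }

private:
    // 64-bit FNV-1a hash, with the seed mixed into the initial state.
    static std::uint64_t fnv1a(const std::string& key, std::uint64_t seed) {
        std::uint64_t h = 14695981039346656037ULL ^ seed;
        for (unsigned char c : key) {
            h ^= c;
            h *= 1099511628211ULL;
        }
        return h;
    }

    // Double hashing: derive the i-th bit position from two base hashes.
    std::size_t position(const std::string& key, std::size_t i) const {
        const std::uint64_t h1 = fnv1a(key, 0);
        const std::uint64_t h2 = fnv1a(key, 0x9e3779b97f4a7c15ULL) | 1ULL;
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }

    std::vector<bool> bits_;
    std::size_t num_hashes_;
};
```

An inserted k-mer is always reported as possibly present, so the error only goes in one direction, which matches the trade-off described above.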

If you need an exact answer for each query (i.e. you cannot tolerate any mistake): unfortunately, GATB is designed to be memory-efficient, so we didn't implement exact graph membership queries, because doing so would make it significantly more memory-intensive. I can recommend an alternative: constructing a hash table of all k-mers, e.g. with Jellyfish; see https://github.com/gmarcais/Jellyfish/tree/master/examples/jf_count_dump
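For small inputs, the hash-table idea can be sketched with the C++ standard library alone. The example below is not Jellyfish's or GATB's API: the function names (`count_kmers`, `abundance`, `canonical`) are hypothetical, reads are assumed to be already loaded into memory as strings, and counting canonical k-mers (a k-mer and its reverse complement share one entry) is a design assumption you may not want. A dedicated counter such as Jellyfish will be far more memory-efficient on real data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Reverse complement of a DNA string; characters other than A/C/G/T become 'N'.
std::string reverse_complement(const std::string& s) {
    std::string rc(s.rbegin(), s.rend());
    for (char& c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
            default:  c = 'N'; break;
        }
    }
    return rc;
}

// Canonical form: the lexicographically smaller of a k-mer and its reverse complement.
std::string canonical(const std::string& kmer) {
    const std::string rc = reverse_complement(kmer);
    return std::min(kmer, rc);
}

using KmerCounts = std::unordered_map<std::string, unsigned long>;

// Count every canonical k-mer in the reads; windows containing non-ACGT
// characters are skipped.
KmerCounts count_kmers(const std::vector<std::string>& reads, std::size_t k) {
    assert(k > 0);
    KmerCounts counts;
    for (const std::string& read : reads) {
        for (std::size_t i = 0; i + k <= read.size(); ++i) {
            const std::string kmer = read.substr(i, k);
            if (kmer.find_first_not_of("ACGT") != std::string::npos) continue;
            ++counts[canonical(kmer)];
        }
    }
    return counts;
}

// Exact lookup: returns 0 for any k-mer that never occurred in the reads.
unsigned long abundance(const KmerCounts& counts, const std::string& kmer) {
    const auto it = counts.find(canonical(kmer));
    return it == counts.end() ? 0UL : it->second;
}
```

With a fixed set of query k-mers, you would build the table once with `count_kmers` and then call `abundance` for each query; a result of 0 means the k-mer is definitely absent from the reads, which is the exact-membership guarantee the probabilistic approach cannot give.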

5.4 years ago by Rayan Chikhi ★ 1.5k

Ok, I see. Indeed the Jellyfish approach you suggested was exactly what I was looking for. Thanks again for your help!

5.4 years ago by elebanjar ▴ 10