I recently started using the GATB-core library for counting kmers in reads. Similar to the example code given in "kmer9.cpp" in the Git-Repo, I'm using SortingCountAlgorithm for counting the kmers. Now my (very basic) question: given a specific kmer sequence, is there any way to directly look up its abundance computed by the algorithm (or do I need to iterate through the computed [kmer, abundance] pairs until I find the kmer in question)? Thanks in advance!
Thank you for the quick reply, that helps already! In my setting, I don't know beforehand whether a specific kmer would be present in the reads (i.e. the graph), since I have a fixed set of kmers for which I want to know how often they occur in the reads. Using the approach you suggested, is there a way to check if a kmer sequence is present in the graph to make sure I only look up abundances for those that are actually in the graph?
If you can tolerate that some of the answers for query k-mers will be wrong: then you can use GATB as-is and it will often return the right answer, but with a small probability (can be tuned to be arbitrarily very small) GATB will return that a k-mer is present in the graph when in fact it is not.
If you need an exact answer for each query (i.e. cannot tolerate any mistake): unfortunately GATB is made such that it's memory-efficient and we thus didn't implement exact graph membership queries. Because doing so would make it significantly more memory-intensive. I can recommend an alternative: constructing a hash table of all k-mers, using e.g. Jellyfish, see https://github.com/gmarcais/Jellyfish/tree/master/examples/jf_count_dump
Ok, I see. Indeed the Jellyfish approach you suggested was exactly what I was looking for. Thanks again for your help!