First, thank you for developing the excellent looking GATB library. I'm excited to use it to test out some new ideas, but I had a few basic questions about using the de Bruijn graph. First, I was curious how I retrieve a de Bruijn graph node given the k-mer to which it corresponds. I looked through the documentation, but I couldn't really find any such ability. Does the de Brujin graph in GATB support this (looking up a node by its k-mer?). Second, the release notes for the new version mention that it is now possible to store arbitrary information at node vertices, but I couldn't seem to dig up how to do this in the documentation. Are there any pointers to this?
Thank you for your interest in GATB. Concerning your questions:
Our implementation of the de Bruijn graph implies that you normally can't get a Node instance of the graph from an arbitrary kmer value; in fact, you can (a) iterate all the nodes of a given graph (from the information stored in a HDF5 file), and (b) get the neighbors of a given node N from a Bloom filter based structure in memory (but you have to be sure that node N is in the graph, so it should have been got from the all nodes iteration).
The only possibility is when you are sure that some kmer value corresponds to a node of your graph; then you can use the Graph::buildNode method that will return a Node instance from a kmer as an ASCII string (see here). It is also possible to get a Node instance from the integer value of a kmer; there is no snippet showing this but I can provide some example if you are interested.
If you really want to know if a kmer belongs to a graph, you must make yourself a lookup in the kmers stored in the HDF5 file. Although it looks tedious and time consuming, it must be noticed that kmers in the HDF5 files are stored in partitions (so one can look directly in the good partition from the query kmer) and that kmers of a partition are sorted which makes the lookup easy. I think I should add some snippet showing how to do this because some people could really need it.
You can indeed tag information for any node now; we use the very nice EMPHF library for doing it. The idea is to get a unique integer in interval [0,|G|-1] (where |G| is the number of nodes of the graph) from a kmer value of a node; then you can allocate an array of size |G| of any type of your choice and use the EMPH hash function to get the index of the array to be used. Right now, the documentation may lack some examples; you can have a look at the Graph::queryAbundance method (file Graph.cpp) that shows how to get the coverage of an arbitrary node. Again, I think we should add some snippet about this topic. Note: if you want get the EMPH hash function during the graph building, you have to use -mphf emphf (see here)
Thanks for your quick response! So, basically I'm doing some experimentation with read-mapping ideas, and so I want to be able to query for k-mers from a read in the de Bruijn graph, and this operation should be as fast as possible. What is your recommendation for achieving this? The naive approach that first comes to my mind is to keep a e.g. hash map from k-mers to de Bruijn node, or at least a hash set of the nodes actually in the graph so that you could check the set and then use the strategy you mention above to retrieve a node when you are sure it exists. However, this seems like it would result in quite a large memory overhead (since you'd basically be storing the node information of the de Bruijn graph all over again). Any suggestions you have on this front would be greatly appreciated!
ADD REPLY
• link
updated 2.1 years ago by
Ram
44k
•
written 9.4 years ago by
Rob
6.9k
0
Entering edit mode
Also, I'm curious how I would get the actual integer that emph associates with each node. This seems like it would be very useful, as it would allow you to directly look up any meta-information about a node that you want. I see the example about abundance that you pointed out in the documentation, but it's not clear (to me) how to extend this to arbitrary user-supported data (since e.g. abundance is already a member of the Node class but the data a user might wish to store is not). I see though, how if I could get the emph id for a node, I could store any extra information I wanted.
ADD REPLY
• link
updated 2.1 years ago by
Ram
44k
•
written 9.4 years ago by
Rob
6.9k
1
Entering edit mode
I forgot to mention the Graph::nodeMPHFIndex method. It think it does what you want, i.e., get an emph id from a node. It is not yet very well documented but would deserve more visibility. Note that it will return always 0 if the -mphf has not been used.
ADD REPLY
• link
updated 2.1 years ago by
Ram
44k
•
written 9.4 years ago by
edrezen
▴
730
0
Entering edit mode
Rob, a critical aspect of the dBG in GATB is that you can test membership of a node in the graph but there will be false positives. The function Graph::contains(Graph::buildNode(kmer)) can be used to test membership of an arbitrary kmer. Then, as Erwan said Graph::nodeMPHFIndex can be used to obtain an integer for that kmer. If that kmer didn't belong to the graph (i.e. it's a false positive), you'll still get an integer but it will be the result of a collision with some other true positive kmer.
Note that Graph::buildNode could be further optimized for quickly processing consecutive kmers in a read. Let us know if you're using this solution, we then could optimize it for you.
I appreciate that you're looking into GATB though :) The philosophy of that library was that the graph could easily be constructed and traversed (and now, arbitrary information can be added to nodes via the integer mapping; also, nodes can be deleted), but arbitrary membership queries wouldn't be done. Although, such a use case mustn't be excluded..
(Rob, if you can, please tag this post with "gatb" so that other gatb developers/users will see it)
If I wanted to create a super-fast strategy for exact kmer membership queries at the expense of a significant memory overhead, I'd do it along these lines:
Preprocessing: store all nodes in the graph in an array, indexed by their MPHF index.
(extra memory usage: number of kmers * 2 * k bits. So, that's roughly 23 GB for human and k=31)
bool membership(KmerType kmer)
{
uint64_t I = Graph::nodeMPHFIndex(Graph::buildNode(kmer));
// that buildNode call could be easily removed later
return kmer == kmers[i];
}
Just to be curious, is it mandatory for you not to get false positives? Membership (with possible false positives) is done in O(1) in GATB (query of a Bloom filter) so is efficient.
I think that FP rates is today about a few percent. It could be reduced by the following trick (not tested): for the query kmer K, get its successors and for each successor, get its predecessors; in case of true positive, we must have K in the predecessors of its successors. It should avoid some FP by the fact that neighborhood queries use a critical false positive set. This trick needs some computation but no memory; it is "almost" in O(1) ("almost" because sometimes a log(N) operation has to be done on the set)
Rayan, do you think it would be worthy to benchmark such a "modified" FP rate by using this trick, compared to the actual FP rate due to Bloom filter usage?
ADD REPLY
• link
updated 2.1 years ago by
Ram
44k
•
written 9.3 years ago by
edrezen
▴
730
0
Entering edit mode
Erwan, thanks for the suggestion. Well, it might be, indeed; let's see what Rob thinks about false positives.
I'd like to see an actual application before going in that direction.
Thanks for the useful replies! To give a little more information, my intended usage for the RapMap project. It's a "ridiculously" fast lightweight/pseudo aligner for mapping reads to transcriptomes (the lightweight/pseudo part means that it yields only transcripts and positions, and doesn't compute actual optimal alignments). The index for this project is based on the idea presented in the Kallisto pre-print. Basically, you build a transcript De Bruijn graph, and you label each contig as an equivalence class over k-mers (the label is the transcripts in which these k-mers appear). In addition to the equivalence class, you also record, for each k-mer, information that allows you to compute the distance (in # of k-mers) in the forward and reverse directions until the equivalence class changes. The idea here is that, when mapping a read, only k-mers from a different equivalence class are informative, so you can skip hashing many k-mers.
So, for my usage purposes, it is fairly important to not have false positives. However, it's equally important that, given a k-mer, I be able to look up information about it in auxiliary tables (hence my interest in the MPHF). My current solution relies on using the Jellyfish hash as a library, since it provides a reasonable speed hash with good memory usage (much less, e.g. than std::unordered_map and even google's sparse_hash). However, I'm on the lookout for things that are either faster at a similar size, or a similar speed but smaller. One of the great things about GATB (apart from being well-documented, well-engineered and seemingly efficient) is that all of the primitives for traversing the dBG would make the index building phase simpler and potentially more efficient.
ADD REPLY
• link
updated 2.1 years ago by
Ram
44k
•
written 9.3 years ago by
Rob
6.9k
0
Entering edit mode
Hi Rob,
Thanks for providing the context! RapMap looks like a promising approach.
Indeed, since you're currently using Jellyfish (which stores keys explicitly), then a solution based on MPHF along with storing the keys could be competitive both in terms of memory (it's essentially a hash table with load factor of 1) and speed (no rehashing). I'd be happy to assist if you're thinking of creating a GATB branch of RapMap.
Additionally, our current development version of GATB has graph simplification functions (deleting tips and/or bubbles, corresponding to errors/SNPs, resp.). Perhaps these could also help? If you're creating contigs, then you may not want two simple paths, separated by only a branch leading to a sequencing error, to belong to two different equivalent classes.
Anyhow, more discussion can either take place here, or by email : best way to reach all GATB devs is: gatb-tools-support@lists.gforge.inria.fr
Great! I'd definitely be interested in a GATB branch of RapMap. I'll get in touch via the e-mail list. The current mapper assumes the given reference transcriptome is "infallible", but I like your suggestion of optionally correcting for potential assembly errors.
--Rob
ADD REPLY
• link
updated 2.1 years ago by
Ram
44k
•
written 9.3 years ago by
Rob
6.9k
0
Entering edit mode
Ohh I misunderstood and thought you were constructing a de Bruijn graph of reads. But it's actually a de Bruijn graph of a reference transcriptome, right? It makes less sense to correct it then.
If it's a fixed set of transcripts, then a MPHF sounds even better, as it takes some time to construct (still not so long though) but should be the ideal method for hashing a static set of kmers. Jellyfish structure is, I think, optimized for inserting new elements in reasonable time; which you might not need here?
Although in all fairness, Jellyfish table is, as you said, quite impressive.
Thanks for your quick response! So, basically I'm doing some experimentation with read-mapping ideas, and so I want to be able to query for k-mers from a read in the de Bruijn graph, and this operation should be as fast as possible. What is your recommendation for achieving this? The naive approach that first comes to my mind is to keep a e.g. hash map from k-mers to de Bruijn node, or at least a hash set of the nodes actually in the graph so that you could check the set and then use the strategy you mention above to retrieve a node when you are sure it exists. However, this seems like it would result in quite a large memory overhead (since you'd basically be storing the node information of the de Bruijn graph all over again). Any suggestions you have on this front would be greatly appreciated!
Also, I'm curious how I would get the actual integer that emph associates with each node. This seems like it would be very useful, as it would allow you to directly look up any meta-information about a node that you want. I see the example about abundance that you pointed out in the documentation, but it's not clear (to me) how to extend this to arbitrary user-supported data (since e.g. abundance is already a member of the Node class but the data a user might wish to store is not). I see though, how if I could get the emph id for a node, I could store any extra information I wanted.
I forgot to mention the
Graph::nodeMPHFIndex
method. It think it does what you want, i.e., get an emph id from a node. It is not yet very well documented but would deserve more visibility. Note that it will return always 0 if the-mphf
has not been used.Rob, a critical aspect of the dBG in GATB is that you can test membership of a node in the graph but there will be false positives. The function
Graph::contains(Graph::buildNode(kmer))
can be used to test membership of an arbitrary kmer. Then, as Erwan saidGraph::nodeMPHFIndex
can be used to obtain an integer for that kmer. If that kmer didn't belong to the graph (i.e. it's a false positive), you'll still get an integer but it will be the result of a collision with some other true positive kmer.Note that
Graph::buildNode
could be further optimized for quickly processing consecutive kmers in a read. Let us know if you're using this solution, we then could optimize it for you.I appreciate that you're looking into GATB though :) The philosophy of that library was that the graph could easily be constructed and traversed (and now, arbitrary information can be added to nodes via the integer mapping; also, nodes can be deleted), but arbitrary membership queries wouldn't be done. Although, such a use case mustn't be excluded..
(Rob, if you can, please tag this post with "gatb" so that other gatb developers/users will see it)
If I wanted to create a super-fast strategy for exact kmer membership queries at the expense of a significant memory overhead, I'd do it along these lines:
Preprocessing: store all nodes in the graph in an array, indexed by their MPHF index.
(extra memory usage:
number of kmers * 2 * k
bits. So, that's roughly 23 GB for human and k=31)Rob,
Just to be curious, is it mandatory for you not to get false positives? Membership (with possible false positives) is done in O(1) in GATB (query of a Bloom filter) so is efficient.
I think that FP rates is today about a few percent. It could be reduced by the following trick (not tested): for the query kmer K, get its successors and for each successor, get its predecessors; in case of true positive, we must have K in the predecessors of its successors. It should avoid some FP by the fact that neighborhood queries use a critical false positive set. This trick needs some computation but no memory; it is "almost" in O(1) ("almost" because sometimes a log(N) operation has to be done on the set)
Rayan, do you think it would be worthy to benchmark such a "modified" FP rate by using this trick, compared to the actual FP rate due to Bloom filter usage?
Erwan, thanks for the suggestion. Well, it might be, indeed; let's see what Rob thinks about false positives.
I'd like to see an actual application before going in that direction.
Hi Ryan and Erwan,
Thanks for the useful replies! To give a little more information, my intended usage for the RapMap project. It's a "ridiculously" fast lightweight/pseudo aligner for mapping reads to transcriptomes (the lightweight/pseudo part means that it yields only transcripts and positions, and doesn't compute actual optimal alignments). The index for this project is based on the idea presented in the Kallisto pre-print. Basically, you build a transcript De Bruijn graph, and you label each contig as an equivalence class over k-mers (the label is the transcripts in which these k-mers appear). In addition to the equivalence class, you also record, for each k-mer, information that allows you to compute the distance (in # of k-mers) in the forward and reverse directions until the equivalence class changes. The idea here is that, when mapping a read, only k-mers from a different equivalence class are informative, so you can skip hashing many k-mers.
So, for my usage purposes, it is fairly important to not have false positives. However, it's equally important that, given a k-mer, I be able to look up information about it in auxiliary tables (hence my interest in the MPHF). My current solution relies on using the Jellyfish hash as a library, since it provides a reasonable speed hash with good memory usage (much less, e.g. than
std::unordered_map
and even google'ssparse_hash
). However, I'm on the lookout for things that are either faster at a similar size, or a similar speed but smaller. One of the great things about GATB (apart from being well-documented, well-engineered and seemingly efficient) is that all of the primitives for traversing the dBG would make the index building phase simpler and potentially more efficient.Hi Rob,
Thanks for providing the context! RapMap looks like a promising approach.
Indeed, since you're currently using Jellyfish (which stores keys explicitly), then a solution based on MPHF along with storing the keys could be competitive both in terms of memory (it's essentially a hash table with load factor of 1) and speed (no rehashing). I'd be happy to assist if you're thinking of creating a GATB branch of RapMap.
Additionally, our current development version of GATB has graph simplification functions (deleting tips and/or bubbles, corresponding to errors/SNPs, resp.). Perhaps these could also help? If you're creating contigs, then you may not want two simple paths, separated by only a branch leading to a sequencing error, to belong to two different equivalent classes.
Anyhow, more discussion can either take place here, or by email : best way to reach all GATB devs is: gatb-tools-support@lists.gforge.inria.fr
Hi Rayan,
Great! I'd definitely be interested in a GATB branch of RapMap. I'll get in touch via the e-mail list. The current mapper assumes the given reference transcriptome is "infallible", but I like your suggestion of optionally correcting for potential assembly errors.
--Rob
Ohh I misunderstood and thought you were constructing a de Bruijn graph of reads. But it's actually a de Bruijn graph of a reference transcriptome, right? It makes less sense to correct it then.
If it's a fixed set of transcripts, then a MPHF sounds even better, as it takes some time to construct (still not so long though) but should be the ideal method for hashing a static set of kmers. Jellyfish structure is, I think, optimized for inserting new elements in reasonable time; which you might not need here?
Although in all fairness, Jellyfish table is, as you said, quite impressive.