I've been looking at the snippets and examples in the source code but none of them mention using paired read information. Is it possible with GATB to create graphs that are decorated using paired read information?
Hello,
The graph API of GATB-CORE does not directly support paired reads.
However, we have recently added the possibility to decorate the nodes of the graph with any kind of information, so it would be possible (after graph creation) to map information from the reads (paired reads in your case) to the nodes of the de Bruijn graph.
This new feature relies on a minimal perfect hash function library, EMPHF; such a hash function takes about 2.61 bits per node (in addition to the roughly 8.6 bits per node needed to store the de Bruijn graph). On top of that, you of course need N bits per node, where N is the size of the information you want to decorate the nodes with.
If you are interested, I could add some snippets showing how to decorate the nodes with information from paired reads (assuming that two consecutive reads make a pair).
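To make the per-node budget concrete, here is a rough back-of-the-envelope sketch. It only adds up the figures quoted above (about 8.6 bits for the graph, about 2.61 bits for the hash function, plus N bits of payload). The function name and the example node count are made up for illustration; none of this is part of GATB-CORE.

```python
def graph_memory_bytes(num_nodes, payload_bits, graph_bits=8.6, mphf_bits=2.61):
    """Estimate memory (in bytes) for a decorated de Bruijn graph.

    graph_bits and mphf_bits are the approximate per-node figures quoted
    in this thread; payload_bits is N, the size of the per-node decoration.
    """
    total_bits = num_nodes * (graph_bits + mphf_bits + payload_bits)
    return total_bits / 8


# Hypothetical example: 3 billion nodes, each decorated with an 8-bit value.
estimate = graph_memory_bytes(3_000_000_000, payload_bits=8)
print(f"{estimate / 1e9:.1f} GB")  # prints "7.2 GB"
```

The payload term dominates quickly: with these figures, anything above roughly 11 bits of decoration per node costs more than the graph and the hash function combined.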
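In the meantime, here is a minimal sketch of the idea in plain Python. It is not the GATB-CORE API: a regular dict stands in for the minimal perfect hash function (it maps each node to an index), and the decoration is a hypothetical 8-bit counter recording how many read pairs touch each node. It assumes that two consecutive reads make a pair and, for simplicity, ignores reverse complements.

```python
def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, for simplicity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def decorate_with_pairs(reads, k):
    """Return (node_index, pair_count) for a list of interleaved paired reads.

    node_index: dict mapping each k-mer (node) to an integer index.
                This is a stand-in for the MPHF, not the real thing.
    pair_count: bytearray with one byte per node, holding the number of
                read pairs that contain the node (saturating at 255).
    """
    if len(reads) % 2 != 0:
        raise ValueError("expected an even number of reads (consecutive reads form a pair)")

    # Step 1: "graph creation". Collect the nodes and give each one an index.
    node_index = {}
    for read in reads:
        for kmer in kmers(read, k):
            node_index.setdefault(kmer, len(node_index))

    # Step 2: decoration. One byte of payload per node (N = 8 bits).
    pair_count = bytearray(len(node_index))
    for pair_id in range(len(reads) // 2):
        left, right = reads[2 * pair_id], reads[2 * pair_id + 1]
        # Count each pair at most once per node, whichever mate contains it.
        touched = set(kmers(left, k)) | set(kmers(right, k))
        for kmer in touched:
            idx = node_index[kmer]
            if pair_count[idx] < 255:
                pair_count[idx] += 1
    return node_index, pair_count


reads = ["ACGTA", "CGTAC",   # pair 0
         "ACGTT", "TTTTT"]   # pair 1
node_index, pair_count = decorate_with_pairs(reads, k=4)
print(pair_count[node_index["ACGT"]])  # prints 2: ACGT occurs in both pairs
```

The point of the fixed-size payload is that memory stays at N bits per node regardless of the number of reads; storing the pair IDs themselves would grow with the data set.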
Erwan, let's make the MPHF announcement more "official"; it deserves more exposure. GATB-CORE release 1.0.5 from Sept 9 doesn't seem to include your latest patches, so I didn't want to talk about it yet. Let me know when 1.0.6 is released and I'll make a proper Biostar post.
cts asked a very good question. My take on it is that de Bruijn graphs are typically not decorated with reads, even unpaired ones, as naively storing a read-node association could be quite memory-expensive. (Naively, all k-mers from a read would have to be associated with the ID of that read; assuming a billion reads, 100 k-mers per read, and 4 bytes per read ID, that's at least 400 GB of memory just for this association.)
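The 400 GB figure is just the product of the three numbers above. Spelled out (the function name is only for illustration):

```python
def naive_association_bytes(num_reads, kmers_per_read, bytes_per_read_id):
    """Memory needed to store one read ID for every k-mer of every read."""
    return num_reads * kmers_per_read * bytes_per_read_id


# Figures from the estimate above: a billion reads, 100 k-mers each, 4-byte IDs.
total = naive_association_bytes(10**9, 100, 4)
print(total / 10**9, "GB")  # prints "400.0 GB"
```

This is a lower bound: it counts only the read IDs, not the container that holds them or the graph itself.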
That's essentially the reason why Velvet, which does store read info in the graph, is not really memory-efficient, and SOAPdenovo's main improvement upon it was to remove most of the read tracking from the in-memory graph.
There has been research on dBGs that incorporate paired read information more cleverly, but to the best of my knowledge, these approaches haven't scaled to large genomes.
That being said, many assemblers still manage to go back to the reads after having constructed the dBG. This is often done by mapping the reads to a condensed graph, where all simple paths are replaced by single nodes. GATB does not support a condensed graph data structure. (The BCALM/dbgfm suite might be better suited for this task, see this paper)
Anyhow, the GATB philosophy is that you can already achieve many tasks with a simple dBG, one that is not annotated with reads. If this is not enough, one can add a post-processing step that maps the reads back to sequences constructed from the dBG (similar to the idea in the previous paragraph). E.g. the discoSNP tool follows this idea (one module implements a dBG and outputs results, another module checks/annotates the results with the reads).
An alternative route is to decorate the GATB graph with custom information using EMPHF. That's a new feature that we'll announce shortly, but Erwan gave you a preview :) It's not a silver bullet, though: as I said above, storing a read-node association could be quite expensive.
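To illustrate the post-processing idea, here is a toy sketch in plain Python. It is not how discoSNP or any GATB tool does it; all names are hypothetical. It indexes the k-mers of the sequences produced from the dBG, assigns each read to the sequence it shares the most k-mers with, and then counts read pairs whose two mates land on different sequences. It assumes interleaved pairs (two consecutive reads make a pair), exact k-mer matches, and forward strand only.

```python
from collections import Counter


def index_contigs(contigs, k):
    """Map each k-mer to the set of contig IDs (list positions) containing it."""
    index = {}
    for contig_id, seq in enumerate(contigs):
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(contig_id)
    return index


def map_read(read, index, k):
    """Return the contig sharing the most k-mers with the read, or None."""
    votes = Counter()
    for i in range(len(read) - k + 1):
        for contig_id in index.get(read[i:i + k], ()):
            votes[contig_id] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def pair_links(reads, contigs, k):
    """Count read pairs whose mates map to two different contigs."""
    index = index_contigs(contigs, k)
    links = Counter()
    for i in range(0, len(reads) - 1, 2):
        a = map_read(reads[i], index, k)
        b = map_read(reads[i + 1], index, k)
        if a is not None and b is not None and a != b:
            links[tuple(sorted((a, b)))] += 1
    return links


contigs = ["ACGTACGGA", "TTTGGGCCC"]
reads = ["ACGTAC", "GGGCCC",   # mates fall on contigs 0 and 1
         "CGTACG", "TACGGA"]   # both mates fall on contig 0
links = pair_links(reads, contigs, k=4)
print(links[(0, 1)])  # prints 1
```

Memory here scales with the number of output sequences and links, not with the number of reads, which is the appeal of doing this after the graph step rather than inside it.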
Thank you, Rayan, for your detailed comments.
Yes, some snippets on decorating nodes with information would be much appreciated, thank you.