How to get the histogram of a graph
2
1
Entering edit mode
9.2 years ago

I would like to retrieve the abundance histogram of a graph that I am building from a set of sequences.

I have seen how I can iterate on the graph and get the distribution by myself. However, it seems, from the options, that the distribution is computed while building the graph. The Histogram class comes with all the bells and whistles, and I would like to use it. Am I right? Can I access it?

I have also perused your examples, but I cannot find what I am looking for:

  • Example kmer/kmer13.cpp shows how to build the histogram for a Bank.
  • Example debruijn/debruijn26.cpp shows how the get the abundance of a node.
  • Example storage/storage6.cpp shows how to get the distribution from a file generated by DSK.

Any pointers?

gatb • 3.2k views
ADD COMMENT
0
Entering edit mode

Many thanks, guys!

I am really sorry that I did not make my question more clear: I wanted to program it using the gatb-core package.

My aim is gatb in my program, which needs this histogram (and could make used of the Histogram class).

ADD REPLY
2
Entering edit mode

You can retrieve the histogram information as a typed collection in the HDF5 file. This can be done with the gatb-core library this way:

#include <gatb/gatb_core.hpp>
using namespace std;

int main (int argc, char* argv[])
{
    // We load the HDF5 file, output of DSK
    Storage* dskoutput = StorageFactory(STORAGE_HDF5).load (argv[1]);
    LOCAL (dskoutput);

    // We get the histogram group
    Group& histoGroup = dskoutput->getGroup("histogram");

    // We get the histogram collection (couples of [index, abundance])
    Collection<IHistogram::Entry>& collection = histoGroup.getCollection<IHistogram::Entry> ("histogram");

    // We iterate the histogram information.
    Iterator<IHistogram::Entry>* iter = collection.iterator();
    LOCAL (iter);
    for (iter->first(); !iter->isDone(); iter->next())
    {
        cout << iter->item().index << " " << iter->item().abundance << endl;
    }
}

Actually, the Histogram class is not able to load the HDF5 file and provide an API for getting the information ; note that such a feature could be added in a future version.

ADD REPLY
0
Entering edit mode

Cool! Exactly what I needed. Sorry, I did not get that DSK was internally used for creating the HDF5 file.

ADD REPLY
3
Entering edit mode
9.2 years ago

Hi Matthias,

Here is a quick and dirty answer.

First extract the histogram from the h5 file:

gatb-core/bin/h5dump -y -d histogram/histogram mygraph.h5 | grep "^\ *[0-9]" | paste - - | sed s/\ *//g | sed s/\,// > histo_data

Then plot the distribution:

Rscript plot-kmer-histo.R histo_data

With plot-kmer-histo.R (from Claire Lemaitre)

#!/usr/bin/Rscript

#usage :  ./plot-kmer-histo.R  dsk.histo
# or      RScript plot-kmer-histo.R  dsk.histo
args=commandArgs(TRUE)
if (length(args)<1) {
cat("Usage:\n")
cat("./plot-kmer-histo.R dsk.histo [output.png]\n")
cat("1 obligatory argument :\n   - kmer histo file (as output by DSK)\n")
cat("optional arguments :\n   - output file (default = input_file.png)\n")
quit()
}

hist.file=args[1]

tab=read.table(hist.file)[,1:2]

if (length(args)>1) {
    png.file=args[2]
}else{
    png.file=paste(hist.file,".png",sep="")
}
bitmap(png.file,"png256",width=7,height=6,res=300)
suppressWarnings (plot(tab$V1,tab$V2,type="h",log="y",xlab="kmer coverage",ylab="count",main=""))
d=dev.off()
ADD COMMENT
0
Entering edit mode

I accepted this answer, because it precisely answer the (ill formulated) question.

I also upvoted the comment that helped me solving my problem.

ADD REPLY
1
Entering edit mode
9.2 years ago
edrezen ▴ 730

Hello,

If you just want get the histogram without programming, you can use the dbgh5 binary provided by the gatb-core archive.

For example, you can run dbgh5 -in myreads.fa -bloom none and you will get as output a HDF5 file myreads.h5 holding information about the kmers and their abundances (-bloom none is used to stop just after the kmers counting).

Then, you can use h5dump (also provided by gatb-core) to extract the histogram information from the h5 file with the following command:

h5dump -y -d histogram/histogram myreads.h5 | grep [0-9] | grep -v [A-Z].* | paste - -

=> each line gives a couple [x,y] where x is the kmer occurence and y how many distinct kmers have occurence number x.

If you want to get the kmers histogram while using the gatb-core library in your own C/C++ software, I can give you an explicit code sample.

ADD COMMENT

Login before adding your answer.

Traffic: 1978 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6