News: New versions of Minia and DSK (2.0.x)
10.2 years ago
Rayan Chikhi ★ 1.6k

Minia is a low-memory short-read assembler for large genomes. It creates contigs.

DSK is a low-memory k-mer counter.

We have ported Minia and DSK to a new codebase that uses the GATB library. To make the change clear, releases of Minia and DSK built on the new codebase are versioned 2.x.x from now on.

New features:

Minia 2.0.2

  • Faster (multi-core parallelism)
  • Slightly more accurate (the graph now carries coverage information, which helps discriminate between sequencing errors and polymorphism)
  • Less disk usage (because of DSK)
  • Can output unitigs

DSK 2.0.2

  • Faster (multi-core parallelism)
  • Less disk usage
  • Comparable performance to KMC2 (we're using their techniques :))

Download (Linux 64 bits):

For legacy purposes, the final versions of Minia and DSK 1.xxx (old codebase) remain available at http://minia.genouest.org/files/minia-1.6906.tar.gz and http://minia.genouest.org/dsk/dsk-1.6906.tar.gz.

However, we recommend using the 2.x.x versions: results are expected to be identical (DSK) or slightly better (Minia), and 2.x.x runs significantly faster (2x-4x) than the 1.xxx versions.

If you find a bug or run into an installation problem, you might be tempted to reply to this post. Please make a new Biostar post instead:

minia dsk assembly gatb • 4.6k views
ADD COMMENT
1
Entering edit mode

Nice. I am trying it out now on some reads I assembled last night with Abyss to compare.

BTW, on my Ubuntu distro (12.04), I had to:

sudo apt-get install libstdc++6

to get the precompiled minia binary to run.

ADD REPLY
1
Entering edit mode

Thanks. Going to fix that shortly (DSK fixed already -- Minia compatible binaries coming). EDIT: done

ADD REPLY
0
Entering edit mode

DSK binary not working on centos5, also due to libstdc++.

ADD REPLY
0
Entering edit mode

Oh.. OK, let's see, I have re-created the 2.0.1 binaries (minia+dsk) using static linking (-static flag) and static linking of libstdc++ (-static-libstdc++ flag). It gave me a warning ("Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking") but the binary seems to work on several different machines.

I don't have any centos5 machine but out of curiosity I tested using Docker:

sudo docker run -i -t centos:centos5 /bin/bash

(inside the docker image:)

cd /tmp && yum install -y wget && wget http://gatb-tools.gforge.inria.fr/versions/bin/dsk-2.0.1-Linux.tar.gz && tar xf dsk-2.0.1-Linux.tar.gz
dsk-2.0.1-Linux/bin/dsk

and it didn't complain about glibc or libstdc++.

Just for clarity (if anyone is confused by these command lines), there is no need to go through all of this to run DSK: the binary should work on linux 64 bits right away. This was just to illustrate how to test a program on Centos5.

ADD REPLY
0
Entering edit mode

I got "FATAL: kernel too old" on a centos5 VirtualBox. Docker probably can't reveal kernel problems, since a container runs on the host's kernel. I have compiled a version here on centos5. Most Broad machines run centos5, so I care; I keep a clean centos5 VirtualBox just for compiling.

ADD REPLY
1
Entering edit mode

Thanks, good to know that Docker isn't sufficient for kernel compatibility.

I've compiled a new release (that includes minor bugfixes), DSK/Minia 2.0.2, using a centos5 virtualbox.
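
For anyone who wants to check this on their own machine: a container shares the host's kernel, so a program inside `centos:centos5` sees whatever kernel the host runs, not a CentOS 5 one. The "kernel too old" message typically comes from a startup check in a statically linked glibc. Below is a minimal sketch that reports the running kernel release through the POSIX `uname()` call. The helper names `kernel_release` and `major_minor` are made up for illustration and are not part of DSK, Minia, or GATB.

```cpp
#include <sys/utsname.h>  // POSIX uname()
#include <cstdio>
#include <string>
#include <utility>

// Returns the running kernel release string, or an empty string if uname() fails.
// Hypothetical helper for illustration only.
std::string kernel_release() {
    struct utsname info;
    if (uname(&info) != 0) return "";
    return info.release;
}

// Extracts the leading "major.minor" numbers from a release string.
// Returns {-1, -1} when the string does not start with two dot-separated integers.
std::pair<int, int> major_minor(const std::string& release) {
    int major = 0, minor = 0;
    if (std::sscanf(release.c_str(), "%d.%d", &major, &minor) != 2) return {-1, -1};
    return {major, minor};
}
```

Calling `kernel_release()` inside a Docker container on a recent host should return the host's release string, which is why the Docker test above could not reproduce the failure seen on a real centos5 VirtualBox.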

ADD REPLY
0
Entering edit mode

What I like about kmc2 is that it provides relatively standalone lightweight APIs to access the k-mer count files. I can embed several c++ files directly into my source code and forget about extra dependencies. I assume to read dsk counts, I have to use the entire gatb?

ADD REPLY
0
Entering edit mode

That's a good point.. the answer is "yes" as of today.

The output of DSK is in HDF5 format. As @edrezen just told me, even if we remove the GATB dependency for parsing DSK results, you'd still need an HDF5 parser. At this point, since the hdf5 library is quite big, one might as well include the whole GATB.

If a developer is serious about parsing DSK results inside their software, please get in touch with us; I'm sure we can work something out (such as making DSK return an easy-to-parse, non-HDF5 output format). However, I'm missing a clear picture of an actual use case: if a developer has to parse DSK output (or KMC output, for that matter), are they packaging the source, or a binary, of DSK (resp. KMC) along with it?
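
To give an idea of how little code an easy-to-parse format would need on the reading side, here is a minimal sketch. It assumes a purely hypothetical plain-text dump with one `<kmer> <count>` pair per line; this is not an actual DSK output format, and the function names are invented for illustration.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Reads whitespace-separated "<kmer> <count>" pairs from a stream.
// The format is an assumption made for this sketch, not something DSK emits.
// Reading stops at end of input or at the first malformed pair.
std::unordered_map<std::string, std::uint64_t> read_kmer_counts(std::istream& in) {
    std::unordered_map<std::string, std::uint64_t> counts;
    std::string kmer;
    std::uint64_t count = 0;
    while (in >> kmer >> count) {
        counts[kmer] = count;
    }
    return counts;
}

// Convenience wrapper for text that is already in memory.
std::unordered_map<std::string, std::uint64_t> read_kmer_counts_from_string(const std::string& text) {
    std::istringstream in(text);
    return read_kmer_counts(in);
}
```

A reader like this needs only the standard library, which is the kind of dependency-free embedding the question above is asking for. It trades away the compactness of a binary format, so it would likely be practical only for moderately sized count tables.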

ADD REPLY
0
Entering edit mode

I use KMC2 for toy projects. I ask users to download and run the official KMC2 by themselves; I don't package the KMC2 binary, and I only use several of its files to read KMC2 k-mer counts. Bless, an error corrector, uses KMC2 too. It packages all the KMC2 source code because it has modified KMC2 to support MPI. Bless calls its own version of KMC2 and does not work with the official one. A lightweight API to access k-mer counts is of course not essential, but having one will encourage other developers to use DSK.

ADD REPLY
0
Entering edit mode

Oh I see.. also your error correction tool BFC (the KMC2 branch) provides a concrete example.

Didn't know about Bless' KMC2 modification, nice! For anyone interested (probably Guillaume will be), here is the diff between the kmer_counter folders of original KMC2 and Bless':

diff -ENwbu ./bkb_merger.h KMC/kmer_counter/bkb_merger.h
--- ./bkb_merger.h 2015-01-29 22:37:50.000000000 +0100
+++ KMC/kmer_counter/bkb_merger.h 2015-02-19 11:14:42.806689539 +0100
@@ -133,8 +133,7 @@
template<typename KMER_T, unsigned SIZE>
void CBigKmerBinMerger<KMER_T, SIZE>::Process()
{
- // BLESS
- int32 bin_id(0);
+ int32 bin_id;
uint32 size = 0;
uint32 counter_size = min(BYTE_LOG(cutoff_max), BYTE_LOG(counter_max));
uint32 lut_recs = 1 << 2 * lut_prefix_len;
diff -ENwbu ./defs.h KMC/kmer_counter/defs.h
--- ./defs.h 2015-01-29 22:37:50.000000000 +0100
+++ KMC/kmer_counter/defs.h 2015-02-19 11:14:42.806689539 +0100
@@ -83,10 +83,6 @@
#define MIN_SR 1
#define MAX_SR 16
-// BLESS
-#define MIN_NODES 1
-#define MIN_RANK 0
-
typedef float count_t;
diff -ENwbu ./kb_reader.h KMC/kmer_counter/kb_reader.h
--- ./kb_reader.h 2015-01-29 22:37:50.000000000 +0100
+++ KMC/kmer_counter/kb_reader.h 2015-02-19 11:14:42.806689539 +0100
@@ -44,10 +44,6 @@
int32 lut_prefix_len;
uint32 max_x;
- // BLESS
- int num_nodes;
- int rank;
-
bool both_strands;
bool use_quake;
@@ -86,10 +82,6 @@
max_x = Params.max_x;
s_mapper = Queues.s_mapper;
lut_prefix_len = Params.lut_prefix_len;
-
- // BLESS
- num_nodes = Params.num_nodes;
- rank = Params.rank;
}
//----------------------------------------------------------------------------------
@@ -117,8 +109,6 @@
bd->init_random();
while((bin_id = bd->get_next_random_bin()) >= 0) // Get id of the next bin to read
{
- // BLESS
- if ((bin_id % num_nodes) == rank) {
bd->read(bin_id, file, name, size, n_rec, n_plus_x_recs, buffer_size, kmer_len);
fflush(stdout);
@@ -198,11 +188,9 @@
//Remove temporary file
#ifndef DEVELOP_MODE
-// DEBUG
-// file->Remove();
+ file->Remove();
#endif
disk_logger->log_remove(size);
- } // BLESS
}
diff -ENwbu ./kb_storer.cpp KMC/kmer_counter/kb_storer.cpp
--- ./kb_storer.cpp 2015-01-29 22:37:49.000000000 +0100
+++ KMC/kmer_counter/kb_storer.cpp 2015-02-19 11:14:42.806689539 +0100
@@ -32,10 +32,6 @@
bd = Queues.bd;
working_directory = Params.working_directory;
- // BLESS
- num_nodes = Params.num_nodes;
- rank = Params.rank;
-
mem_mode = Params.mem_mode;
s_mapper = Queues.s_mapper;
@@ -194,15 +190,12 @@
for(int i = 0; i < n_bins; ++i)
{
- // BLESS
- if ((i % num_nodes) == rank) {
f_name = GetName(i);
buf_sizes[i] = 0;
files[i]->Open(f_name);
bd->insert(i, files[i], f_name, 0, 0, 0, 0);
- } // BLESS
}
return true;
diff -ENwbu ./kb_storer.h KMC/kmer_counter/kb_storer.h
--- ./kb_storer.h 2015-01-29 22:37:50.000000000 +0100
+++ KMC/kmer_counter/kb_storer.h 2015-02-19 11:14:42.806689539 +0100
@@ -40,10 +40,6 @@
uint64 max_mem_buffer;
uint64 max_mem_single_package;
- // BLESS
- int num_nodes;
- int rank;
-
CSignatureMapper *s_mapper;
CDiskLogger *disk_logger;
uchar* tmp_buff;
Common subdirectories: ./KMC and KMC/kmer_counter/KMC
diff -ENwbu ./kmer_counter.cpp KMC/kmer_counter/kmer_counter.cpp
--- ./kmer_counter.cpp 2015-01-29 22:37:50.000000000 +0100
+++ KMC/kmer_counter/kmer_counter.cpp 2015-02-19 11:14:42.806689539 +0100
@@ -153,11 +153,6 @@
cout << " -sp<value> - number of splitting threads\n";
cout << " -sr<value> - number of sorter threads\n";
cout << " -so<value> - number of threads per single sorter\n";
-
- // BLESS
- cout << " -d<value> - number of nodes\n";
- cout << " -a<value> - rank of a current node\n";
-
cout << "Example:\n";
cout << "kmc -k27 -m24 NA19238.fastq NA.res \data\kmc_tmp_dir\\n";
cout << "kmc -k27 -q -m24 @files.lst NA.res \data\kmc_tmp_dir\\n";
@@ -173,10 +168,6 @@
if(argc < 4)
return false;
- // BLESS
- Params.num_nodes = -1;
- Params.rank = -1;
-
for(i = 1 ; i < argc; ++i)
{
if(argv[i][0] != '-')
@@ -253,31 +244,6 @@
Params.p_mem_mode = true;
else if(strncmp(argv[i], "-b", 2) == 0)
Params.p_both_strands = false;
- // BLESS
- // number of nodes
- else if(strncmp(argv[i], "-d", 2) == 0)
- {
- tmp = atoi(&argv[i][2]);
- if(tmp < MIN_NODES)
- {
- cout << "Wrong parameter: the number of nodes " << tmp << " should be >= " << MIN_NODES << "\n";
- return false;
- }
- else
- Params.num_nodes = tmp;
- }
- // rank
- else if(strncmp(argv[i], "-a", 2) == 0)
- {
- tmp = atoi(&argv[i][2]);
- if(tmp < MIN_RANK)
- {
- cout << "Wrong parameter: the rank of nodes should be >= " << MIN_RANK << "\n";
- return false;
- }
- else
- Params.rank = tmp;
- }
// Number of reading threads
else if(strncmp(argv[i], "-sf", 3) == 0)
{
@@ -373,20 +339,6 @@
}
}
- // BLESS
- if (Params.num_nodes == -1) {
- cout << "No parameter: the option -n is mandatory" << "\n";
- return false;
- }
- else if (Params.rank == -1) {
- cout << "No parameter: the option -rank is mandatory" << "\n";
- return false;
- }
- else if (Params.rank >= Params.num_nodes) {
- cout << "Wrong parameter: the rank " << Params.rank << " should be smaller than the number of nodes " << Params.num_nodes << "\n";
- return false;
- }
-
if(argc - i < 3)
return false;
Common subdirectories: ./libs and KMC/kmer_counter/libs
diff -ENwbu ./params.h KMC/kmer_counter/params.h
--- ./params.h 2015-01-29 22:37:50.000000000 +0100
+++ KMC/kmer_counter/params.h 2015-02-19 11:14:42.814689539 +0100
@@ -105,10 +105,6 @@
vector<int> n_omp_threads;// number of OMP threads per sorters
uint32 max_x; //k+x-mers will be counted
- // BLESS
- int num_nodes;
- int rank;
-
uint32 gzip_buffer_size;
uint32 bzip2_buffer_size;
diff -ENwbu ./splitter.h KMC/kmer_counter/splitter.h
--- ./splitter.h 2015-01-29 22:37:50.000000000 +0100
+++ KMC/kmer_counter/splitter.h 2015-02-19 11:14:42.814689539 +0100
@@ -77,8 +77,7 @@
void InitBins(CKMCParams &Params, CKMCQueues &Queues);
~CSplitter();
- // BLESS
- bool ProcessReads(uchar *_part, uint64 _part_size, int num_nodes, int rank);
+ bool ProcessReads(uchar *_part, uint64 _part_size);
void Complete();
void GetTotal(uint64 &_n_reads);
@@ -97,14 +96,12 @@
template <> class CSplitter_Impl<false> {
public:
- // BLESS
- static bool ProcessReads(CSplitter<false> &ptr, uchar *_part, uint64 _part_size, int num_nodes, int rank);
+ static bool ProcessReads(CSplitter<false> &ptr, uchar *_part, uint64 _part_size);
};
template <> class CSplitter_Impl<true> {
public:
- // BLESS
- static bool ProcessReads(CSplitter<true> &ptr, uchar *_part, uint64 _part_size, int num_nodes, int rank);
+ static bool ProcessReads(CSplitter<true> &ptr, uchar *_part, uint64 _part_size);
};
//----------------------------------------------------------------------------------
@@ -524,11 +521,9 @@
//----------------------------------------------------------------------------------
// Process the reads from the given FASTQ file part
-// BLESS
-template <bool QUAKE_MODE> bool CSplitter<QUAKE_MODE>::ProcessReads(uchar *_part, uint64 _part_size, int num_nodes, int rank)
+template <bool QUAKE_MODE> bool CSplitter<QUAKE_MODE>::ProcessReads(uchar *_part, uint64 _part_size)
{
- // BLESS
- return CSplitter_Impl<QUAKE_MODE>::ProcessReads(*this, _part, _part_size, num_nodes, rank);
+ return CSplitter_Impl<QUAKE_MODE>::ProcessReads(*this, _part, _part_size);
}
//----------------------------------------------------------------------------------
@@ -545,8 +540,7 @@
//----------------------------------------------------------------------------------
// Process the reads from the given FASTQ file part
-// BLESS
-bool CSplitter_Impl<false>::ProcessReads(CSplitter<false> &ptr, uchar *_part, uint64 _part_size, int num_nodes, int rank)
+bool CSplitter_Impl<false>::ProcessReads(CSplitter<false> &ptr, uchar *_part, uint64 _part_size)
{
ptr.part = _part;
ptr.part_size = _part_size;
@@ -596,11 +590,8 @@
if (len >= ptr.kmer_len)
{
bin_no = ptr.s_mapper->get_bin_id(current_signature.get());
- // BLESS
- if ((bin_no % num_nodes) == (unsigned int)rank) {
ptr.bins[bin_no]->PutExtendedKmer(seq + i - len, len);
}
- }
len = 0;
++i;
break;
@@ -611,10 +602,7 @@
if (len >= ptr.kmer_len)
{
bin_no = ptr.s_mapper->get_bin_id(current_signature.get());
- // BLESS
- if ((bin_no % num_nodes) == (unsigned int)rank) {
ptr.bins[bin_no]->PutExtendedKmer(seq + i - len, len);
- }
len = ptr.kmer_len - 1;
}
current_signature.set(end_mmer);
@@ -628,10 +616,7 @@
else if (signature_start_pos + ptr.kmer_len - 1 < i)//need to find new signature
{
bin_no = ptr.s_mapper->get_bin_id(current_signature.get());
- // BLESS
- if ((bin_no % num_nodes) == (unsigned int)rank) {
ptr.bins[bin_no]->PutExtendedKmer(seq + i - len, len);
- }
len = ptr.kmer_len - 1;
//looking for new signature
++signature_start_pos;
@@ -652,10 +637,7 @@
if (len == ptr.kmer_len + 255) //one byte is used to store counter of additional symbols in extended k-mer
{
bin_no = ptr.s_mapper->get_bin_id(current_signature.get());
- // BLESS
- if ((bin_no % num_nodes) == (unsigned int)rank) {
ptr.bins[bin_no]->PutExtendedKmer(seq + i + 1 - len, len);
- }
i -= ptr.kmer_len - 2;
len = 0;
break;
@@ -666,12 +648,9 @@
if (len >= ptr.kmer_len)//last one in read
{
bin_no = ptr.s_mapper->get_bin_id(current_signature.get());
- // BLESS
- if ((bin_no % num_nodes) == (unsigned int)rank) {
ptr.bins[bin_no]->PutExtendedKmer(seq + i - len, len);
}
}
- }
putchar('*');
fflush(stdout);
@@ -684,8 +663,7 @@
//----------------------------------------------------------------------------------
// Process the reads from the given FASTQ file part
-// BLESS: the quake mode is true: no modification
-bool CSplitter_Impl<true>::ProcessReads(CSplitter<true> &ptr, uchar *_part, uint64 _part_size, int num_nodes, int rank)
+bool CSplitter_Impl<true>::ProcessReads(CSplitter<true> &ptr, uchar *_part, uint64 _part_size)
{
ptr.part = _part;
ptr.part_size = _part_size;
@@ -827,10 +805,6 @@
CSplitter<QUAKE_MODE> *spl;
uint64 n_reads;
- // BLESS
- int num_nodes;
- int rank;
-
public:
CWSplitter(CKMCParams &Params, CKMCQueues &Queues);
~CWSplitter();
@@ -848,10 +822,6 @@
pmm_fastq = Queues.pmm_fastq;
spl = new CSplitter<QUAKE_MODE>(Params, Queues);
spl->InitBins(Params, Queues);
-
- // BLESS
- num_nodes = Params.num_nodes;
- rank = Params.rank;
}
//----------------------------------------------------------------------------------
@@ -871,8 +841,7 @@
uint64 size;
if(pq->pop(part, size))
{
- // BLESS
- spl->ProcessReads(part, size, num_nodes, rank);
+ spl->ProcessReads(part, size);
pmm_fastq->free(part);
}
}
Common subdirectories: ./x64 and KMC/kmer_counter/x64
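
Most of the lines marked `// BLESS` in the diff add the same test, `(bin_id % num_nodes) == rank`, so that each node only handles the bins whose index matches its rank modulo the number of nodes. A minimal standalone sketch of that partitioning rule follows. The helper names `owns_bin` and `bins_for_rank` are made up here, and the sketch leaves out everything else that Bless and KMC2 do.

```cpp
#include <vector>

// Same test as in the BLESS-marked lines of the diff:
// a node with the given rank handles bin_id only if bin_id % num_nodes == rank.
bool owns_bin(int bin_id, int num_nodes, int rank) {
    return (bin_id % num_nodes) == rank;
}

// Lists the bins that one node would process out of n_bins in total.
// Hypothetical helper for illustration; it does not appear in KMC2 or Bless.
std::vector<int> bins_for_rank(int n_bins, int num_nodes, int rank) {
    std::vector<int> mine;
    for (int i = 0; i < n_bins; ++i) {
        if (owns_bin(i, num_nodes, rank)) mine.push_back(i);
    }
    return mine;
}
```

With 8 bins and 3 nodes, rank 1 would get bins 1, 4 and 7. Every bin lands on exactly one node, which is what lets the nodes work on disjoint sets of k-mers.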