Hello,
For the purpose of differential splicing analysis (on human samples), I have been running KisSplice on several datasets, with success. But recently, a particularly heavy batch of RNA-seq data has been causing me trouble.
The largest analysis so far was a total of 60 paired-end samples, with about 40M reads per sample (up to 110M reads for 3 samples). I don't quite remember the total size of the uncompressed FASTQ files, but I think it was around 600 GB. At first, HDD space was an issue (we need free space equal to 5 times the input volume, if I remember correctly), but a new 12 TB HDD solved this. I always run it with default parameters, but for this batch the timeout value was reached (the final results with KissDE were good nonetheless). This run still took 13 days to complete, on an Intel Xeon E5-2609 v4 with 64 GB of RAM. The command looks like this:
kissplice -t 16 -r ... -d ... -o ...
One thing I don't quite understand is that the job still seems to run on only one core even with -t 16, or rather, on only one of the 16 CPUs shown in the Ubuntu system monitor, even though Intel says this Xeon has 8 cores and 8 threads (maybe something here is beyond my understanding, as I'm not very knowledgeable in this field).
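In case it helps someone check the same thing, here is a minimal sketch of how the thread usage could be inspected from a terminal rather than from the system monitor. It assumes a Linux system with /proc mounted; `thread_count` and `logical_cpus` are hypothetical helper names of mine, not part of KisSplice.

```shell
#!/bin/sh
# Number of threads of a running process, read from /proc (Linux only).
# $1 is a PID, for example the PID of the running ks_debruijn4 step.
thread_count() {
    awk '/^Threads:/ {print $2}' "/proc/$1/status"
}

# Logical CPUs = sockets x cores per socket x threads per core.
# The three values can be read from the output of `lscpu`.
logical_cpus() {
    echo $(( $1 * $2 * $3 ))
}
```

For example, `thread_count <PID>` returning 1 while the de Bruijn graph is being built would match the "using 1 thread(s)" message in the console output, and `logical_cpus` with the values reported by `lscpu` should match what `nproc` prints.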
The error comes from the dataset I'm currently analysing. There are 32 paired-end samples, ~120M reads per sample, for a total of 1.6 TB of uncompressed FASTQ. It ran for a whole month before stopping with:
Problem with /usr/local/libexec/kissplice/ks_debruijn4
And that's it. There is nothing more in the log than the input command. There were still a few things on the console that I don't usually get with smaller datasets ([...] marks where I cut a LOT of content):
[09:59:21 26/08/2019] --> Building de Bruijn graph...
Graph will be written in /[...].[edges/nodes]
taille cell 32
Sequentially counting ~5655653 MB of kmers with 189 partition(s) and 42 passes using 1 thread(s), ~1024 MB of memory and ~137841 MB of disk space
| First step: Converting input file into Binary format |
[-------------------------------------------------------------------------------------------]
| Counting kmers |
1 % elapsed: 617 min 46 sec estimated remaining: 61158 min
[...]
100 % elapsed: 31632 min 58 sec estimated remaining: 0 min 0 sec
-------------------Counted kmers time Wallclock 1.90888e+06 s
------------------ Counted kmers and kept those with abundance >=2,
Writing positive Bloom Kmers 2867940000
6867663442 kmers written
-------------------Write all positive kmers time Wallclock 14315.3 s
Build Hash table 26840000End of debloom partition 26843546 / 26843545
6811627761 false positives written , partition 0
Build Hash table 53680000End of debloom partition 26843546 / 26843545
[...]
927413959 false positives written , partition 105
Build Hash table 2867940000Total nb false positives stored in the Debloom hashtable 880386881
-------------------Debloom time Wallclock 364595 s
Insert solid Kmers in Bloom 2867940000-------------------build DBG time Wallclock 384405 s
______________________________________________________
_______________________________________ minigraph_____
______________________________________________________
Extrapolating the number of branching kmers from the first 3M kmers: 150153807
Looping through branching kmer n° 431379600 / 431379809
-------------------nodes construction time Wallclock 31794.9 s
Problem with /usr/local/libexec/kissplice/ks_debruijn4
Starting with the very beginning: is "using 1 thread(s), ~1024 MB of memory" normal? None of this extra information ever showed up for the other datasets.
I tried a run with only 4 of these samples, and it went without a hitch (but it still took something like 50 hours).
I don't get what the problem could be. I have plenty of free space in the temp directory (a little more than 11 TB), and 64 GB of RAM plus 10 GB of swap. The computer never seemed to be particularly stressed: one CPU was at 100% and the 15 others below 5%, and RAM usage never went beyond 8 or 9 GB. I can't be sure about the RAM, but I never got a message about running out of memory, which I did get while running some other tools (not at the same time as KisSplice, of course).
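To make the disk-space reasoning explicit, here is a small sketch of the check I have in mind. The factor of 5 is only the rule of thumb I recalled earlier, not a documented value, and `required_gb` and `has_room` are hypothetical helper names.

```shell
#!/bin/sh
# required_gb INPUT_GB [FACTOR]
# Estimated free space needed, assuming FACTOR times the input size
# (FACTOR defaults to 5, which is an assumption, not a documented value).
required_gb() {
    echo $(( $1 * ${2:-5} ))
}

# has_room FREE_GB INPUT_GB [FACTOR]
# Succeeds if the free space covers the estimate.
has_room() {
    [ "$1" -ge "$(required_gb "$2" "${3:-5}")" ]
}

# Current dataset: about 1600 GB of FASTQ, about 11000 GB free.
required_gb 1600          # prints 8000
has_room 11000 1600 && echo "enough free space under this assumption"
```

Under that assumption the current dataset would need roughly 8 TB, which is below the 11 TB available, so disk space does not look like the obvious culprit.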
I almost forgot: before each run, I do:
ulimit -s unlimited
This sets the stack size to unlimited; otherwise it is 8192 kB by default. (Not doing that was a problem in the first analysis I did with KisSplice, so now I always do it.)
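Since a month-long run is a bad place to discover that the limit was not applied, the setting can be verified right before launching. This is only a sketch; `check_stack` is a hypothetical helper of mine, and it takes the value as an argument so that it can be tried with any input.

```shell
#!/bin/sh
# check_stack VALUE
# VALUE is what `ulimit -s` reports in the current shell.
# Succeeds only if the stack limit is unlimited.
check_stack() {
    case "$1" in
        unlimited) return 0 ;;
        *) echo "stack limit is $1, expected unlimited" >&2
           return 1 ;;
    esac
}
```

It would be used in the same shell as the run, for example `ulimit -s unlimited && check_stack "$(ulimit -s)"` followed by the usual kissplice command, because the limit only applies to the shell where it was set and to the processes started from it.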
EDIT: I have tried again with the -z and -C (0.10) options, but a power cut made me lose 3 weeks of analysis... I will try again, maybe.
EDIT2: I now have access to a computer cluster, but the storage space is largely insufficient (both for the FASTQ files and for the working directory). So, back to square one.
Thank you in advance for any help I could get with this!