Hello,
It clearly looks like a file system issue during the DSK kmer counting step (see the traces and the many HDF5 errors). HDF5 seems to run into problems that need to be investigated: is it an issue in HDF5 itself, or an issue in the way DiscoSnp uses HDF5?
It doesn't look like a full disk, because there is plenty of free space (disk_current_dir : 157396.8 => 157 GB) relative to the amount of data to write into the output HDF5 file (kmers_nb_solid : 2143612212).
Note however the following traces:
max_file_nb : 32768
nb_partitions : 880
The first line tells how many files can be open at the same time. This number is used to compute "nb_partitions": since "max_file_nb" is huge here (a more typical value is 1024), "nb_partitions" is huge as well, and I don't think we have ever tried such high values.
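If you want to see the open-file limits your operating system currently reports (which is presumably where this max_file_nb value comes from), something like this should work in a shell on the machine that runs the job:
ulimit -Sn    # soft limit on the number of open files per process
ulimit -Hn    # hard limit on the number of open files per process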
For now, in order to understand the issue, I would suggest two things:
- Try to limit the max_file_nb value. Since this limit is set by the operating system, you may need administrator rights to change it system-wide, but I think the ulimit shell command can lower it (to 1024, for instance) for your own session; a small shell sketch is given after the second suggestion below.
- Try to limit the disk usage by using the -max-disk parameter of the dbgh5 command. I'm not sure that DiscoSnp++ knows about this option, so you should first try to type something like:
/home/cmb-02/sn1/tkitapci/software/DiscoSNP++-2.2.0-Source/build//ext/gatb-core/bin/dbgh5 -in buffalo_fof.txt_removemeplease -out /staging/sn1/tkitapci/NOHA/buffalo_variant_call/Buffalo_k_31_c_auto -kmer-size 31 -abundance-min auto -abundance-max 2147483647 -solidity-kind one -max-disk 50000
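For the first suggestion, a minimal sketch (assuming a bash shell; this only lowers the soft limit for the current shell session and for the processes launched from it):
ulimit -n 1024
# then relaunch DiscoSnp++ (or dbgh5) from this same shell so it inherits the lower limit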
With the second suggestion, you should get a lower value for nb_partitions and potentially a bigger value for nb_passes. If dbgh5 succeeds with this parameter, we will still have to understand the actual issue.
Can you tell us whether either of the two suggestions works, and provide the output as you did before?
I tried forcing the "max_file_nb" value to 32768 myself and testing on some reads, but I got no problem.
By the way, do you get exactly the same error if you relaunch your command?
Why do you think it is a DSK problem, Erwan? It seems that the DSK step completes fine and that the HDF5 errors appear during the cascading step. This line in particular is suspicious:
157 GB of free space seems low for an analysis with 2 billion kmers. Could you try freeing more space (around 300-400 GB free in total)?
Hi Rayan, you're right: it was misleading to talk about DSK; I meant the de Bruijn graph construction in general.
Normally, the size of the DSK contribution in the final HDF5 file is about 16*nbSolidKmers bytes, so in this case the DSK contribution is about 32 GB, less than the 157 GB of available disk space. It means that the steps after DSK should, in theory, have 157 - 32 = 125 GB available, which should be enough.
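For the record, with kmers_nb_solid = 2143612212, that is 16 * 2143612212 ≈ 34.3 * 10^9 bytes, i.e. roughly 32 GiB.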
The strange part is that the issue occurs just "between" two steps of dbgh5 (debloom and branching); a lack of disk space during an HDF5 write operation should show up in the middle of a dbgh5 step, not just between two of them. The first HDF5 error, "H5Gclose(): unable to close group", also seems to indicate that the debloom step tries to correctly release the resources it used (including HDF5 resources) but that something then goes wrong.
@tkitapci, can you tell how much disk space is left after the issue occurred? As Rayan suggests, you can also try to free some disk space and relaunch the command.
Hi Erwan, also, I thought that the "bytes actually written" value was a red flag, but it is actually "-1" in 64-bit representation (a -1 stored in an unsigned 64-bit field prints as a huge positive number), which is the value it is supposed to be when a write fails.
Hi,
Thanks for the reply. I re-ran the command on a machine with 128 GB of memory and got the same error (or a similar one):
https://docs.google.com/document/d/18tejd1ems_CJXhnzhij9uanToJFMfe1YsDescDB3_y4/edit?usp=sharing
I am running this on our cluster; I will see how I can free more space or tell the program to use a separate disk.
One more question: how can I check where this "disk_current_dir : 157396.8 => 157 GB" is located? In the folder from which I ran the command there is more than 10 TB of free space. The program must be writing these files somewhere else (I don't know where that 157 GB of free disk came from); maybe this is the default directory where temporary files are written?
Thanks a lot
The 157 GB should correspond to the free space of the directory from which the dbgh5 command is launched, so there is something odd if you checked that this directory has 10 TB of free space.
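To see how much free space the system reports for that directory, something like this should do (run from the directory where you launch the command):
df -h .    # free space on the file system holding the current directory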
By default, all temporary files will be created in this directory. It is possible to force dbgh5 to use a specific directory for temporary files (option -out-tmp X). So, you could try the following line, where XXX is a directory that has a lot of free disk space.
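Something like this, for instance (your earlier dbgh5 command, with -out-tmp added):
/home/cmb-02/sn1/tkitapci/software/DiscoSNP++-2.2.0-Source/build//ext/gatb-core/bin/dbgh5 -in buffalo_fof.txt_removemeplease -out /staging/sn1/tkitapci/NOHA/buffalo_variant_call/Buffalo_k_31_c_auto -kmer-size 31 -abundance-min auto -abundance-max 2147483647 -solidity-kind one -max-disk 50000 -out-tmp XXX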
Once you have launched the command, you could check where the temporary files are actually written (they look like trashme_PID_dsk_partitions.parts, where PID is the process id).
Thanks for the reply. In my case, all the trashme_* files are created in the directory that I specify with the -p option. There is about 150 TB of free space on that disk, so space is clearly not an issue. I think there was some sort of file system problem causing this error (which may be related to the number of files allowed in a directory). I changed my output directory to another disk and so far it is running fine. Thanks!
I have solved the problem. It was likely related to the file system I was using, because of the large number of files being opened at the same time. I changed my output to a different file system and now it runs fine.
Thanks
Hamdi