Hi,
I am running samtools sort on my cluster with a preinstalled (possibly outdated) version:
$ samtools --help
Program: samtools (Tools for alignments in the SAM format)
Version: 1.3.1 (using htslib 1.3.1)
Usage: samtools <command> [options]
I use the following command to sort my BAM file:
$ bwa mem ref in.fq1 in.fq2 | samtools sort > out.bam
I use snakemake to manage my pipeline, so many of these sort jobs run in parallel.
I noticed an error today:
[bam_sort_core] merging from 154 files...
[E::hts_open_format] fail to open file 'xxxx.NNNN.bam' (No such a file)
[bam_merge_core] fail to open file xxxx.NNNN.bam
It seems that samtools sort will reuse temp file names when no temp directory is specified. I wonder whether this could lead to unexpected behavior.
=====update=====
I agree with seidel's point of view, but we have already run a large pipeline without knowing about this issue, and now we need to assess the potential harm.
There are two conflicting observations. On the one hand, we have seen this conflict in practice. On the other hand, since samtools 1.3.1 (conda version), temp files are named samtools.pid.NNN.tmp.bam; because the name includes the pid, duplicate file names should be almost impossible. Nevertheless, we have noticed unexpected behavior when running a large number of tasks.
In addition, as mentioned by @jmarshall, a contributor to samtools, the internal file checking mechanism seems to prevent the most destructive results from occurring:
I don't think we should worry about that. If the file is left behind by a previous aborted sort/merge, it wouldn't be true to say it's opened exclusively by another process.
"Can't write to temporary file tmp/.0000.bam: File exists" I think will do. Which is not actually what it says now… which reminds me… sorry, I need to look at #467 before we revisit this. Thanks, will look at that and #490 together.
see: https://github.com/samtools/samtools/issues/432
reference:
github: samtools sort clobbers temporary files if "misusing" -T
github: Default samtools sort temp-prefix leads to data corruption when reading from stdin
I don't think sort will run in parallel unless you specify multiple threads with the -@ option. If you have the disk space available, you could simply let the temp files be written to the current directory; they are cleaned up once the sort is complete. This may be fixed in the current release. Admittedly, you are running an old version of samtools.
Following @seidel's reply: when snakemake runs my jobs in parallel, many samtools tasks may run at the same time with the same temp dir.
Following @seidel's reply: when you use snakemake to manage your workflow, it will start samtools sort in parallel (that is, it may start independent samtools sort tasks) with shared temp names.
I had a problem like this with snakemake running my jobs in parallel. Since sort was being invoked multiple times (and thus in parallel), the temp files sometimes had shared names, which would mess everything up. I used a little Python lambda snippet in snakemake to always invoke sort with a unique random alphanumeric prefix, since this is a samtools sort option (-T PREFIX). (In the end I think it had to do with some demultiplexing step where intermediate files might share a name if run in parallel, but not in serial, and I realized I could write my pipeline so that no two files would ever share the same name no matter how it was called, but I don't recall.) Anyway, assigning a random prefix during the sort step saved me at some point, and seemed like a good, easy idea to implement.
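The random-prefix idea described above could look roughly like the following. This is only a sketch, not the original snippet: the helper names (unique_sort_prefix, sort_command), the tmp directory, and the sample label are made up for illustration. The only samtools options it relies on are -T PREFIX and -o FILE, with "-" meaning read from stdin.

```python
import os
import shlex
import uuid


def unique_sort_prefix(tmp_dir, sample):
    """Build a temp-file prefix that concurrent jobs are very unlikely to share.

    tmp_dir and sample are hypothetical names chosen for this sketch.
    """
    token = uuid.uuid4().hex[:12]  # random alphanumeric token, new on every call
    return os.path.join(tmp_dir, "{}.{}".format(sample, token))


def sort_command(out_bam, tmp_dir="tmp", sample="sample"):
    """Return a 'samtools sort' command line that reads from stdin ('-')
    and writes its temp files under a unique -T prefix."""
    prefix = unique_sort_prefix(tmp_dir, sample)
    return "samtools sort -T {} -o {} -".format(
        shlex.quote(prefix), shlex.quote(out_bam)
    )


if __name__ == "__main__":
    # The result can be placed after 'bwa mem ... |' in the pipeline.
    print(sort_command("out.bam", tmp_dir="tmp", sample="sampleA"))
```

In a Snakemake rule, something like this could be called from a params function so that every job gets its own prefix. Putting the sample name into the prefix also makes leftover temp files easier to trace back to a job.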
Yes, I agree with your point of view, but we have already run a large pipeline without knowing about this issue, and now we need to assess the potential harm.
There are two conflicting observations. On the one hand, we have seen this conflict in practice. On the other hand, since samtools 1.3.1 (conda version), temp files are named samtools.pid.NNN.tmp.bam; because the name includes the pid, duplicate file names should be almost impossible. Nevertheless, we have noticed unexpected behavior when running a large number of tasks.
In addition, as mentioned by @jmarshall, a contributor to samtools, the internal file checking mechanism seems to prevent the most destructive results from occurring.
see: https://github.com/samtools/samtools/issues/432
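As a first step in assessing the damage, one option is to scan the per-job logs for the error lines quoted in the question. The sketch below assumes each job's stderr was saved as text and is passed in as a dict of job name to log text; that layout and the function names are assumptions for illustration. It only finds jobs that failed loudly, so a clean log does not by itself prove that the output BAM is correct.

```python
import re

# Error markers copied from the log lines quoted in the question.
ERROR_PATTERNS = [
    re.compile(r"\[E::hts_open_format\]"),
    re.compile(r"\[bam_merge_core\] fail to open file"),
]


def log_has_sort_failure(log_text):
    """Return True if any line of the log contains one of the error markers."""
    return any(
        pattern.search(line)
        for line in log_text.splitlines()
        for pattern in ERROR_PATTERNS
    )


def failed_jobs(logs):
    """logs: dict mapping a (hypothetical) job name to its log text.
    Returns the sorted names of jobs whose log shows a merge failure."""
    return sorted(name for name, text in logs.items() if log_has_sort_failure(text))


if __name__ == "__main__":
    ok = "[bam_sort_core] merging from 154 files...\n"
    bad = ok + "[bam_merge_core] fail to open file xxxx.NNNN.bam\n"
    print(failed_jobs({"sampleA": ok, "sampleB": bad}))
```

Jobs flagged this way are the obvious candidates to rerun with an explicit, unique -T prefix.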