Does running samtools sort in parallel result in unexpected output due to temporary file conflicts?
21 months ago
octpus616 ▴ 120

Hi

I am running samtools sort on my cluster with a preinstalled (possibly outdated) version:

$ samtools --help

Program: samtools (Tools for alignments in the SAM format)
Version: 1.3.1 (using htslib 1.3.1)

Usage:   samtools <command> [options]

I use the following command to sort my BAM files in parallel:

$ bwa mem ref in.fq1 in.fq2 | samtools sort > out.bam

I use Snakemake to manage my pipeline, so several of these sort steps run in parallel.

I noticed an error today:

[bam_sort_core] merging from 154 files...
[E::hts_open_format] fail to open file 'xxxx.NNNN.bam' (No such a file)
[bam_merge_core] fail to open file xxxx.NNNN.bam

It seems that samtools sort reuses temp file names if no temp dir is specified. I wonder whether this could lead to unexpected behavior.

=====update=====

I agree with seidel's point of view, but the reality is that we have already run a huge pipeline without knowing about this issue, and now we must assess the potential harm.
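One way to start that assessment is to scan the per-job logs for the failure messages quoted above. The following is a minimal sketch, assuming each sort job's stderr was captured as text; the function name `find_sort_failures` is hypothetical, and the patterns are copied from the error output in this post. A clean scan only shows that these particular messages were not logged. It does not prove the output BAM is intact.

```python
import re

# Failure markers copied from the error output quoted in the question.
FAILURE_PATTERNS = [
    re.compile(r"\[E::hts_open_format\]"),
    re.compile(r"\[bam_merge_core\] fail to open file"),
]


def find_sort_failures(log_text):
    """Return the log lines that match any known failure marker."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS)
    ]


if __name__ == "__main__":
    example_log = (
        "[bam_sort_core] merging from 154 files...\n"
        "[E::hts_open_format] fail to open file 'xxxx.NNNN.bam' (No such a file)\n"
        "[bam_merge_core] fail to open file xxxx.NNNN.bam\n"
    )
    for hit in find_sort_failures(example_log):
        print(hit)
```

Jobs whose logs contain such lines are candidates for re-running; the others would still need a separate integrity check before they can be trusted.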

There are two conflicting observations here. On the one hand, we have observed this conflict. On the other hand, since samtools 1.3.1 (conda version), temporary files have been named samtools.pid.NNN.tmp.bam. Because the name includes the PID, duplicate file names should be almost impossible. Even so, we have noticed unexpected behavior when running a large number of tasks.

In addition, as mentioned by @jmarshall, a contributor to samtools, it seems that the internal file check will prevent the most destructive results from occurring:

"I don't think we should worry about that. If the file is left behind by a previous aborted sort/merge, it wouldn't be true to say it's opened exclusively by another process."

"'Can't write to temporary file tmp/.0000.bam: File exists' I think will do. Which is not actually what it says now… which reminds me… sorry, I need to look at #467 before we revisit this. Thanks, will look at that and #490 together."

see: https://github.com/samtools/samtools/issues/432

reference:

github: samtools sort clobbers temporary files if "misusing" -T

github: Default samtools sort temp-prefix leads to data corruption when reading from stdin

NGS samtools bam • 2.1k views

I don't think sort will run in parallel unless you specify multiple threads with the -@ option. If you have the disk space available, you could simply let the temp files be written to the current directory. They are cleaned up once the sort is complete.

This may be fixed in the current release. Admittedly you are running an old version of samtools.


As in @seidel's reply: when Snakemake runs my jobs in parallel, many samtools tasks may run at the same time with the same temp dir.


As in @seidel's reply: when you use Snakemake to manage your workflow, it starts samtools sort in parallel (that is, it may start independent samtools sort tasks) with shared temp names.


I had a problem like this with Snakemake running my jobs in parallel. Because sort was being invoked multiple times at once, the temp files sometimes shared names, which would mess everything up. I used a little Python lambda snippet in Snakemake to always invoke sort with a unique random alphanumeric prefix, since samtools sort has an option for this (-T PREFIX). (In the end I think it had to do with some demultiplexing step where intermediate files might share a name if run in parallel but not in serial, and I realized I could write my pipeline so that no two files would ever share the same name no matter how it was called, but I don't recall exactly.) Anyway, assigning a random prefix during the sort step saved me at some point, and seemed like a good, easy idea to implement.
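The random-prefix idea can be sketched in plain Python. This is not the original lambda from the pipeline described above, which was not posted; the helper names `unique_sort_prefix` and `sort_command`, the `tmp_dir` and `sample` arguments, and the choice of a 16-character hex token are all assumptions for illustration. Only the samtools options `-T` (temp prefix) and `-o` (output file), and `-` for standard input, are taken from samtools itself.

```python
import os
import secrets
import shlex


def unique_sort_prefix(tmp_dir, sample):
    """Build a temp-file prefix that differs on every call."""
    token = secrets.token_hex(8)  # 16 random hex characters
    return os.path.join(tmp_dir, f"{sample}.{token}")


def sort_command(out_bam, tmp_dir, sample):
    """Return a shell-safe 'samtools sort' command reading from stdin."""
    prefix = unique_sort_prefix(tmp_dir, sample)
    args = ["samtools", "sort", "-T", prefix, "-o", out_bam, "-"]
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Two calls for the same sample yield different prefixes.
    print(sort_command("out.bam", "/tmp", "sampleA"))
    print(sort_command("out.bam", "/tmp", "sampleA"))
```

In a workflow manager, a function like `unique_sort_prefix` could be evaluated once per job and its result passed to the sort command, so that two concurrent jobs never write temp files under the same prefix.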


Yes, I agree with your point of view, but the reality is that we have already run a huge pipeline without knowing about this issue, and now we must assess the potential harm.

There are two conflicting observations here. On the one hand, we have observed this conflict. On the other hand, since samtools 1.3.1 (conda version), temporary files have been named samtools.pid.NNN.tmp.bam. Because the name includes the PID, duplicate file names should be almost impossible. Even so, we have noticed unexpected behavior when running a large number of tasks.

In addition, as mentioned by @jmarshall, a contributor to samtools, it seems that the internal file check will prevent the most destructive results from occurring:

"I don't think we should worry about that. If the file is left behind by a previous aborted sort/merge, it wouldn't be true to say it's opened exclusively by another process."

"'Can't write to temporary file tmp/.0000.bam: File exists' I think will do. Which is not actually what it says now… which reminds me… sorry, I need to look at #467 before we revisit this. Thanks, will look at that and #490 together."

see: https://github.com/samtools/samtools/issues/432

21 months ago
ATpoint 86k

First of all, update samtools. Your version is literally ancient (April 2016), and there is no point debugging deprecated software. Use of a container (docker/podman, singularity/apptainer) is preferred for full reproducibility. This bwa/samtools pipe is among the most frequently used commands in bioinformatics, so let's see whether a software update and cleaning up the remains of a previous run solve it. Using -T prefix makes sense, but how is it with Snakemake: doesn't it use isolated work/run directories for separate jobs, as Nextflow does, to avoid any name collisions?
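Whatever the workflow manager does by default, per-job isolation can also be arranged by hand. Below is a rough sketch, assuming a Python wrapper around the sort step; the function name `sort_in_private_tmpdir`, its arguments, and the injectable `runner` parameter are hypothetical. Each call creates its own temporary directory, points `-T` into it, and removes the directory afterwards.

```python
import os
import subprocess
import tempfile


def sort_in_private_tmpdir(in_bam, out_bam, parent_tmp=None, runner=subprocess.run):
    """Run 'samtools sort' with temp files confined to a per-call directory.

    'runner' defaults to subprocess.run; it is a parameter only so the
    command can be inspected or replaced without samtools installed.
    """
    # The directory name is unique per call and is deleted on exit,
    # including any temp files a failed sort left behind in it.
    with tempfile.TemporaryDirectory(dir=parent_tmp) as job_tmp:
        prefix = os.path.join(job_tmp, "sort")
        cmd = ["samtools", "sort", "-T", prefix, "-o", out_bam, in_bam]
        runner(cmd, check=True)
    return cmd
```

Because the directory is removed when the block exits, this also covers the cleanup of leftovers from an aborted run, at least for temp files written under that prefix.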


^^^ THIS!

The answer to your question though is basically:

No, samtools does not have issues with clashing temporary filenames, even with threading and multiple jobs running at the same time. However your ancient samtools 1.3 almost certainly will do as it's just too old.

Save yourself some pain (and CPU) and upgrade.
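To catch an outdated install before a large run, a pipeline could check the version banner up front. This is a small sketch that parses the "Version:" line in the format shown in the question's `samtools --help` output; the function names and the minimum version passed in are placeholders, not a recommendation from this thread.

```python
import re


def parse_samtools_version(banner_text):
    """Extract the version from a 'Version: X.Y[.Z] (...)' banner line."""
    match = re.search(r"Version:\s*(\d+(?:\.\d+)*)", banner_text)
    if match is None:
        raise ValueError("no 'Version:' line found in banner text")
    return tuple(int(part) for part in match.group(1).split("."))


def is_older_than(banner_text, minimum):
    """Return True if the banner's version is below 'minimum' (a tuple)."""
    return parse_samtools_version(banner_text) < tuple(minimum)


if __name__ == "__main__":
    banner = (
        "Program: samtools (Tools for alignments in the SAM format)\n"
        "Version: 1.3.1 (using htslib 1.3.1)\n"
    )
    print(parse_samtools_version(banner))
    # (1, 10) is an arbitrary placeholder threshold.
    print(is_older_than(banner, (1, 10)))
```

Comparing integer tuples rather than strings matters here: as text, "1.3.1" would sort after "1.10", which is the wrong order.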
