I'm running the standard cleaning/sorting Picard functions on a set of BAM files, and one of the output files is suspiciously small compared to its input. The command in question is:
java -jar SortSam.jar SORT_ORDER=coordinate INPUT=myfile.bam OUTPUT=myfile_sort.bam
I re-ran just the SortSam step of my pipeline to see why I was getting such a small file compared to the 20 other BAM files. One thing I noticed is that SortSam's reading phase would run for some time and then print the console message
INFO 2024-04-02 16:01:07 SortSam Finished reading inputs, merging and writing to output now.
At that point, myfile_sort.bam would be generated. Immediately afterwards, however, the reading phase would start again and myfile_sort.bam would shrink back to approximately 0 MB. SortSam kept generating these "temporary" myfile_sort.bam files for 2 or 3 iterations before finally terminating and writing a final version.
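One way to watch this happening, assuming GNU watch is available, is to poll the output file's size while SortSam runs:
watch -n 5 'ls -lh myfile_sort.bam'    # re-lists the file every 5 seconds
The size climbs during each write phase and resets when the next iteration starts.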
My question is this: each of these BAM files was created by merging files from different runs of the same sample. Could it be that, for some strange reason (anomalous headers), SortSam is treating this particular BAM file as three separate entities and overwriting the output each time, or does SortSam always generate preliminary sorted files like this? Due to the long run time, I didn't check whether the sorted BAM files whose sizes more closely matched their inputs went through the same iterations.
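If you want to rule out the anomalous-header theory yourself, one quick check (assuming samtools is installed; the filename is from the question) is to dump the merged header:
samtools view -H myfile.bam | grep -E '^@(HD|RG|PG)'    # print the header lines of interest
Multiple @RG lines are expected after merging runs of the same sample, but the SAM spec allows at most one @HD line, so a duplicated @HD or a malformed SO: tag would point to a genuinely broken header.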
Thanks - evidently the temporary files are given the same name as the final file.
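For anyone who hits the same thing: a cheap way to confirm that the file left on disk is the complete final output, not one of the intermediate overwrites, is to compare record counts between input and output (a minimal sketch, assuming samtools is available):
samtools view -c myfile.bam         # record count in the input
samtools view -c myfile_sort.bam    # record count in the sorted output
The two counts should match exactly, since sorting neither adds nor removes records.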
Unfortunately, I'm more or less stuck using outdated software packages and versions for the time being. The lab I'm working in generated a large number of sequenced genomes nearly a decade ago, and to avoid artifactual differences introduced by different mapping/genotyping tools when comparing across genomes, I'm using the decade-old pipeline for consistency. That is why I post questions about bwa aln, Stampy, older versions of Picard and GATK, etc.
Are you running the 20 sorts simultaneously? You could try assigning a separate
TMP_DIR
(the old-style Picard equivalent of --tmp-dir in the newer CLI) to each job and see if that mitigates the issue of concurrent jobs clobbering each other's temporary files.
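For example, a minimal sketch of launching the sorts with per-job scratch directories, using the old-style invocation from the question (the loop, output naming, and mktemp usage here are illustrative, not part of the original pipeline):
for f in *.bam; do
  tmp=$(mktemp -d sorttmp.XXXXXX)    # unique scratch directory for this job
  java -jar SortSam.jar SORT_ORDER=coordinate INPUT="$f" OUTPUT="${f%.bam}_sort.bam" TMP_DIR="$tmp" &
done
wait    # block until all background sorts finish
Each SortSam instance then spills its intermediate sorted chunks into its own directory, so no two jobs can touch the same temporary files.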