The bash script below uses samtools with GNU parallel to convert all the gzipped SAM files in a directory to BAM, sort them, and (I think) index them. However, it seems to run for a very long time and I am not sure whether anything actually gets indexed. Is there a better, more efficient way? Without the .gz it is much faster. I am using samtools 1.9. Thank you :).
logfile=/path/to/fastq/process.log
dir=/path/to/fastq/
cd "$dir"
# count the SAM files to process
x=$(ls -dq *.sam* | wc -l)
echo "Starting conversion of $x sam files on $(date)" >> "$logfile"
# convert each SAM to BAM and sort it; samtools >= 1.3 needs -o for the
# output file (the old "samtools sort - prefix" form no longer works),
# and -S is ignored by samtools 1.x. Note there is no samtools index
# step anywhere here, so no .bai files are produced.
ls *.sam | parallel "samtools view -b {} | samtools sort -o {.}.bam -"
echo "Conversion of $x sam files complete and converted to sorted bam on $(date)" >> "$logfile"
It is unlikely that there is a much faster way (you are already using parallel and all the cores you have access to locally?), unless you have access to a large cluster with hundreds of CPUs and a really high-performance file system where you can start all jobs at the same time.

samtools sort accepts SAM as input and can output BAM format, so there is no need to run samtools view first. But I don't know whether that step is what is costing you time.
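For example, a minimal sketch of the same loop without the samtools view step (the {.}.bam output naming is just my assumption, matching the original script's intent; samtools 1.x detects SAM vs. BAM input automatically):

ls *.sam | parallel "samtools sort -o {.}.bam {}"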
Furthermore, there is the -@ option for using multiple threads. But again, I don't know if you can save a lot of time with this; the bottleneck is the sorting itself.

fin swimmer
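As an illustration of the -@ suggestion combined with parallel (the job and thread counts below are only an example; the total number of threads in flight is roughly jobs × threads per job, so cap the jobs accordingly):

# 4 concurrent jobs, 2 sort threads each: about 8 cores busy
ls *.sam | parallel -j 4 "samtools sort -@ 2 -o {.}.bam {}"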
From personal experience, sambamba runs faster. After I made the switch I haven't gone back to benchmark samtools against the latest versions in the past couple of years, so the two tools might perform similarly now.

Instead of ramping up all the available cores to run jobs simultaneously, providing more memory per sort will make it run a lot quicker. For instance, a sort using 8 cores with 32GB of memory (4GB per core) will very likely finish quicker than one using 32 cores with 8GB of total memory. Also, if you have multiple storage options (i.e. network-based vs. instance-store in the cloud), set your temp directory to use the fastest storage.
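A sketch of that setup with sambamba (the file names, thread count, memory limit, and temp path are illustrative; note that sambamba sort expects BAM input, so the SAM is converted first):

# convert SAM to BAM, then sort with a generous memory limit
# and temp files on the fastest available storage
sambamba view -S -f bam -o sample.bam sample.sam
sambamba sort -t 8 -m 32GB --tmpdir=/fast/scratch -o sample.sorted.bam sample.bam

The rough samtools equivalent would be samtools sort -@ 8 -m 4G -T /fast/scratch/prefix; keep in mind that samtools' -m is per thread while sambamba's is an approximate total limit.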