The bash script below uses samtools with GNU parallel to convert all the gzipped SAM files in a directory to BAM, sort them, and (I think) index them. However, it seems to run for a very long time and I am not sure whether anything actually gets indexed. Is there a better, more efficient way? Without the .gz it is much faster. I am using samtools 1.9. Thank you :).
logfile=/path/to/fastq/process.log
dir=/path/to/fastq/
cd "$dir"
# count the SAM files to process
x=$(ls -dq *.sam* | wc -l)
echo "Starting conversion of $x sam files on $(date)" >> "$logfile"
# convert each SAM to BAM and sort it; samtools >= 1.3 needs -o for the
# output file (the old "samtools sort - prefix" form no longer works),
# and -S is ignored by samtools 1.x. Note there is no samtools index
# step anywhere here, so no .bai files are produced.
ls *.sam | parallel "samtools view -b {} | samtools sort -o {.}.bam -"
echo "Conversion of $x sam files complete and converted to sorted bam on $(date)" >> "$logfile"
It is unlikely that there is a much faster way (you are already using parallel and all the cores you have access to locally?), unless you have access to a large cluster with hundreds of CPUs and a really high-performance file system where you can start all jobs at the same time.

samtools sort accepts SAM as input and can output BAM format, so there is no need to run samtools view first. But I don't know whether that step is what is costing you time.
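For example, a minimal sketch of the same loop without the samtools view step (the {.}.bam output naming is just my assumption, matching the original script's intent; samtools 1.x detects SAM vs. BAM input automatically):

ls *.sam | parallel "samtools sort -o {.}.bam {}"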
Furthermore, there is the -@ option for using multiple threads. But again, I don't know if you can save a lot of time with this; the bottleneck is the sorting itself.

fin swimmer
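As an illustration of the -@ suggestion combined with parallel (the job and thread counts below are only an example; the total number of threads in flight is roughly jobs × threads per job, so cap the jobs accordingly):

# 4 concurrent jobs, 2 sort threads each: about 8 cores busy
ls *.sam | parallel -j 4 "samtools sort -@ 2 -o {.}.bam {}"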
From personal experience, sambamba runs faster. After I made the switch I haven't gone back to benchmark samtools against the latest versions in the past couple of years, so the two tools might perform similarly now.

Instead of ramping up all the available cores to run jobs simultaneously, providing more memory per sort will make it run a lot quicker. For instance, a sort using 8 cores with 32GB of memory (4GB per core) will very likely finish quicker than one using 32 cores with 8GB of total memory. Also, if you have multiple storage options (i.e. network-based vs. instance-store in the cloud), set your temp directory to use the fastest storage.
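A sketch of that setup with sambamba (the file names, thread count, memory limit, and temp path are illustrative; note that sambamba sort expects BAM input, so the SAM is converted first):

# convert SAM to BAM, then sort with a generous memory limit
# and temp files on the fastest available storage
sambamba view -S -f bam -o sample.bam sample.sam
sambamba sort -t 8 -m 32GB --tmpdir=/fast/scratch -o sample.sorted.bam sample.bam

The rough samtools equivalent would be samtools sort -@ 8 -m 4G -T /fast/scratch/prefix; keep in mind that samtools' -m is per thread while sambamba's is an approximate total limit.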