3.7 years ago
Palgrave
I am running samtools view with multiple threads specified; however, CPU usage never goes above 100%. My server has more than one CPU, so why isn't it using more of them?
samtools view --threads 20 --reference GRCh37_latest_genomic.fna -O bam -f 4 file1.cram > file1.unmapped.bam
Use the correct option for threads as indicated below. Hopefully you are doing this on a server that has a high-performance file system. Normally, I/O is the limiting factor for operations that use a large number of threads.
For CRAM/BAM transcoding jobs such as this example, CPU utilisation is significant. You're generally only going to be I/O bound if you're streaming from a remote server or multithreading samtools very heavily. That example is almost certainly CPU bound. My guess is that it's running at ~5 MB/s (i.e. ~40 Mbps) and will take several hours to complete on typical human WGS data (hence the question here).
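To put the "several hours" estimate in numbers, here is a rough back-of-the-envelope sketch. The 5 MB/s rate is the guess from the post above; the 100 GB input size and the helper names `est_hours` and `mbytes_to_mbits` are assumptions made up for illustration, and decimal units (1 GB = 1000 MB) are used throughout.

```shell
#!/bin/sh
# Estimate the hours needed to process a file at a sustained throughput.
# Usage: est_hours SIZE_GB RATE_MB_PER_S   (decimal units: 1 GB = 1000 MB)
est_hours() {
    awk -v gb="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", gb * 1000 / rate / 3600 }'
}

# Convert megabytes per second to megabits per second (1 byte = 8 bits).
mbytes_to_mbits() {
    awk -v r="$1" 'BEGIN { printf "%d\n", r * 8 }'
}

mbytes_to_mbits 5    # 5 MB/s expressed in Mbps
est_hours 100 5      # hypothetical 100 GB input at 5 MB/s
```

At that rate a file of this size would need roughly five and a half hours, which is consistent with the "several hours" guess for a single-threaded run.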
See http://www.htslib.org/benchmarks/CRAM.html for a high-level overview of CRAM/BAM encoding/decoding costs.
Using 7 CPUs reduced the time for a 100 GB CRAM file to about 30-45 minutes.
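For comparison, the average throughput implied by that timing can be worked out as below. This is only a sketch: it assumes the 100 GB refers to the input CRAM and uses decimal units, and `throughput_mbs` is a hypothetical helper name.

```shell
#!/bin/sh
# Average throughput in MB/s when SIZE_GB is processed in MINUTES.
# Usage: throughput_mbs SIZE_GB MINUTES   (decimal units: 1 GB = 1000 MB)
throughput_mbs() {
    awk -v gb="$1" -v min="$2" 'BEGIN { printf "%.0f\n", gb * 1000 / (min * 60) }'
}

throughput_mbs 100 30   # fastest reported run
throughput_mbs 100 45   # slowest reported run
```

That works out to somewhere between about 37 and 56 MB/s of input on average across the 7 CPUs.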