Hello everybody,
I think I have a fairly basic question. I just discovered the GNU parallel package and I think my workflow can really benefit from it! I am using a loop which goes through my read files and generates the desired output. The command that is executed for each pair of read files looks something like this:
STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn R1.fq R2.fq
As you can see, I specified 8 threads, which is the number of threads my virtual machine has.
My question is the following: If I use GNU parallel with a command like this:
cat reads| parallel -j 3 STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn {}_R1.fq {}_R2.fq
Can my virtual machine handle the number of threads I specified, if I execute 3 jobs in parallel?
Or do I need 24 threads (3*8 threads) to properly execute this command?
I'm sorry if this is a basic question, I am very new to the field and any help is much appreciated!
You can't use more cores/threads than what your hardware offers. If you are using a virtual machine you are already limited by the resources assigned to it (4 cores/8 threads in this case).
Yeah, don't mix parallel workflows and multithreaded applications unless the number of jobs times the threads per job comes to less than (or equal to) your CPU count. It's a good idea to leave a couple of threads spare though, as the machine will still have background tasks running that you won't want competing with your job.
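As a rough sketch of that rule (jobs x threads per job <= CPU count, minus a little headroom), something like the following could pick the `--runThreadN` value for you. The helper name `threads_per_job` and the choice of one spare thread are my own assumptions for illustration, not anything built into GNU parallel or STAR; `nproc` is assumed to be available (it ships with GNU coreutils).

```shell
#!/usr/bin/env bash
# threads_per_job TOTAL JOBS [SPARE]
# Hypothetical helper: prints how many threads each job may use so that
# JOBS * threads <= TOTAL - SPARE. SPARE defaults to 1 (an assumption).
threads_per_job() {
  local total=$1 jobs=$2 spare=${3:-1}
  local usable=$(( total - spare ))
  # Never hand out fewer than one thread per job. If this branch is hit,
  # the machine is oversubscribed and JOBS should really be lowered.
  if (( usable < jobs )); then
    usable=$jobs
  fi
  echo $(( usable / jobs ))   # integer division rounds down
}

# Logical CPUs visible to this (virtual) machine.
total=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
jobs=3
threads=$(threads_per_job "$total" "$jobs" 1)
echo "CPUs: $total, jobs: $jobs, threads per job: $threads"

# The command from the question would then become, for example:
# cat reads | parallel -j "$jobs" STAR --runThreadN "$threads" \
#   --genomeDir star_index/ --readFilesIn {}_R1.fq {}_R2.fq
```

On the 8-thread machine from the question, 3 jobs with one spare thread works out to 2 threads per job, so 6 threads are busy and 2 are left over.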
If you have 8 cores/16 threads but spawn 3 x 8 = 24 threads in total, you will end up with CPU thrashing: the machine will spend more time switching between queued tasks, and will ultimately run slower than if you had assigned only one or two threads to each process in the first place.
Bear in mind that it is also generally more efficient to run n instances of a single core/thread process than it is to run 1 instance of a process with n threads. This does depend enormously on the program in question and the 'parallelisability' of what you're doing.
I often batch run hhpred analyses via GNU parallel, but hhpred itself can use multiple threads. Typically, I'll tell it to use ~2 threads, and let GNU parallel balance the workloads out over the available cores for all my input files. To a rough approximation, on our 32 core server, I'll have 16 files being analysed concurrently.
Take particular care if you're launching a lot of processes which also need read/write access to the same file. E.g., if I ran 20 scripts at once which all needed access to a particular database file on disk, they would be competing for the I/O of that file too, so it wouldn't necessarily result in that much of a speed-up.