The short answer is that unless you know a tool very well, there's no reliable way to know ahead of time.
From a memory perspective, you can probably assume that even a fairly poorly optimised script is unlikely to use more than about 3-5x the size of the input dataset, even if it holds the whole thing in memory at once.
Storage is similar, but really this will just be whatever your input and output files are. It's less common for tools to write intermediate data to disk unless there's some sort of database or similar structure they just can't hold in RAM. Disk is so plentiful now, though, that I'd be surprised if this ever becomes much of a concern.
CPUs are a little easier, since the number of threads is typically something you set, rather than the tool. Many tasks are not well suited to multithreading (or there aren't tools built to readily do it), so it's less common that you'll come across a task/workflow that is really and truly reaping benefit from much beyond 15-20 cores, if that.
As for knowing what will be the most efficient, this is even harder to answer, because it heavily depends on how the tool was coded. You will just have to run some toy datasets with different parameters and see what works.
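If it helps, here's a minimal sketch of that kind of trial run in Python: it reruns the same small job at a few thread counts and lets GNU time report wall time and peak memory for each. The tool name, its `--threads` flag and the toy input file are placeholders, not any particular tool, and it assumes GNU time is available at /usr/bin/time (typical on Linux).

```python
#!/usr/bin/env python3
"""Toy benchmark: rerun the same small job at a few thread counts and
compare wall time and peak memory.

Sketch only: 'mytool', its '--threads' flag and the toy input file are
placeholders for whatever you are actually testing. Assumes GNU time is
installed at /usr/bin/time (typical on Linux)."""
import subprocess

TOY_INPUT = "toy_subset.fastq"  # a small slice of the real dataset (placeholder)

for threads in (1, 4, 8, 16):
    print(f"--- {threads} thread(s) ---", flush=True)
    # GNU time -v prints wall-clock time and 'Maximum resident set size'
    # (peak RAM, in kB) for the wrapped command to stderr when it finishes.
    subprocess.run(
        ["/usr/bin/time", "-v", "mytool", "--threads", str(threads), TOY_INPUT],
        check=True,
    )
```

If the wall time stops improving between, say, 8 and 16 threads, that's a good sign the tool won't benefit from more cores on the full dataset either.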
Thank you!
And when I am running my script, how can I check how many CPUs it is actually using?
You can't know exactly, as far as I'm aware. You can use a tool like `htop`, which will show you how much load the processor cores are under, but that's all usage, not just your script. You can also view how many processes are being run for that task, and that will roughly correlate with the number of cores in use, but many multithreading approaches don't actually 'pin' a process to a core, and processes can move around depending on what the queue for different cores looks like. You can see this information in `htop` too, but you can also use `ps` and other similar tools.
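If you want numbers scoped to just your job rather than the whole machine, a short script can pull them per PID. Here's a minimal sketch using the third-party psutil package (an assumption on my part, not something your tool ships with):

```python
#!/usr/bin/env python3
"""Rough check of how much CPU and RAM one running job is using.

Sketch only: assumes the third-party 'psutil' package is installed
('pip install psutil'); pass the PID of your script/tool, e.g. the one
ps or htop reports."""
import sys

import psutil

pid = int(sys.argv[1])
proc = psutil.Process(pid)

# cpu_percent over a 1-second window: ~100 per fully busy core,
# so ~400 means the job is keeping roughly four cores busy.
# (If the tool spawns worker processes, you'd also want to sum over
# proc.children(recursive=True).)
busy = proc.cpu_percent(interval=1.0)

print(f"{proc.name()} (pid {pid})")
print(f"  threads:      {proc.num_threads()}")
print(f"  CPU usage:    {busy:.0f}%  (~{busy / 100:.1f} cores)")
print(f"  resident RAM: {proc.memory_info().rss / 1e9:.2f} GB")
```

The script name and PID are whatever you choose to pass in, and the numbers are just a snapshot, so it's worth checking a few times while the job is in its heavy phase.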
Thank you so much for the help!