I recently ran AUGUSTUS for ab initio gene prediction on my 2.6GB plant genome using an 8-core server with 244GB memory.
However, AUGUSTUS utilized only one CPU core, resulting in slow performance (~25MB *.gff3 output per day). After reviewing the AUGUSTUS documentation, I couldn't find a parameter to set the number of CPU cores.
To overcome this, I used GNU parallel to run multiple AUGUSTUS instances in parallel by splitting the main FASTA file into chunks corresponding to the number of cores. I've documented my protocol in a tutorial on my page and would love to share it with the community.
Thank you so much, Dr. Lindenbaum, for the valuable comment. I'm not quite experienced in nextflow, but I like to test this out. Can you provide some details about this code, or any resources to understand it?
Instead of making a python script for splitting, you can use --block -1 --pipe-part --cat --recend "\n" --recstart ">":
parallel --block -1 -a big.fasta --pipepart --cat --recend "\n" --recstart ">" augustus [...] {}
This will automatically split the fasta file into 1 chunk per CPU thread. It will save the chunks into temporary files before calling augustus.
If augustus can read from stdin (e.g. by: augustus -) you can bypass generating the temporary files:
parallel --block -1 -a big.fasta --pipepart --recend "\n" --recstart ">" augustus [...] -
If augustus has very varying runtime, it might make sense to split big.fasta into more chunks, say, 3 per CPU thread: --block -3 This if a single chunk takes forever, then the other CPU threads will pick up the other chunks.
Thank you so much, Dr. Lindenbaum, for the valuable comment. I'm not quite experienced in nextflow, but I like to test this out. Can you provide some details about this code, or any resources to understand it?
Start here: https://www.nextflow.io/