Question

Tutorial: Parallel AUGUSTUS Execution via GNU Parallel

0

6 weeks ago

Vijith ▴ 90

Hi, fellow bioinformaticians,

I recently ran AUGUSTUS for ab initio gene prediction on my 2.6 GB plant genome using an 8-core server with 244 GB of memory.

However, AUGUSTUS used only one CPU core, so it was slow (about 25 MB of *.gff3 output per day). After reviewing the AUGUSTUS documentation, I couldn't find a parameter to set the number of CPU cores.

To overcome this, I used GNU parallel to run multiple AUGUSTUS instances in parallel by splitting the main FASTA file into chunks corresponding to the number of cores. I've documented my protocol in a tutorial on my page and would love to share it with the community.
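The splitting step can be sketched roughly as below. This is not the script from the tutorial, only a minimal illustration: the function names (`read_fasta`, `split_balanced`) and the demo sequences are made up, and balancing chunks by total sequence length is an assumption about what makes a sensible split. Writing each chunk to its own file is left out.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header line, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_balanced(records: List[Record], n_chunks: int) -> List[List[Record]]:
    """Place records longest-first, each into the chunk with the fewest bases so far."""
    chunks: List[List[Record]] = [[] for _ in range(n_chunks)]
    heap = [(0, i) for i in range(n_chunks)]  # (total bases, chunk index)
    heapq.heapify(heap)
    for rec in sorted(records, key=lambda r: len(r[1]), reverse=True):
        total, i = heapq.heappop(heap)
        chunks[i].append(rec)
        heapq.heappush(heap, (total + len(rec[1]), i))
    return chunks


# Hypothetical four-sequence genome, split for two cores.
demo = ">chr1\nACGTACGT\nACGT\n>chr2\nAC\n>chr3\nACGTAC\n>chr4\nACGT\n"
records = list(read_fasta(demo.splitlines()))
chunks = split_balanced(records, 2)
for i, chunk in enumerate(chunks):
    print(i, [h for h, _ in chunk], sum(len(s) for _, s in chunk))
```

Splitting only at record boundaries keeps every sequence whole, so a gene never straddles two chunks.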

Are there alternative methods to make AUGUSTUS utilize multiple cores? Please share your insights. Link to tutorial: https://lifescienceshub.wixsite.com/lifesciencehub/post/how-to-leverage-gnu-parallel-to-utilize-multiple-cores-while-running-augustus

augustus genome ngs parallel • 644 views

ADD COMMENT • link updated 6 weeks ago by ole.tange ★ 4.5k • written 6 weeks ago by Vijith ▴ 90

score 2 · Accepted Answer · 2024-10-05

2

6 weeks ago

Pierre Lindenbaum 164k

Nice, but you'd be better off learning a workflow manager like Snakemake or Nextflow. See the Nextflow sketch below (not tested):

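workflow {
    // Genome FASTA, passed on the command line as --fasta
    ch0 = Channel.fromPath(params.fasta)
    // Index the FASTA; the .fai lists one contig per line
    ch1 = FAIDX(ch0).output
    // Contig names are the first column of the .fai
    ch2 = ch1.splitCsv(header:false,sep:'\t').map{it[0]}
    // One AUGUSTUS task per (fai, fasta, contig) combination
    ch3 = APPLY_AUGUSTUS(ch1.combine(ch0).combine(ch2))
    // Concatenate the per-contig GFF3 files
    MERGE(ch3.output.collect())
}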


process FAIDX {
input:
    path(fasta)
output:
    tuple path("*.fai"),emit:output
script:
"""
samtools faidx ${fasta}
"""
}

process APPLY_AUGUSTUS {
input:
    tuple path(fai),path(fasta),val(contig)
output:
    path("${contig}.gff3"),emit:output
script:
"""
samtools faidx ${fasta} ${contig} > tmp.fa
augustus --species=maize --progress=true --gff3=on tmp.fa > "${contig}.gff3"
rm tmp.fa
"""
}


process MERGE {
input:
    path(gff3)
output:
    path("final.output.gff3"),emit:output
script:
"""
cat ${gff3} > final.output.gff3
"""
}
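A plain cat in the MERGE step repeats any header line that each per-contig file carries. Assuming every AUGUSTUS output starts with its own ##gff-version pragma, a small merge that keeps only one could look like the sketch below. The function name `merge_gff3` and the demo lines are made up for illustration.

```python
from typing import Iterable, List


def merge_gff3(parts: Iterable[Iterable[str]]) -> List[str]:
    """Concatenate GFF3 line streams, keeping a single ##gff-version line on top."""
    version = None
    body: List[str] = []
    for part in parts:
        for line in part:
            line = line.rstrip("\n")
            if line.startswith("##gff-version"):
                if version is None:
                    version = line  # keep the first pragma only
                continue
            body.append(line)
    return ([version] if version else []) + body


# Hypothetical per-contig outputs.
part1 = ["##gff-version 3", "chr1\tAUGUSTUS\tgene\t1\t100\t.\t+\t.\tID=g1"]
part2 = ["##gff-version 3", "chr2\tAUGUSTUS\tgene\t5\t80\t.\t-\t.\tID=g1"]
merged = merge_gff3([part1, part2])
print("\n".join(merged))
```

Note that separate runs may assign the same gene IDs (both demo lines use g1). This sketch does not renumber them, so check whether your downstream tools need unique IDs.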

ADD COMMENT • link 6 weeks ago by Pierre Lindenbaum 164k

0

Thank you so much, Dr. Lindenbaum, for the valuable comment. I'm not very experienced with Nextflow, but I'd like to test this out. Can you provide some details about this code, or any resources to help me understand it?

ADD REPLY • link 6 weeks ago by Vijith ▴ 90

2

Start here: https://www.nextflow.io/

ADD REPLY • link 6 weeks ago by GenoMax 147k

score 2 · Accepted Answer · 2024-10-07

Instead of writing a Python script for the splitting, you can use --block -1 --pipepart --cat --recend "\n" --recstart ">":

parallel --block -1 -a big.fasta --pipepart --cat --recend "\n" --recstart ">" augustus [...] {}

This will automatically split the fasta file into 1 chunk per CPU thread. It will save the chunks into temporary files before calling augustus.
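To show what record-aligned splitting means, here is a rough Python sketch of the idea: pick evenly spaced byte offsets, then move each cut forward to the next place where a record end ("\n") is followed by a record start (">"). This is only an illustration of the concept, not how GNU parallel is implemented, and the function names and demo data are made up.

```python
from typing import List


def record_aligned_cuts(data: bytes, n_chunks: int,
                        recstart: bytes = b">", recend: bytes = b"\n") -> List[int]:
    """Return chunk start offsets; every cut lands where recend meets recstart."""
    sep = recend + recstart
    target = max(1, len(data) // n_chunks)
    cuts = [0]
    for k in range(1, n_chunks):
        # Search from the ideal offset, but never before the previous cut.
        start = max(k * target - len(recend), cuts[-1])
        pos = data.find(sep, start)
        if pos == -1:
            break  # no further record boundary: fewer chunks than requested
        cuts.append(pos + len(recend))  # next chunk begins at recstart
    return cuts


def split_bytes(data: bytes, n_chunks: int) -> List[bytes]:
    """Slice data at the record-aligned cut offsets."""
    cuts = record_aligned_cuts(data, n_chunks) + [len(data)]
    return [data[a:b] for a, b in zip(cuts, cuts[1:])]


# Hypothetical FASTA content.
fasta = b">a\nACGT\n>b\nAC\n>c\nACGTACGT\n>d\nA\n"
parts = split_bytes(fasta, 4)
for p in parts:
    print(p)
```

Every chunk starts with ">" and the chunks join back to the original bytes, so each one is a valid FASTA file on its own.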

If augustus can read from stdin (e.g. by: augustus -) you can bypass generating the temporary files:

parallel --block -1 -a big.fasta --pipepart --recend "\n" --recstart ">" augustus [...] -

If augustus runtime varies a lot between chunks, it might make sense to split big.fasta into more chunks, say, 3 per CPU thread: --block -3. This way, if a single chunk takes forever, the other CPU threads will pick up the remaining chunks.
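The effect of using more chunks than threads can be checked with a toy simulation. The sketch below assumes a simple model in which each free worker takes the next chunk in order. The runtimes are invented numbers (in minutes), not measurements from AUGUSTUS.

```python
import heapq
from typing import List


def makespan(runtimes: List[int], n_workers: int) -> int:
    """Total wall time when each free worker picks up the next job in order."""
    workers = [0] * n_workers  # time at which each worker becomes free
    heapq.heapify(workers)
    for t in runtimes:
        free_at = heapq.heappop(workers)  # earliest available worker
        heapq.heappush(workers, free_at + t)
    return max(workers)


# Hypothetical runtimes for 2 threads: one slow chunk and one fast chunk ...
coarse = [100, 20]
# ... versus the same total work cut into 3 chunks per thread.
fine = [60, 20, 20, 10, 5, 5]

print(makespan(coarse, 2), makespan(fine, 2))
```

With these made-up numbers the finish time drops from 100 to 60 minutes, because the second thread keeps working on small chunks while the first is busy with the slow one.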