Question

Augustus Training Time

12.5 years ago

Daniel Standage 4.1k

I am working on assembling and annotating the genome of a non-model organism, and I have a set of about 3k genes from this genome that I am using to train my ab initio gene predictors. For Augustus, I am following the training procedure documented on this page. I converted the data to GenBank format and split the data into a training set and a test set, each containing 1.5k annotated sequences. After making the appropriate parameter/config files for this species, I launched the optimize_augustus.pl script with the 1.5k training sequences.

The page includes the caveat that this script likely has to run overnight. However, it has been going for over 2 days now and shows no sign of stopping. I'm guessing this is taking so long because of the number of training sequences I have: the documentation recommends about 200 genes, whereas I have nearly 10 times that. Is this intuition correct? What runtimes have you had when training Augustus?
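If the size of the training set does turn out to be the bottleneck, one way to test that guess is to run the optimization on a random subset closer to the recommended ~200 genes. The following is only a minimal sketch in Python, assuming the training data is a multi-record GenBank flat file in which every record ends with a line containing only `//`. The function names and file names are hypothetical and are not part of Augustus.

```python
import random


def split_genbank_records(text):
    """Split multi-record GenBank flat-file text into individual records.

    A record is everything up to and including a line that contains only '//'.
    Anything after the last '//' (e.g. trailing blank lines) is dropped.
    """
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "//":
            records.append("".join(current))
            current = []
    return records


def subsample_records(records, n, seed=0):
    """Return n randomly chosen records, or all of them if there are at most n."""
    if len(records) <= n:
        return list(records)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(records, n)


def subsample_file(in_path, out_path, n, seed=0):
    """Write a random subset of n GenBank records from in_path to out_path."""
    with open(in_path) as handle:
        records = split_genbank_records(handle.read())
    with open(out_path, "w") as handle:
        handle.write("".join(subsample_records(records, n, seed)))


# Hypothetical usage (file names are placeholders):
# subsample_file("train.gb", "train.200.gb", 200)
```

The smaller file could then be passed to optimize_augustus.pl in place of the full training set. Whether this shortens the runtime enough, and how it affects prediction accuracy, would need to be checked against the held-out test set.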


Question by Daniel Standage • written 12.5 years ago • updated 2.8 years ago by Ram

Hi,

I am working on the Oryza sativa genome, which has a genome size of around 380 Mb. I started Augustus retraining three weeks ago and it is still running.

Could you please let me know how long it is likely to run?

If possible, please suggest a multi-threading option for the training step so that it finishes sooner.

Thanks

Amrinder

Comment by aaarsh88 • written 10.3 years ago • updated 3.0 years ago by Ram

Hello, I have the same problem: my training has been running for about 2 weeks. Did you solve yours?

Comment by zy041225 • written 10.1 years ago • updated 2.8 years ago by Ram

As this is a separate question, it should have been posted as a new thread.

The only way to speed things up is to configure MAKER to use MPI. It takes me about 6 days on 16 processors to finish one round on a vertebrate genome of ~2 Gb in ~150,000 scaffolds, with protein evidence.

Comment by mtollis • written 9.7 years ago • updated 2.8 years ago by Ram

Dear all,

I'm running optimize_augustus.pl with a training set of 1000 genes and the parameter -cpus=20, on a 650 Mb genome, for 5 rounds (the default). One week has passed. All Augustus processes have stopped except one, which is still running with no sign of stopping, and the nohup output file is no longer gaining any new information.

What happened with your run in the end? Could you share your experience here? Thanks a lot.

Sincerely,
Kang

Comment by dukecomeback • written 8.0 years ago • updated 2.8 years ago by Ram

Ram · Answer 1 · 2012-06-07

Try the autoAug.pl script that comes with Augustus 2.5.5, in the scripts directory:

autoAug.pl --singleCPU --useexisting --genome=genome.fasta --species=speciesname --cdna=EST.fasta --trainingset=genome.gff3

We get the genome.gff3 training set from the output of a first-pass run of MAKER using:

EST data (if available, same file as above)
Proteins from related species
A SNAP model trained using CEGMA
A GeneMark model (obtained by running GeneMark.ES on the draft genome)

We then run maker2zff on the MAKER output and convert the result to GFF3. (Carson Holt's scripts are brilliant; this one ensures that only high(er)-quality models are picked up from the prediction set.)

Yes, it takes a while. Two days sounds about right in singleCPU mode for a 100-200 Mb metazoan.

Once done, we run MAKER a second time using the Augustus model and more stringent settings.

Let me know if you need more details on any of these steps.