Augustus Training Time
12.5 years ago

I am working on assembling and annotating the genome of a non-model organism, and I have a set of about 3k genes from this genome that I am using to train my ab initio gene predictors. For Augustus, I am following the training procedure documented on this page. I converted the data to GenBank format and split the data into a training set and a test set, each containing 1.5k annotated sequences. After making the appropriate parameter/config files for this species, I launched the optimize_augustus.pl script with the 1.5k training sequences.

The page includes the caveat that this script likely has to run overnight. However, it has been going for over 2 days now and shows no sign of stopping. I'm guessing it is taking so long because of the number of training sequences I have: the documentation recommends about 200 genes, whereas I have nearly 10 times that. Is this intuition correct? What runtimes have you had when training Augustus?
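If the size of the training set is the bottleneck, one way to test that is to run the optimization on a random subset of about 200 genes. The following is a minimal sketch, not part of Augustus: it assumes the training file is a GenBank flat file in which every record ends with a `//` line, and the function names `split_genbank_records` and `subsample` are hypothetical helpers introduced only for illustration.

```python
import random


def split_genbank_records(text):
    """Split GenBank flat-file text into records; each record ends with a '//' line."""
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "//":
            records.append("".join(current))
            current = []
    return records


def subsample(records, n, seed=0):
    """Shuffle reproducibly and return (n sampled records, remaining records)."""
    rng = random.Random(seed)
    shuffled = records[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[:n], shuffled[n:]
```

The sampled records can be joined with `"".join(sample)` and written to a new file for the optimization run, while the remaining records stay available for later training or testing.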


Hi,

I am working on the Oryza sativa genome (around 380 Mb). I started Augustus retraining three weeks ago and it is still running.

Could you please let me know how long it is likely to run?

If possible, please suggest a multi-threading option for the training step so that it finishes sooner.

Thanks

Amrinder


Hello, I have the same problem: my training has been running for about 2 weeks. Did you solve yours?


As this is a separate question, it should have been posted as a new thread.

The only way to speed things up is to configure MAKER to use MPI. It takes me about 6 days on 16 processors to finish one round on a ~2 Gb vertebrate genome in ~150,000 scaffolds, with protein evidence.


Dear all,

I'm using optimize_augustus.pl with a training set of 1000 genes and the parameter -cpus=20, on a 650 Mb genome, for 5 rounds (the default). One week has passed: all Augustus processes have stopped except one, which is still running with no sign of finishing, and the nohup file is no longer receiving any new output.

What happened with your run in the end? Could you share your experience here? Thanks a lot.

Sincerely,
Kang

12.5 years ago
Sujai Kumar

Try the autoAug.pl script that comes with Augustus 2.5.5, in the scripts directory:

autoAug.pl  --singleCPU --useexisting --genome=genome.fasta --species=speciesname --cdna=EST.fasta --trainingset=genome.gff3

We get the genome.gff3 training set from the output of a first-pass run of MAKER using:

  1. EST data (if available, same file as above)
  2. Proteins from related species
  3. A SNAP model trained using CEGMA
  4. A GeneMark model (obtained by running GeneMark.ES on the draft genome)
  5. Running maker2zff on the output of MAKER, and converting that to GFF3 (Carson Holt's scripts are brilliant; this one ensures that it only picks up higher-quality models from the prediction set)
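The ZFF-to-GFF3 conversion in step 5 can be sketched roughly as follows. This is an illustration under stated assumptions, not the script used above: it assumes the short four-column ZFF form (label, begin, end, group) under `>sequence` headers, with minus-strand features written as begin > end, and it assumes every model is complete so that the first coding base is in phase 0. The function name `zff_to_gff3`, the `source` value, and the choice of gene/mRNA/CDS features are all hypothetical.

```python
def zff_to_gff3(zff_text, source="maker2zff"):
    """Convert short-form ZFF text into GFF3 gene/mRNA/CDS lines (a sketch)."""
    models = {}  # (sequence id, group) -> list of (begin, end), in input order
    seq_id = None
    for line in zff_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seq_id = line[1:].split()[0]
            continue
        label, begin, end, group = line.split()[:4]
        models.setdefault((seq_id, group), []).append((int(begin), int(end)))

    out = ["##gff-version 3"]
    for (seq, group), exons in models.items():
        # Assumption: ZFF marks the minus strand by writing begin > end.
        strand = "-" if any(b > e for b, e in exons) else "+"
        spans = sorted((min(b, e), max(b, e)) for b, e in exons)
        start, stop = spans[0][0], spans[-1][1]
        out.append(f"{seq}\t{source}\tgene\t{start}\t{stop}\t.\t{strand}\t.\tID={group}")
        out.append(f"{seq}\t{source}\tmRNA\t{start}\t{stop}\t.\t{strand}\t.\t"
                   f"ID={group}.t1;Parent={group}")
        # Walk exons in transcription order to derive the GFF3 phase of each CDS.
        ordered = spans if strand == "+" else spans[::-1]
        coding_len, phases = 0, {}
        for s, e in ordered:
            phases[(s, e)] = (3 - coding_len % 3) % 3
            coding_len += e - s + 1
        for s, e in spans:
            out.append(f"{seq}\t{source}\tCDS\t{s}\t{e}\t.\t{strand}\t{phases[(s, e)]}\t"
                       f"ID={group}.cds;Parent={group}.t1")
    return "\n".join(out) + "\n"
```

Whether autoAug.pl expects exactly these feature types in its training set is not confirmed here, so check the output against the format the script documents before using it.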

Yes, it takes a while. Two days sounds about right in singleCPU mode for a 100-200 Mb metazoan.

Once done, we run MAKER a second time using the Augustus model and more stringent settings.

Let me know if you need more details on any of these steps.


I'm running MAKER on a non-model organism as well, but I only have EST data from an alternative species and none from the species itself. When running the autoAug.pl script, should I omit the --cdna flag, or use the EST data from the alternative species?
