I am working on assembling and annotating the genome of a non-model organism, and I have a set of about 3k genes from this genome that I am using to train my ab initio gene predictors. For Augustus, I am following the training procedure documented on this page. I converted the data to GenBank format and split the data into a training set and a test set, each containing 1.5k annotated sequences. After making the appropriate parameter/config files for this species, I launched the optimize_augustus.pl script with the 1.5k training sequences.
The page includes the caveat that this script will likely have to run overnight. However, it has been going for over 2 days now and shows no sign of stopping. I'm guessing it is taking so long because of the number of training sequences I have: the documentation recommends about 200 genes, whereas I have roughly 7.5 times that. Is this intuition correct? What runtimes have you had when training Augustus?
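If the size of the training set is the bottleneck, one option is to run the optimization on a random subset of about 200 genes. Below is a minimal sketch in Python, assuming the training data is a GenBank flat file in which every record ends with a line containing only `//`. The function name `sample_genbank_records` and the file names are hypothetical. Augustus also ships a `randomSplit.pl` script for splitting GenBank files, which may be the simpler route.

```python
import random


def sample_genbank_records(text, n, seed=0):
    """Return up to n randomly chosen GenBank records from a flat-file string."""
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        # A GenBank record is terminated by a line containing only "//".
        if line.strip() == "//":
            records.append("".join(current))
            current = []
    # Any trailing lines without a closing "//" are ignored as incomplete.
    if n >= len(records):
        return records
    # A fixed seed keeps the subset reproducible between runs.
    return random.Random(seed).sample(records, n)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    try:
        with open("train.gb") as handle:
            subset = sample_genbank_records(handle.read(), 200)
        with open("train.200.gb", "w") as out:
            out.writelines(subset)
    except FileNotFoundError:
        print("train.gb not found; adjust the path to your training file")
```

The smaller file could then be passed to optimize_augustus.pl in place of the full 1.5k set, with the remaining genes still available for a final training run.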
Hello, I have run into the same problem: my training has been running for about 2 weeks. Did you solve yours?
written 10.1 years ago by zy041225 (updated 2.9 years ago by Ram)
As this is a separate question, it should have been posted as a new thread.
The only way to speed things up is to configure MAKER to use MPI. It takes me about 6 days on 16 processors to finish one round on a vertebrate genome of ~2 Gb in ~150,000 scaffolds, with protein evidence.
written 9.7 years ago by mtollis (updated 2.9 years ago by Ram)
Hello,
I'm running optimize_augustus.pl with a training set of 1000 genes and the parameter --cpus=20, on a 650 Mb genome, for 5 rounds (the default). One week has passed. All of the augustus processes have stopped except one, which is still running with no sign of finishing, and the nohup file is no longer receiving any new output.
What happened with your run in the end? Could you share your experience here? Thanks a lot.
Sincerely,
Kang
We get the genome.gff3 training set from the output of a first-pass run of MAKER using:
EST data (if available, same file as above)
Proteins from related species
A SNAP model trained using CEGMA
A GeneMark model (obtained by running GeneMark-ES on the draft genome)
We then run maker2zff on the MAKER output and convert the result to GFF3. (Carson Holt's scripts are brilliant: this one ensures that only the higher-quality models from the prediction set are picked up.)
Yes, it takes a while. Two days sounds about right in single-CPU mode for a 100-200 Mb metazoan.
Once done, we run MAKER a second time using the Augustus model and more stringent settings.
Let me know if you need more details on any of these steps.
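To illustrate what "higher-quality models" can mean in practice, here is a minimal Python sketch that keeps only mRNA features below an AED cutoff. It assumes MAKER's usual GFF3 output, in which each mRNA carries an `_AED` attribute. The function name `filter_mrna_by_aed` and the 0.25 cutoff are hypothetical choices for illustration. This is not the logic maker2zff itself uses, only a rough stand-in for the idea.

```python
def filter_mrna_by_aed(gff3_lines, max_aed=0.25):
    """Yield the IDs of mRNA features whose _AED attribute is <= max_aed."""
    for line in gff3_lines:
        # Skip comments, directives and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # GFF3 feature lines have exactly nine tab-separated columns.
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # Assumption: lower AED means better agreement with the evidence.
        if "_AED" in attrs and float(attrs["_AED"]) <= max_aed:
            yield attrs.get("ID")
```

The resulting IDs could be used to pull the matching gene models out of the first-pass GFF3 before building the Augustus training set.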
I'm also running MAKER on a non-model organism, and I only have EST data from a related species, none from the species itself. When running the autoAug.pl script, should I omit the --cdna flag or use the EST data from the related species?
Hi,
I am working on the Oryza sativa genome, which is around 380 Mb. I started retraining Augustus three weeks ago and it is still running.
Could you please tell me how long it is likely to run?
If possible, please suggest a multi-threading option for the training step so that it finishes sooner.
Thanks,
Amrinder
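On the multi-threading question: optimize_augustus.pl accepts `--cpus`, `--kfold` and `--rounds` options. To my understanding (worth checking against the script's own help text), the parallelism is spread over the cross-validation folds, so asking for more CPUs than folds would not help. The Python sketch below only assembles such a command line under that assumption. The helper name `build_optimize_command`, the species name and the file name are hypothetical.

```python
import shlex


def build_optimize_command(species, train_gb, cpus=8, kfold=8, rounds=5):
    """Build an optimize_augustus.pl command line as a list of arguments."""
    # Assumption: jobs are parallelized per cross-validation fold, so CPUs
    # beyond the number of folds would sit idle. Cap cpus at kfold.
    effective_cpus = min(cpus, kfold)
    return [
        "optimize_augustus.pl",
        f"--species={species}",
        f"--cpus={effective_cpus}",
        f"--kfold={kfold}",
        f"--rounds={rounds}",
        train_gb,
    ]


if __name__ == "__main__":
    cmd = build_optimize_command("myspecies", "train.gb", cpus=20, kfold=8)
    # Print the command for inspection rather than running it directly.
    print(shlex.join(cmd))
```

If the assumption holds, raising `--kfold` to match the number of available CPUs is the way to use all of them, at the cost of smaller test buckets per fold.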