Hi, I have recently been working on annotating a plant genome. I chose AUGUSTUS to predict genes. I have read the documentation on training sets, but I can't understand it.
First, the "retraining AUGUSTUS" protocol needs a training set, a test set, and a meta parameters file. Are the training set and the test set complete sequences? How can I get them? From NCBI? And how do I configure the file (*.cfg) for the meta parameters?
Second, the hints file: where does it come from, or how is it generated?
Is there any relationship between "retraining AUGUSTUS" and the hints?
The training set is a file of genes in GenBank format to use for training. The test set is also a file of genes in GenBank format, which you may use to assess the quality of the training. The meta parameters are various parameters used by AUGUSTUS for prediction.
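To make the training/test distinction concrete, here is a minimal Python sketch that splits one multi-record GenBank file into two disjoint sets of genes. It is only an illustration, not the tool from the retraining document (the AUGUSTUS distribution has its own script for this, randomSplit.pl, if I remember correctly). The function names and the toy records are made up for the example, and it assumes every record ends with a line containing only "//".

```python
import random


def read_genbank_records(text):
    """Split multi-record GenBank text into a list of record strings.

    Assumes each record is terminated by a line containing only '//'.
    Any trailing text without a terminator is ignored.
    """
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "//":
            records.append("".join(current))
            current = []
    return records


def split_train_test(records, test_size, seed=0):
    """Shuffle the records and hold out `test_size` of them as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


# Toy input: ten placeholder records (hypothetical, not real gene data).
toy = "".join(f"LOCUS       gene{i}\nORIGIN\n//\n" for i in range(10))
records = read_genbank_records(toy)
train, test = split_train_test(records, test_size=3)
print(len(train), len(test))
```

With a real file you would read the text from disk and write the two lists out to separate files. The point is simply that both sets are gene records in the same format, and that no gene appears in both.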
You must choose your own training and test set of genes. The "retraining AUGUSTUS" page suggests a number of possible sources:
GenBank
Spliced alignments of ESTs against the assembled genomic sequence, e.g. PASA
Spliced alignments of protein sequences of a related species against the assembled genomic sequence, e.g. GeneWise
Data from a related species
Iterate retraining with predicted genes
The meta parameters should be based on the generic ones that come with AUGUSTUS in generic_parameters.cfg and generic_weightmatrix.txt.
Training AUGUSTUS can seem intimidating at first, but if you follow the retraining document it is reasonably straightforward. In particular, the steps in the section 3. RUN THE SCRIPT optimize_augustus.pl are easy to follow.
Dear David,
I'm using optimize_augustus.pl with a training set of 1000 genes and the parameter -cpus=20, on a 650M genome, for 5 rounds (the default). One week has passed. All the augustus processes have stopped except one, which is still running with no sign of finishing, and the nohup output file is not gaining any new information.
It's quite a dilemma for me now. Can you give me some advice? Thanks.