Hi all,
I am trying to train a gene prediction model for a non-model plant species using the data set from Arabidopsis thaliana. I am following this tutorial and its steps:
Steps followed so far:
(1) Download the Arabidopsis data provided by this tutorial (this is an example set):
wget -c ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_genomic.gbff.gz
(2) Randomly split the set of annotated sequences into a training set and a test set.
randomSplit.pl GCF_000001735.3_TAIR10_genomic.gbff 4
NOTE: I know that 4 is an extremely low number and that a training set should contain at least 200 genes; I just want to see which steps need to be executed before I run the same procedure on the actual data set.
(3) Create the files for training "my_genome" from a template.
new_species.pl --species=my_genome
(4) Make initial training set
etraining --species=my_genome GCF_000001735.3_TAIR10_genomic.gbff.train
This step fails with the following errors:
Constructing GenBank feature: Feature begins after it ends: 9388571,9389420..9390450
GBProcessor::getGeneList(): GBFeature constructor:Format error when reading genbank format.
Encountered error after reading 0 annotations.
Constructing GenBank feature: Feature begins after it ends: 1828296,1828395..1828689,1829291..1829438,1829624..1830211
GBProcessor::getGeneList(): GBFeature constructor:Format error when reading genbank format.
Encountered error after reading 0 annotations.
CDS contains character c
GBProcessor::getGeneList(): GBProcessor::getJoin( ): failed!!!
Encountered error after reading 0 annotations.
/augustus-3.2.3/bin/etraining: ERROR
No genbank sequences found.
Question:
I am just running the demo data set, which is expected to run without any issues. The message "CDS contains character c" is quite confusing. Any clues?
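Both rejected locations quoted in the log begin with a single position (9388571 and 1828296) instead of a start..end range. Whether that is what the parser objects to is only a guess, but it is cheap to check how common such locations are. The sketch below is a hypothetical helper (the name list_single_base_cds is made up here and is not part of AUGUSTUS) that prints every CDS location containing a segment without "..". It assumes the standard GenBank flat-file layout, with the feature key starting in column 6 and the location starting in column 22.

```shell
# Hypothetical diagnostic, not an AUGUSTUS tool: list CDS locations that
# contain a single-position segment such as "9388571" (no "..").
list_single_base_cds() {
    awk '
        # Print record name and location if any segment is a bare position.
        function report(   bare, n, parts, i) {
            bare = loc
            # Drop operators and partial-end markers, keeping only the segments.
            gsub(/complement\(|join\(|order\(|\)|[<>]/, "", bare)
            n = split(bare, parts, ",")
            for (i = 1; i <= n; i++)
                if (parts[i] !~ /\.\./) { print locus "\t" loc; break }
        }
        /^     CDS / {
            if (in_loc) report()
            loc = $2; in_loc = 1; next
        }
        # Wrapped location: 21 blanks followed by more location text.
        in_loc && length($0) > 21 && substr($0, 1, 21) ~ /^ +$/ && substr($0, 22, 1) != "/" {
            loc = loc $1; next
        }
        # Any other line ends the location, so evaluate what was collected.
        in_loc { report(); in_loc = 0 }
        /^LOCUS/ { locus = $2 }
        END { if (in_loc) report() }
    ' "$1"
}
```

Calling list_single_base_cds GCF_000001735.3_TAIR10_genomic.gbff.train would print the record name and the matching location, one per line. An empty result would suggest the single-position segments are not the whole story.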
EDIT 1: There are indeed sequences in the GenBank file:
grep "^LOCUS" GCF_000001735.3_TAIR10_genomic.gbff* -c
GCF_000001735.3_TAIR10_genomic.gbff:7
GCF_000001735.3_TAIR10_genomic.gbff.test:4
GCF_000001735.3_TAIR10_genomic.gbff.train:3
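The LOCUS counts above show that randomSplit.pl divided the file by record: 4 records went to the test set and the remaining 3 to the training set. With only 7 records in a whole-genome file, each record presumably holds an entire chromosome or organelle genome with many genes, so the number of records says little about the number of genes. A rough way to check is to count CDS features per record. The helper name count_cds_per_record is made up for this sketch, and the count is only a proxy, since a gene with several annotated isoforms contributes several CDS features.

```shell
# Hypothetical helper: print "record<TAB>number of CDS features" for each
# LOCUS record in a GenBank flat file.
count_cds_per_record() {
    awk '
        # A new record starts: emit the previous one, then reset the counter.
        /^LOCUS/ { if (locus != "") print locus "\t" n; locus = $2; n = 0 }
        # Feature keys start in column 6 of a GenBank flat file.
        /^     CDS / { n++ }
        END { if (locus != "") print locus "\t" n }
    ' "$1"
}
```

For example, count_cds_per_record GCF_000001735.3_TAIR10_genomic.gbff.train prints one line per record in the training file.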
Hi,
I am having the same problem. Did you already figure out how to solve it?
Thank you so much in advance,
Cristina Osuna
Hi Cristina
No, the problem remains the same. What is your organism? What files do you have?
~Vijay
Hi, I am getting the same problem. Can you please help me out if you have solved it?
Unfortunately, I could not.
I did the AUGUSTUS training a little differently, so it is working now. Thanks!
Hi, I am currently annotating some genomes. I have the same problem you had. I know your post is a little dated, but do you remember how you solved it? Thanks :)