7.7 years ago
qwzhang0601
Hello:
I am trying to use BUSCO to train Augustus for a new genome. I am running it on 20 nodes and it has been running for about 20 days. Is there any information I can use to estimate how long the training will take?
Since the following message, the log has not been updated. Can I estimate how long it will take?
WARNING 02/17/2017 19:23:40 => Optimizing augustus metaparameters, this may take a very long time...
#######
#below is the detail message from Busco
INFO ****************** Start a BUSCO 2.0 analysis, current time: 02/17/2017 13:29:03 ******************
INFO The lineage dataset is: mammalia_odb9 (eukaryota)
INFO Mode is: genome
INFO Maximum number of regions limited to: 3
INFO To reproduce this run: python /public/apps/busco/v2.0/python.2.7.8/BUSCO.py -i ../rawData/CasCan.a.10000.fasta -o Beaver -l /public/apps/busco/v2.0/python.2.7.8/mammalia_odb9/ -m genome -c 20 --long -sp human
INFO Check dependencies...
INFO Check input file...
INFO Temp directory is ./tmp/
INFO ****** Phase 1 of 2, initial predictions ******
INFO ****** Step 1/3, current time: 02/17/2017 13:29:18 ******
INFO Create blast database...
INFO [makeblastdb] Building a new DB, current time: 02/17/2017 13:29:19
INFO [makeblastdb] New DB name: /gs/gsfs0/users/qzhang/Beaver/data/train_augustus_busco/tmp/Beaver_1784058097
INFO [makeblastdb] New DB title: ../rawData/CasCan.a.10000.fasta
INFO [makeblastdb] Sequence type: Nucleotide
INFO [makeblastdb] Keep Linkouts: T
INFO [makeblastdb] Keep MBits: T
INFO [makeblastdb] Maximum file size: 1000000000B
INFO [makeblastdb] Adding sequences from FASTA; added 10000 sequences in 14.3675 seconds.
INFO Running tblastn, writing output to /gs/gsfs0/users/qzhang/Beaver/data/train_augustus_busco/run_Beaver/blast_output/tblastn_Beaver.tsv...
INFO ****** Step 2/3, current time: 02/17/2017 14:13:07 ******
INFO Getting coordinates for candidate regions...
INFO Pre-Augustus scaffold extraction...
INFO Running Augustus prediction using human as species:
INFO [augustus] Please find all logs related to Augustus here: /gs/gsfs0/users/qzhang/Beaver/data/train_augustus_busco/run_Beaver/augustus_output/augustus.log
INFO 02/17/2017 14:13:24 => 0% of predictions performed (4453 to be done)
INFO 02/17/2017 14:21:55 => 10% of predictions performed (490/4453 candidate regions)
INFO 02/17/2017 14:30:23 => 20% of predictions performed (936/4453 candidate regions)
INFO 02/17/2017 14:38:25 => 30% of predictions performed (1381/4453 candidate regions)
INFO 02/17/2017 14:46:05 => 40% of predictions performed (1826/4453 candidate regions)
INFO 02/17/2017 14:53:21 => 50% of predictions performed (2272/4453 candidate regions)
INFO 02/17/2017 15:00:53 => 60% of predictions performed (2717/4453 candidate regions)
INFO 02/17/2017 15:08:04 => 70% of predictions performed (3162/4453 candidate regions)
INFO 02/17/2017 15:15:06 => 80% of predictions performed (3607/4453 candidate regions)
INFO 02/17/2017 15:23:43 => 90% of predictions performed (4053/4453 candidate regions)
INFO 02/17/2017 15:31:05 => 100% of predictions performed
INFO Extracting predicted proteins...
INFO ****** Step 3/3, current time: 02/17/2017 15:32:45 ******
INFO Running HMMER to confirm orthology of predicted proteins:
INFO 02/17/2017 15:32:45 => 0% of predictions performed (4238 to be done)
INFO 02/17/2017 15:32:51 => 10% of predictions performed (468/4238 candidate proteins)
INFO 02/17/2017 15:33:02 => 20% of predictions performed (891/4238 candidate proteins)
INFO 02/17/2017 15:33:21 => 30% of predictions performed (1315/4238 candidate proteins)
INFO 02/17/2017 15:33:47 => 40% of predictions performed (1739/4238 candidate proteins)
INFO 02/17/2017 15:34:20 => 50% of predictions performed (2163/4238 candidate proteins)
INFO 02/17/2017 15:35:00 => 60% of predictions performed (2586/4238 candidate proteins)
INFO 02/17/2017 15:35:47 => 70% of predictions performed (3009/4238 candidate proteins)
INFO 02/17/2017 15:36:41 => 80% of predictions performed (3433/4238 candidate proteins)
INFO 02/17/2017 15:37:43 => 90% of predictions performed (3857/4238 candidate proteins)
INFO 02/17/2017 15:38:39 => 100% of predictions performed
INFO Results:
INFO C:51.0%[S:50.5%,D:0.5%],F:10.1%,M:38.9%,n:4104
INFO 2094 Complete BUSCOs (C)
INFO 2073 Complete and single-copy BUSCOs (S)
INFO 21 Complete and duplicated BUSCOs (D)
INFO 413 Fragmented BUSCOs (F)
INFO 1597 Missing BUSCOs (M)
INFO 4104 Total BUSCO groups searched
INFO ****** Phase 2 of 2, predictions using species specific training ******
INFO ****** Step 1/3, current time: 02/17/2017 15:38:40 ******
INFO Extracting missing and fragmented buscos from the ancestral_variants file...
INFO Running tblastn, writing output to /gs/gsfs0/users/qzhang/Beaver/data/train_augustus_busco/run_Beaver/blast_output/tblastn_Beaver_missing_and_frag_rerun.tsv...
INFO Getting coordinates for candidate regions...
INFO ****** Step 2/3, current time: 02/17/2017 18:40:21 ******
INFO Training Augustus using Single-Copy Complete BUSCOs:
INFO 02/17/2017 18:40:22 => Converting predicted genes to short genbank files...
INFO 02/17/2017 19:23:33 => All files converted to short genbank files, now running the training scripts...
WARNING 02/17/2017 19:23:40 => Optimizing augustus metaparameters, this may take a very long time...
Hello!
I did not clearly understand whether you want to know how long BUSCO will run, how long Augustus called by BUSCO will run, or just how long the optimization part will take.
I want to know whether I can estimate when BUSCO will finish, so that I can then use the trained Augustus to annotate a new genome. Since BUSCO has been running for 20 days, I am afraid I will have to wait a really long time. In that case, I will have to find another solution rather than wait.
Thanks
Whoa, 20 days, that's impressive! Are you using BUSCO v1 or v2?
Anyway, with either one, my runs never last more than one hour using several cores (~30 cores).
Sorry, I'm still not getting whether you are launching BUSCO alone or within some kind of wrapper that is part of the training for Augustus. I'm not very familiar with Augustus training using BUSCO, which is why I'm a bit confused.
Are you launching BUSCO independently in a terminal?
I used v2. It takes a long time because I used the --long parameter, which turns on Augustus optimization mode for self-training.
Thanks
As long as a program is running (consuming CPU cycles in top, etc.), there is not much you can do but have patience. But if you have seen error messages, or output files that are no longer growing, then you may want to consider aborting. Have you tried running it without --long? It only takes a few hours without --long in my experience.
Thanks. No, according to their manual the --long parameter is valuable when using BUSCO sets to train gene predictors. Since my goal is to train Augustus on a new genome, I use this parameter. If I have to wait another 20 days or even longer, maybe I will have to drop the --long parameter. So I wonder whether I can predict how long it will take, so that I can make a plan. The program still seems to be running, and it seems the Augustus model for the new genome was updated early this morning.
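One way to check the "output files that are no longer growing" advice is to look at when anything under the run directory was last written. The following is a minimal Python sketch, not part of BUSCO or Augustus; the function names are made up for illustration, and which directory to watch (for example the run_Beaver output directory or the Augustus config/species directory) is an assumption that depends on your setup.

```python
import os
import time


def newest_mtime(root):
    """Return (path, mtime) of the most recently modified file under root.

    Returns (None, 0.0) if no file is found.
    """
    newest_path, newest = None, 0.0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # file disappeared between listing and stat
            if mtime > newest:
                newest_path, newest = path, mtime
    return newest_path, newest


def hours_since_last_write(root, now=None):
    """Hours elapsed since any file under root was last modified, or None."""
    now = time.time() if now is None else now
    path, mtime = newest_mtime(root)
    if path is None:
        return None
    return (now - mtime) / 3600.0
```

Calling hours_since_last_write("run_Beaver") from the working directory would then give a rough idea of whether the job is still producing output. A recent write does not say how much work is left; it only suggests the run has not stalled.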
Were you able to find a workaround for it? I am having the same issue.
UPDATE:
For me, with ~63k scaffolds, it took ~2 days with 1 CPU. My problem is that this script fails on my end when multiple cores are given with the -c option.
Finally, I found that we need to install the perl module "Parallel::ForkManager" to run optimize_augustus.pl in parallel. You can look at the "augustus.log" file and check whether you have the same problem.
I found the following information in the "augustus.log" file, so I installed "Parallel::ForkManager", and after that it ran much faster.
.....
Writing exon model parameters [1] to file /gs/gsfs0/users/qzhang/tools/augustus3.2.3/config/species/BUSCO_NMR_100kb_Long_2426001641/BUSCO_NMR_100kb_Long_2426001641_exon_probs.pbl.
The perl module Parallel::ForkManager is required to run optimize_augustus.pl in parallel.
Install this module first. On Ubuntu linux install with
sudo apt-get install libparallel-forkmanager-perl
Will now run sequentially (--cpus=1)...
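To check for the same problem without reading the whole log by eye, something like the following Python sketch could be used. It is only an illustration: the function names are hypothetical, the marker strings are taken from the log excerpt quoted in this thread (other Augustus versions may word the message differently), and the module check simply asks perl to load the module.

```python
import subprocess


def forkmanager_fallback_lines(log_text):
    """Return log lines suggesting optimize_augustus.pl fell back to one CPU."""
    # Marker strings copied from the augustus.log excerpt quoted in the thread.
    markers = ("Parallel::ForkManager", "Will now run sequentially")
    return [
        line.strip()
        for line in log_text.splitlines()
        if any(marker in line for marker in markers)
    ]


def perl_module_available(module="Parallel::ForkManager"):
    """Return True if `perl -M<module> -e 1` exits successfully."""
    try:
        result = subprocess.run(
            ["perl", "-M" + module, "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return False  # perl itself is not on PATH
    return result.returncode == 0
```

For example, reading the file with open("augustus.log").read() and passing the text to forkmanager_fallback_lines would list the offending lines, if any. An empty list only means these particular messages were not found, not that the run is parallel.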
How many threads (-c N or --cpu N) are you using? If you have a large number of processors, set this accordingly; it will greatly increase the speed (the default is one, I think).
FYI, none of my BUSCO runs (even for genomes >2.5 Gbp) ever took more than 12 hrs of run time (with 16 procs), but I have never run with the --long option.
It has been quite some time since the last post on this thread. I don't know if it is still relevant, but in case somebody lands on this post, I want to share my experience.
Interestingly, in the BUSCO training mode using the --long option, it failed for me with the -c option, no matter how many threads I provided. So, in my case I didn't use the -c option, and hence it ran on 1 thread and took ~2 days.
In the non-training mode, i.e. evaluation mode, -c did work; this time I provided 55 threads and it completed in an hour or so. It was much faster because this time I was not training Augustus.