Question

Status of maker annotation

8.5 years ago

Prakki Rama ★ 2.7k

Hi all,

Is there any trick for checking the progress of a genome annotation running in MAKER? The annotation of a 1 Gb genome has been running for the past two months on 24 cores. Is there also a way to speed up annotation of larger genomes? Please share your comments and suggestions.

genome annotation maker • 3.7k views

ADD COMMENT • link updated 7 weeks ago by sansan96 ▴ 130 • written 8.5 years ago by Prakki Rama ★ 2.7k

score 3 · Accepted Answer · 2016-05-17

8.5 years ago

Juke34 8.9k

You can get an idea of the status of the annotation by comparing the number of finished contigs in the log file (the master datastore index log) with the number of contigs in your assembly.
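
The comparison can be sketched in Python. This is a minimal sketch, assuming the log is the master datastore index log with three whitespace-separated columns (sequence name, datastore path, status), as in the excerpt posted further down this thread. The function names are illustrative and not part of MAKER.

```python
from collections import Counter


def last_status_per_contig(log_lines):
    """Map each sequence name to the last status logged for it."""
    status = {}
    for line in log_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        # Column 1 is the sequence name, the last column is the status.
        status[fields[0]] = fields[-1]  # later lines overwrite earlier ones
    return status


def count_fasta_records(fasta_lines):
    """Count sequences in a FASTA file by counting header lines."""
    return sum(1 for line in fasta_lines if line.startswith(">"))


def progress_report(log_lines, fasta_lines):
    """Return (Counter of final statuses, number of contigs in the assembly)."""
    counts = Counter(last_status_per_contig(log_lines).values())
    return counts, count_fasta_records(fasta_lines)
```

Both functions accept any iterable of lines, so passing open file handles (for example `progress_report(open(log_path), open(fasta_path))`, with placeholder paths) streams the files instead of loading a large assembly into memory. `counts["FINISHED"]` against the returned total gives the fraction done.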

Most of the compute time goes into aligning the evidence supplied in FASTA format. So, to speed up the annotation, you can reduce the amount of evidence you feed to MAKER. You can also run the splice-aware alignment outside MAKER with a faster tool (e.g. GMAP) and give MAKER the resulting GFF (the est_gff option in maker_opts takes aligned ESTs or mRNA-seq from an external GFF3 file).

ADD COMMENT • link 8.5 years ago by Juke34 8.9k

Hi. Thank you for the suggestions; they were really helpful. Some of the genome scaffolds show a FAILED status. Should I extract these, run MAKER on them separately, and merge the GFF files at the end? Thanks in advance for your suggestions.

ADD REPLY • link 8.5 years ago by Prakki Rama ★ 2.7k

By default MAKER tries to annotate a chunk two times; after these two tries it skips it. There is no need to extract the failed sequences. When the MAKER run has finished, you can simply relaunch it with a setting that lets MAKER retry the failed sequences more than twice: either with the option "-t 6" (6 means try six times), or by changing the tries option within the maker_opts file (near the end of the file).
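
For the second route, the edit to the control file can be scripted. The following is a minimal sketch, assuming the file contains a line of the form `tries=2 #comment`, as in the maker_opts.ctl posted later in this thread. `set_tries` is a hypothetical helper, not something shipped with MAKER.

```python
import re


def set_tries(ctl_text, tries):
    """Return control-file text with the value on the `tries=` line replaced.

    The trailing '#' comment on that line is left untouched.
    """
    # Match only a line that starts with 'tries=' (so 'clean_try=' is ignored)
    # and replace the value up to the next whitespace or '#'.
    pattern = re.compile(r"^(tries=)[^\s#]*", flags=re.MULTILINE)
    new_text, n_changed = pattern.subn(
        lambda m: m.group(1) + str(tries), ctl_text
    )
    if n_changed == 0:
        raise ValueError("no 'tries=' line found in the control file text")
    return new_text
```

After writing the returned text back to the control file, MAKER is relaunched as described above; passing "-t 6" on the command line remains the alternative that needs no file edit.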

ADD REPLY • link 8.5 years ago by Juke34 8.9k

Thank you. I will try that!

edit: it was useful thank you!

ADD REPLY • link 8.4 years ago by Prakki Rama ★ 2.7k

I could check the number of finished chromosomes/contigs from the data store folder using the following command:

ls -ltrh */*/*/*.fasta | awk '{print $NF}'| awk -F'/' '{print $3}' | sort -u | wc -l

ADD REPLY • link 7.6 years ago by Prakki Rama ★ 2.7k

Hello Juke34,

I hope you can help me. I'm using MAKER for the first time to annotate a plant genome, and I have a couple of questions that I think are pretty simple. First, I pre-identified repeat elements with RepeatModeler and RepeatMasker through EarlGrey TE. I set the generated GFF3 (plant_scaffolds.full_mask.complex.reformat.gff3) as "rm_gff" and supplied transcripts assembled with Trinity, plus proteins, as evidence. However, I'm confused about which genome I should supply in each round of annotation. Should I use the soft-masked genome in each round of MAKER, or the unmasked (original) one?

Thanks so much.

ADD REPLY • link 12 weeks ago by sansan96 ▴ 130

You should not use a masked genome if you provide a repeat library or the repeats in GFF format. MAKER will mask it for you. If needed, you can retrieve the masked genome at any time after running MAKER using maker_get_rm_genome.pl, provided by MAKER.

ADD REPLY • link 12 weeks ago by Juke34 8.9k

Hello Juke34

Thank you very much for your reply. I have another question; I hope I'm not bothering you. Is it possible to run a MAKER annotation with the soft-masked genome as input? What configuration would you recommend?

I'm making a first attempt using rm_gff, protein and est with the non-soft-masked genome, but I haven't been successful: I'm getting a lot of FAILED scaffolds. I suspect my GFF file (rm_gff) has some error, and I'd like to try another way. This is the log:

plant_round1_normal_master_datastore_index.log

scaffold_1      plant_round1_normal_datastore/49/CD/scaffold_1/    STARTED
scaffold_1      plant_round1_normal_datastore/49/CD/scaffold_1/    FINISHED
scaffold_2      plant_round1_normal_datastore/87/E3/scaffold_2/    STARTED
scaffold_2      plant_round1_normal_datastore/87/E3/scaffold_2/    FINISHED
scaffold_3      plant_round1_normal_datastore/47/19/scaffold_3/    STARTED
scaffold_3      plant_round1_normal_datastore/47/19/scaffold_3/    FINISHED
....
scaffold_9      plant_round1_normal_datastore/F3/F3/scaffold_9/    STARTED
scaffold_9      plant_round1_normal_datastore/F3/F3/scaffold_9/    FAILED
scaffold_10     plant_round1_normal_datastore/C3/86/scaffold_10/   STARTED
scaffold_10     plant_round1_normal_datastore/C3/86/scaffold_10/   FAILED
....
scaffold_4000     plant_round1_normal_datastore/3A/5C/scaffold_13/   STARTED
scaffold_4000     plant_round1_normal_datastore/3A/5C/scaffold_13/   FAILED

My maker_opts.ctl is:

#-----Genome (these are always required)
genome=/primary/Acam_primary_round3_sort.fasta #genome sequence (fasta file or fasta embedded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est=/primary/anotacion_plant/maker_anotacion/evidence/Trinity_90.fasta #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closely related species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=/primary/anotacion_plant/maker_anotacion/evidence/Viridiplantae.fa  #protein sequence file in fasta format (i.e. from multiple organisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=simple #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/data/software/maker-2.31/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff=/EarlGrey/Acam_primary_round3_sort.fasta.prep.out.complex.reformat.gff3 #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

#-----Gene Prediction
snaphmm= #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=1 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file

#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)

pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)

split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes

tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP=/scratch #specify a directory other than the system default temporary directory for temporary files
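
Since the suspicion above is that the rm_gff file has an error, one cheap check is whether every sequence name in column 1 of the GFF3 also occurs as a header in the genome FASTA. The sketch below assumes a standard tab-separated GFF3 and FASTA IDs taken as the first word after ">". The function names are illustrative, and a clean result does not rule out other causes of the FAILED entries.

```python
def fasta_ids(fasta_lines):
    """Collect sequence IDs (first word after '>') from FASTA lines."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.add(words[0])
    return ids


def gff_seqids(gff_lines):
    """Collect column-1 sequence IDs from GFF3 feature lines."""
    ids = set()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no feature lines follow
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        ids.add(line.split("\t", 1)[0])
    return ids


def seqids_missing_from_genome(gff_lines, fasta_lines):
    """Return sorted GFF3 sequence IDs that have no matching FASTA header."""
    return sorted(gff_seqids(gff_lines) - fasta_ids(fasta_lines))
```

An empty list means the names agree. Any names returned point at a mismatch between the repeat GFF3 and the genome given in `genome=`, for example if the repeats were called on a differently named or sorted copy of the assembly.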

ADD REPLY • link 7 weeks ago by sansan96 ▴ 130