Hello all, I am working goat genome sample, completed the assembly and masking of genome. Can anyone suggest how to go do gene prediction since augustus doesnt have any ref goat for gene prediction.
Hello all, I am working goat genome sample, completed the assembly and masking of genome. Can anyone suggest how to go do gene prediction since augustus doesnt have any ref goat for gene prediction.
Hi, 1. easy way- use on line training here "http://bioinf.uni-greifswald.de/webaugustus/training/create" and use the paramaeter for you species.
you can prepare training data set from the goat refseq files from NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Capra_hircus/latest_assembly_versions/GCF_001704415.1_ARS1/).
Notes: I used MAKER to annotate genome initially using RNA and Protein sequences from the closest species and the GFF output from this was used to get 1) multi-exonic 2) non-overlaping 3) relatively non-close sequences (<70% identity) transcripts. I have converted this GFF file (after filtration) to gb file format using "./scripts/gff2gbSmallDNA.pl" available in AUGUSTUS and used further according to the instructions given in above mentioned links (to be specific first link).
Hello Puli, I was trying to create the hints file from the transcripts from the ref. These were steps that i followed blat -minIdentity=92 genome.fa cdna.fa cdna.psl blat2hints.pl --in=cdna.psl --out=hints.E.gff augustus --species=human --hintsfile=hints.E.gff --extrinsicCfgFile=extrinsic.ME.cfg genome.fa
Maker pipeline i have not worked on that much. i would thankful if you provide some information on that.
Thank you
Hi,
You can find quite a lot of support information and support on how to run MAKER. To start with, you can follow the instructions at "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Improving_Annotation_Quality_with_MAKER.27s_AED_score" and one more place by darencard: maker_genome_annotation.md enter link description here General work flow for you will be:
Get the refseq (RNA and Protein) from the NCBI link that I have mentioned above.
Using MAKER align to de novo assembled genome by blanstn, blastx, est2genome (exonerate) and protein2genome(exonerate) (they will be called from MAKER and for proper installation follow the instructions given in the tutorial). This will generate GFF files for each contig/scaffold.
Merge GFFs using a script "./bin/gff3_merge" (you can find in the MAKER directory) and use this GFF for further filtration.
In filtration step: I suggest that first pick gene models with 5 prime and 3 prime UTRs that enriches good gene models for training. Then you can follow the further filtration steps. [if you face any problem with filtration after reaching to this stage write to me I can share AWK oneliners I have used for this purpose.]
Finally filtered GFF can be used for training AUGUSTUS.
Best
Hello Puli,
Maker i started and it is running from past 12 days. Why its taking so much time ? ( 12 threads i had given )
It created around 254 folders and each folders has > 100 subfolders. Total of > 35000 folders for each scaffold i guess.
I have around ~3gb genome and ~45k scaffolds.
I am just worrying why it is taking lot of time.
I am waiting for the result and then i ll merge the gff.
Can u share the awk one liner and training augustus offline afterwards workflow.
Hi,
It will take time (i guess mainly for running exonerate and blast). I used 50 threads for ~1gb genome and ~160K scaffolds and program finished in ~12 hrs. Its normal MAKER takes time (make sure it is using all threads; use "top" command and check). I will post one lines once you are done with running. Send me a header of merged GFF file.
Hello,
I think that maker is now on final step. Below is the out put what i m seeing.
cleaning blastx...
in cluster::shadow_cluster...
...finished clustering.
in cluster::shadow_cluster...
...finished clustering.
cleaning clusters....
total clusters:37 now processing 0
...processing 0 of 11
...processing 1 of 11
...processing 2 of 11
...processing 3 of 11
...
...processing 174 of 177
...processing 175 of 177
...processing 176 of 177
flattening protein clusters
prepare section files
Since i have given predicted proteins and transcripts can we use the maker out put as final result.
Can u tell me how can we filter out the Gff (final model) and then corresponding protein and gene model sequence.
Thank you in advance.
Hello,
I took little time but the gff was generated, Below is the look of my gff i used -g option for pulling out the gff.
##gff-version 3
###
###
Scaffold_21964|length1264 maker gene 741 1212 . + . ID=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0;Name=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0
Scaffold_21964|length1264 maker mRNA 741 1212 288 + . ID=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1;Parent=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0;Name=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1;_AED=0.05;_eAED=0.05;_QI=0|1|0.5|1|0|0|2|0|95
Scaffold_21964|length1264 maker exon 741 888 . + . ID=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1:1;Parent=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1
Scaffold_21964|length1264 maker exon 1073 1212 . + . ID=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1:2;Parent=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1
Scaffold_21964|length1264 maker CDS 741 888 . + 0 ID=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1:cds;Parent=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1
Scaffold_21964|length1264 maker CDS 1073 1212 . + 2 ID=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1:cds;Parent=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1
###
At the bottom of file fasta file ( assembled genome fasta is present) that should be removed. Individual gene or proteins sequences if we want to pull out like gff can we do that.??
#after running maker merged gff file was used and first round of filtration was carried out by AED score cutoff.
awk '{if($3=="mRNA")print$0}' ./base.all.gff |grep -v "trnascan"|grep "est2genome"| grep "mRNA-1"| grep "_AED=0.00"|grep "Name"|awk '{split($9,a,";");print a[2]}'|awk '{split($1,a,"=");print a[2]}'> ./for_Augustus/MAKER_filter1_list
grep -Fwf ./for_Augustus/MAKER_filter1_list ./base.all.gff > ./for_Augustus/MAKER_filter1.gff
#remove mono exonic genes
cat ./for_Augustus/MAKER_filter1.gff |awk '{split($9,a,";");print$3,a[2]}'|awk '{split($2,b,"=");print$1,b[2]}'| awk 'a ~ $0{print}; {a=$0}'|grep "^exon"| awk '{split($2,c,"-mRNA");print c[1]}'| sort -u > ./for_Augustus/non_monoExonic_MAKER_filter2.gff_list
grep -Fwf ./for_Augustus/non_monoExonic_MAKER_filter2.gff_list ./for_Augustus/MAKER_filter2.gff > ./for_Augustus/MAKER_filter3.gff
# get both three prime and five prime utr containing genes
cat ./for_Augustus/MAKER_filter3.gff |awk '/three_prime_UTR/ {print$0}' |awk '{split($9,a,";");print$3,a[2]}'|awk '{split($2,b,"=");print$1,b[2]}'| awk '{split($2,c,"-mRNA");print c[1]}'| sort -u > ./for_Augustus/three_prime_utr_positive_gene_list
cat ./for_Augustus/MAKER_filter3.gff |awk '/five_prime_UTR/ {print$0}' |awk '{split($9,a,";");print$3,a[2]}'|awk '{split($2,b,"=");print$1,b[2]}'| awk '{split($2,c,"-mRNA");print c[1]}'| sort -u > ./for_Augustus/five_prime_utr_positive_gene_list
sort ./for_Augustus/three_prime_utr_positive_gene_list ./for_Augustus/five_prime_utr_positive_gene_list |uniq -d > ./for_Augustus/five_three_prime_utr_positive_gene_list
grep -Fwf ./for_Augustus/five_three_prime_utr_positive_gene_list ./for_Augustus/MAKER_filter3.gff > ./for_Augustus/MAKER_filter4.gff
# remove utrs from gff (optional, but I will suggest for running without errors)
awk '!/three_prime_UTR|five_prime_UTR/ {print$0}' ./MAKER_filter4_mod.gff > ./MAKER_filter4_mod_withoutUTR.gff
#convert to gb format (scripts are in augustus)
./scripts/gff2gbSmallDNA.pl ./MAKER_filter4_mod_withoutUTR.gff /genome.fasta 10 ./MAKER_filter4_mod_withoutUTR.gb
# split test and training set (~1:5) (this is example, modify accordingly)
./scripts/randomSplit.pl ./MAKER_filter4_mod_withoutUTR.gb 200
grep -c LOCUS ./MAKER_filter4_mod_withoutUTR.gb*
./MAKER_filter4_mod_withoutUTR.gb:1279
./MAKER_filter4_mod_withoutUTR.gb.test:200
./MAKER_filter4_mod_withoutUTR.gb.train:1079
# create new file for your species (of interest)
./scripts/new_species.pl --species=species
#etrain
./bin/etraining --species=species ./MAKER_filter4_mod_withoutUTR.gb.train
#test
./bin/augustus --species=species ./MAKER_filter4_mod_withoutUTR.gb.test | tee firsttest.out
#optimize
./scripts/optimize_augustus.pl --cpus=10 --species=species ./MAKER_filter4_mod_withoutUTR.gb.train --metapars=./config/species/species/species_metapars.cfg
#busco
./scripts/run_BUSCO.py -c 10 --long -sp species -o species_test -i /genome.fasta -l /BUSCO/busco/data/eukaryota_odb9 -m genome
Let me know if you face any problem with running these lines.
Best
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Hi,
First, Are you trying to predict genes in ab-initio, meaning you are trying to predict genes for the first time? Have you tried to run Augustus?
In augustus command line,
you need to select a species name in species directory of augustus. The species must be a close species to your species. Later, you will get a gene prediction results. Pick some scaffolds and their .gff file and have a look at Artemis tool to check whether augustus predicted genes correctly (check exon and intron boundaries) and number of genes of your species. And also compare gene number with species that are close to your species.
Later, run augustus training step. at the step you will generate gene prediction parameters for your own species. Using new parameters, run new gene prediction. In the new prediction, use your own gene parameters and check results in Artemis. Keep doing training step until you get best gene prediction results (gene sets).
and let me know you need help.
Hello Mehmet, Thank you for your reply, I have used augustus many times for other genomes like rice, honeybee. I have question over here. Q 1. Which close species should be considered for running augustus( my sample is goat). Q 2 . Training step do i have to use the output from first step or wt is exct way. Thank you for your kind help.
Hi mks002,
Q 1. Which close species should be considered for running augustus( my sample is goat).
A1: You can have a look species list of augustus, and compare them to find which is closely related to goat species. For instance, have a look a paper that shows phylogenetic relationships of goat species with other species and check which one or ones are listed in species list of augustus.
Q 2 . Training step do i have to use the output from first step or wt is exct way. A2: After finding closely species to goat species, you need to use its parameters in augustus. Then, you will get a gene prediction results. Using the results, you need to start training step. In the training step, you need to use a new parameter (for instance, goat1.). Then, start a new gene prediction using goat1 as a species in species argument in augustus. After finishing new gene prediction with goat1 parameter, have a look results on Artemis. Repeat training step until you get best gene models. In each new gene prediction, use new species parameter. For instance, goat2, goat3, goat4, ...so on. According to my personal experience, you need to do at least 10 training step. Consider that goat genome should be more complex.
Please write if something is not clear or you need additional help. I can guide you.
Thank you for the detailed response.
1 . I went through some papers and then checked the closest species in augustus, human is what is only closest i observed.
2 . Say i predicted the genes first time using humans with human parameters (on default) it will do gene prediction. Using this result what new parameters should be considered (for instance goat 1). Basically what should be going in for training as input files and with what format. Also what would be the training commands.
OK. Just let me know when you finish the first gene prediction using human as species parameters. Later, I will guide you step by step.
Hello, Completed the gene prediction with the closest reference. Predicted ~40 k gene , which is close to one of the genes in one of the goat reference genome. should i go with steps. Thanks in advance
Hi,
Good. First you need to check gene features of your species with the reference species. For this,
Use eval.2.8.8 tool. Please download and install it. In the eval tool, there is a script get_general_stats.pl. This script uses .gtf files as input. You need to convert gff file produced by augustus. For this, use gffread command in cufflinks suite. For example, gffread yourfile.gff -T -o yourfile.gtf. Later, use yourfile.gtf and referencegoat.gtf in get_general_stats.pl script. Compare results. Later, we will start training step. Please let me know when you finish.
Hello , Thank you for your extended help. The results have been compared and observed stats as below
Description -------------1st column (sample goat)----------2nd column (ref goat)
Gene All stats
Transcript all stats
Exon All stats
Command i used get_general_stats.pl -q -A Sample_file.gtf ref_file.gtf
Hi Can you modify these results? I can not see very well. Please show parameters below:
number of genes number of proteins average exon number average intron number median exon median intron exon per gene etc.
Please reformat these results.
first gene prediction looks good based on the stats. Now you need to check other features. For this;
Please install Artemis tool from http://www.sanger.ac.uk/science/tools/artemis
Later, load genome.fa and bam file (from Tophat) to Artemis with gff file.
Please 1) Check intron-exon boundaries to see RNA reads should not cover intron, should cover only exons. 2) Check start and stop codons on genes.
have a look this tutorial:
https://img.jgi.doe.gov/docs/docs/genome-qa.pdf
You should notice that gene prediction is a time consuming process. You need to get best gene models to move to next step of your work.
Let me know when you finish. We need to perform manual gene curation and using curated genes we will start augustus training step.
For Artemis which bam file to upload since no alignment was performed. For this assembly Rna-seq data is not available. So we can go ahead ? what would be feasible solution.
This might give less reliable results if you do not have RNA-seq data. RNA-seq data is necessary to confirm gene predictions. Alternatively, you can check gene predictions using blast. For this, open fasta and gff file in artemis. Then, click on each gene and run blast to look at possible matches.
Try to run gene prediction using other gene prediction tools. Later, use EVM tool (https://evidencemodeler.github.io)
Now since Augustus gene models are already there can we go ahead with training steps.
Since Genome size ~3 gb it is quite time taking to go through all the processes. Can we do one step of training the augustus and work on it.
Artemis loading this big genome also is great trouble (error in loading automatically closing down). If it is just for validation and visualization can work ahead. Regards
You can select the first and second scaffolds of your genome and load them to Artemis. The first and second scaffolds are the biggest ones in your genome. My concerns are to make sure about your gene sets, as your downstream analyses absolutely depend on good gene sets.This is why I want you to get best gene sets.Lack of RNA-seq data is a big problem in producing reliable gene sets, because you will need to perform gene prediction based on hints files (using exon, intron, and intron+exon parameters).
For augustus training; you need about 700 curated genes, of which 500 genes for training and 200 genes test (have a look (ftp://188.44.46.157/New/augustus.2.7/docs/tutorial/training.html) and (http://avrilomics.blogspot.com.tr/2013/04/training-augustus-gene-finding-software.html) What you need to do are: 1. Pick the first and second scaffolds and load them to artemis separately with their corresponding gff file. 2. Check each gene using blast to confirm good matches in blast database, and delete genes that do not have good matches or have zero matches. 3. Keeps only genes that have good matches in blast searches. 4. Save each scaffold as genbank format (.gb) in artemis.
and let me know if you need.
I have >100Mb some 10 scaffolds. And picking one one gene would not be feasible i think need to rework some other way.