Goat Genome gene prediction
1
1
Entering edit mode
7.2 years ago
mks002 ▴ 220

Hello all, I am working goat genome sample, completed the assembly and masking of genome. Can anyone suggest how to go do gene prediction since augustus doesnt have any ref goat for gene prediction.

sequencing next-gen genome • 4.0k views
ADD COMMENT
0
Entering edit mode

Hi,

First, Are you trying to predict genes in ab-initio, meaning you are trying to predict genes for the first time? Have you tried to run Augustus?

In augustus command line,

augustus [parameters] --species=SPECIES queryfilename

you need to select a species name in species directory of augustus. The species must be a close species to your species. Later, you will get a gene prediction results. Pick some scaffolds and their .gff file and have a look at Artemis tool to check whether augustus predicted genes correctly (check exon and intron boundaries) and number of genes of your species. And also compare gene number with species that are close to your species.

Later, run augustus training step. at the step you will generate gene prediction parameters for your own species. Using new parameters, run new gene prediction. In the new prediction, use your own gene parameters and check results in Artemis. Keep doing training step until you get best gene prediction results (gene sets).

and let me know you need help.

ADD REPLY
0
Entering edit mode

Hello Mehmet, Thank you for your reply, I have used augustus many times for other genomes like rice, honeybee. I have question over here. Q 1. Which close species should be considered for running augustus( my sample is goat). Q 2 . Training step do i have to use the output from first step or wt is exct way. Thank you for your kind help.

ADD REPLY
1
Entering edit mode

Hi mks002,

Q 1. Which close species should be considered for running augustus( my sample is goat).

A1: You can have a look species list of augustus, and compare them to find which is closely related to goat species. For instance, have a look a paper that shows phylogenetic relationships of goat species with other species and check which one or ones are listed in species list of augustus.

Q 2 . Training step do i have to use the output from first step or wt is exct way. A2: After finding closely species to goat species, you need to use its parameters in augustus. Then, you will get a gene prediction results. Using the results, you need to start training step. In the training step, you need to use a new parameter (for instance, goat1.). Then, start a new gene prediction using goat1 as a species in species argument in augustus. After finishing new gene prediction with goat1 parameter, have a look results on Artemis. Repeat training step until you get best gene models. In each new gene prediction, use new species parameter. For instance, goat2, goat3, goat4, ...so on. According to my personal experience, you need to do at least 10 training step. Consider that goat genome should be more complex.

Please write if something is not clear or you need additional help. I can guide you.

ADD REPLY
0
Entering edit mode

Thank you for the detailed response.

1 . I went through some papers and then checked the closest species in augustus, human is what is only closest i observed.

2 . Say i predicted the genes first time using humans with human parameters (on default) it will do gene prediction. Using this result what new parameters should be considered (for instance goat 1). Basically what should be going in for training as input files and with what format. Also what would be the training commands.

ADD REPLY
1
Entering edit mode

OK. Just let me know when you finish the first gene prediction using human as species parameters. Later, I will guide you step by step.

ADD REPLY
0
Entering edit mode

Hello, Completed the gene prediction with the closest reference. Predicted ~40 k gene , which is close to one of the genes in one of the goat reference genome. should i go with steps. Thanks in advance

ADD REPLY
0
Entering edit mode

Hi,

Good. First you need to check gene features of your species with the reference species. For this,

Use eval.2.8.8 tool. Please download and install it. In the eval tool, there is a script get_general_stats.pl. This script uses .gtf files as input. You need to convert gff file produced by augustus. For this, use gffread command in cufflinks suite. For example, gffread yourfile.gff -T -o yourfile.gtf. Later, use yourfile.gtf and referencegoat.gtf in get_general_stats.pl script. Compare results. Later, we will start training step. Please let me know when you finish.

ADD REPLY
0
Entering edit mode

Hello , Thank you for your extended help. The results have been compared and observed stats as below


Description -------------1st column (sample goat)----------2nd column (ref goat)


Gene All stats


    Count                           42085.00    28921.00  
    Total Transcripts               42085.00    53256.00  
    Transcripts Per                 1.00        1.84

Transcript all stats


    Count                           42085.00    53256.00  
    Average Length                  27120.43    57710.08  
    Median Length                   1749.00     17910.00  
    Total Length                    1141363275.00   3073407954.00

Exon All stats


    Count                           231778.00   208365.00 
    Average Length                  213.96      176.53    
    Median Length                   128.00      124.00    
    Total Length                    49590387.00     36783651.00

Command i used get_general_stats.pl -q -A Sample_file.gtf ref_file.gtf

ADD REPLY
0
Entering edit mode

Hi Can you modify these results? I can not see very well. Please show parameters below:

number of genes number of proteins average exon number average intron number median exon median intron exon per gene etc.

Please reformat these results.

ADD REPLY
0
Entering edit mode

first gene prediction looks good based on the stats. Now you need to check other features. For this;

Please install Artemis tool from http://www.sanger.ac.uk/science/tools/artemis

Later, load genome.fa and bam file (from Tophat) to Artemis with gff file.

Please 1) Check intron-exon boundaries to see RNA reads should not cover intron, should cover only exons. 2) Check start and stop codons on genes.

have a look this tutorial:

https://img.jgi.doe.gov/docs/docs/genome-qa.pdf

You should notice that gene prediction is a time consuming process. You need to get best gene models to move to next step of your work.

Let me know when you finish. We need to perform manual gene curation and using curated genes we will start augustus training step.

ADD REPLY
0
Entering edit mode

For Artemis which bam file to upload since no alignment was performed. For this assembly Rna-seq data is not available. So we can go ahead ? what would be feasible solution.

ADD REPLY
0
Entering edit mode

This might give less reliable results if you do not have RNA-seq data. RNA-seq data is necessary to confirm gene predictions. Alternatively, you can check gene predictions using blast. For this, open fasta and gff file in artemis. Then, click on each gene and run blast to look at possible matches.

Try to run gene prediction using other gene prediction tools. Later, use EVM tool (https://evidencemodeler.github.io)

ADD REPLY
1
Entering edit mode

Now since Augustus gene models are already there can we go ahead with training steps.

Since Genome size ~3 gb it is quite time taking to go through all the processes. Can we do one step of training the augustus and work on it.

Artemis loading this big genome also is great trouble (error in loading automatically closing down). If it is just for validation and visualization can work ahead. Regards

ADD REPLY
0
Entering edit mode

You can select the first and second scaffolds of your genome and load them to Artemis. The first and second scaffolds are the biggest ones in your genome. My concerns are to make sure about your gene sets, as your downstream analyses absolutely depend on good gene sets.This is why I want you to get best gene sets.Lack of RNA-seq data is a big problem in producing reliable gene sets, because you will need to perform gene prediction based on hints files (using exon, intron, and intron+exon parameters).

For augustus training; you need about 700 curated genes, of which 500 genes for training and 200 genes test (have a look (ftp://188.44.46.157/New/augustus.2.7/docs/tutorial/training.html) and (http://avrilomics.blogspot.com.tr/2013/04/training-augustus-gene-finding-software.html) What you need to do are: 1. Pick the first and second scaffolds and load them to artemis separately with their corresponding gff file. 2. Check each gene using blast to confirm good matches in blast database, and delete genes that do not have good matches or have zero matches. 3. Keeps only genes that have good matches in blast searches. 4. Save each scaffold as genbank format (.gb) in artemis.

and let me know if you need.

ADD REPLY
0
Entering edit mode

I have >100Mb some 10 scaffolds. And picking one one gene would not be feasible i think need to rework some other way.

ADD REPLY
2
Entering edit mode
7.2 years ago

Hi, 1. easy way- use on line training here "http://bioinf.uni-greifswald.de/webaugustus/training/create" and use the paramaeter for you species.

you can prepare training data set from the goat refseq files from NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Capra_hircus/latest_assembly_versions/GCF_001704415.1_ARS1/).

  1. You can follow the instructions given in the following links for making custom files and training AUGUSTUS. "https://vcru.wisc.edu/simonlab/bioinformatics/programs/augustus/docs/tutorial2015/training.html" "http://avrilomics.blogspot.in/2013/04/training-augustus-gene-finding-software.html" you can use refseq RNA and protein files from the NCBI link for getting RNA and PROTEIN files.

Notes: I used MAKER to annotate genome initially using RNA and Protein sequences from the closest species and the GFF output from this was used to get 1) multi-exonic 2) non-overlaping 3) relatively non-close sequences (<70% identity) transcripts. I have converted this GFF file (after filtration) to gb file format using "./scripts/gff2gbSmallDNA.pl" available in AUGUSTUS and used further according to the instructions given in above mentioned links (to be specific first link).

ADD COMMENT
0
Entering edit mode

Hello Puli, I was trying to create the hints file from the transcripts from the ref. These were steps that i followed blat -minIdentity=92 genome.fa cdna.fa cdna.psl blat2hints.pl --in=cdna.psl --out=hints.E.gff augustus --species=human --hintsfile=hints.E.gff --extrinsicCfgFile=extrinsic.ME.cfg genome.fa

Maker pipeline i have not worked on that much. i would thankful if you provide some information on that.

Thank you

ADD REPLY
0
Entering edit mode

Hi,

You can find quite a lot of support information and support on how to run MAKER. To start with, you can follow the instructions at "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Improving_Annotation_Quality_with_MAKER.27s_AED_score" and one more place by darencard: maker_genome_annotation.md enter link description here General work flow for you will be:

  1. Get the refseq (RNA and Protein) from the NCBI link that I have mentioned above.

  2. Using MAKER align to de novo assembled genome by blanstn, blastx, est2genome (exonerate) and protein2genome(exonerate) (they will be called from MAKER and for proper installation follow the instructions given in the tutorial). This will generate GFF files for each contig/scaffold.

  3. Merge GFFs using a script "./bin/gff3_merge" (you can find in the MAKER directory) and use this GFF for further filtration.

  4. In filtration step: I suggest that first pick gene models with 5 prime and 3 prime UTRs that enriches good gene models for training. Then you can follow the further filtration steps. [if you face any problem with filtration after reaching to this stage write to me I can share AWK oneliners I have used for this purpose.]

  5. Finally filtered GFF can be used for training AUGUSTUS.

Best

ADD REPLY
0
Entering edit mode

Hello Puli,

Maker i started and it is running from past 12 days. Why its taking so much time ? ( 12 threads i had given )

It created around 254 folders and each folders has > 100 subfolders. Total of > 35000 folders for each scaffold i guess.

I have around ~3gb genome and ~45k scaffolds.

I am just worrying why it is taking lot of time.

I am waiting for the result and then i ll merge the gff.

Can u share the awk one liner and training augustus offline afterwards workflow.

ADD REPLY
0
Entering edit mode

Hi,

It will take time (i guess mainly for running exonerate and blast). I used 50 threads for ~1gb genome and ~160K scaffolds and program finished in ~12 hrs. Its normal MAKER takes time (make sure it is using all threads; use "top" command and check). I will post one lines once you are done with running. Send me a header of merged GFF file.

ADD REPLY
0
Entering edit mode

Hello,

I think that maker is now on final step. Below is the out put what i m seeing.

cleaning blastx...
in cluster::shadow_cluster...
...finished clustering.
in cluster::shadow_cluster...
...finished clustering.
cleaning clusters....
total clusters:37 now processing 0
 ...processing 0 of 11
 ...processing 1 of 11
 ...processing 2 of 11
 ...processing 3 of 11
 ...
 ...processing 174 of 177
 ...processing 175 of 177
 ...processing 176 of 177
flattening protein clusters
prepare section files

Since i have given predicted proteins and transcripts can we use the maker out put as final result.

Can u tell me how can we filter out the Gff (final model) and then corresponding protein and gene model sequence.

Thank you in advance.

ADD REPLY
0
Entering edit mode

Hello,

My maker run is over

Below is the Output

`Start_time: 1509976900
End_time:   1511273877
Elapsed:    1296977`

Not able to merge the gff.

Can u help me out.

ADD REPLY
0
Entering edit mode

what was the error when you run command for merging GFF.

Also check in "base.maker.output/base_datastore", for gff files for individual contigs are present or not.

ADD REPLY
0
Entering edit mode

Hello,

I took little time but the gff was generated, Below is the look of my gff i used -g option for pulling out the gff.

##gff-version 3
###
###
Scaffold_21964|length1264       maker   gene    741     1212    .       +       .       ID=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0;Name=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0
Scaffold_21964|length1264       maker   mRNA    741     1212    288     +       .       ID=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1;Parent=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0;Name=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1;_AED=0.05;_eAED=0.05;_QI=0|1|0.5|1|0|0|2|0|95
Scaffold_21964|length1264       maker   exon    741     888     .       +       .       ID=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1:1;Parent=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1
Scaffold_21964|length1264       maker   exon    1073    1212    .       +       .       ID=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1:2;Parent=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1
Scaffold_21964|length1264       maker   CDS     741     888     .       +       0       ID=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1:cds;Parent=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1
Scaffold_21964|length1264       maker   CDS     1073    1212    .       +       2       ID=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1:cds;Parent=maker-Scaffold_21964|length1264-exonerate_protein2genome-gene-0.0-mRNA-1
###

At the bottom of file fasta file ( assembled genome fasta is present) that should be removed. Individual gene or proteins sequences if we want to pull out like gff can we do that.??

ADD REPLY
0
Entering edit mode
        #after running maker merged gff file was used and first round of filtration was carried out by AED score cutoff.     
awk '{if($3=="mRNA")print$0}' ./base.all.gff |grep -v "trnascan"|grep  "est2genome"| grep "mRNA-1"| grep "_AED=0.00"|grep "Name"|awk '{split($9,a,";");print a[2]}'|awk '{split($1,a,"=");print a[2]}'> ./for_Augustus/MAKER_filter1_list

        grep -Fwf ./for_Augustus/MAKER_filter1_list ./base.all.gff > ./for_Augustus/MAKER_filter1.gff

        #remove mono exonic genes

         cat ./for_Augustus/MAKER_filter1.gff |awk '{split($9,a,";");print$3,a[2]}'|awk '{split($2,b,"=");print$1,b[2]}'| awk 'a ~ $0{print}; {a=$0}'|grep "^exon"| awk '{split($2,c,"-mRNA");print c[1]}'| sort -u > ./for_Augustus/non_monoExonic_MAKER_filter2.gff_list

        grep -Fwf ./for_Augustus/non_monoExonic_MAKER_filter2.gff_list ./for_Augustus/MAKER_filter2.gff > ./for_Augustus/MAKER_filter3.gff

        # get both three prime and five prime utr containing genes

        cat ./for_Augustus/MAKER_filter3.gff |awk '/three_prime_UTR/ {print$0}' |awk '{split($9,a,";");print$3,a[2]}'|awk '{split($2,b,"=");print$1,b[2]}'| awk '{split($2,c,"-mRNA");print c[1]}'| sort -u > ./for_Augustus/three_prime_utr_positive_gene_list

        cat ./for_Augustus/MAKER_filter3.gff |awk '/five_prime_UTR/ {print$0}' |awk '{split($9,a,";");print$3,a[2]}'|awk '{split($2,b,"=");print$1,b[2]}'| awk '{split($2,c,"-mRNA");print c[1]}'| sort -u > ./for_Augustus/five_prime_utr_positive_gene_list

        sort  ./for_Augustus/three_prime_utr_positive_gene_list ./for_Augustus/five_prime_utr_positive_gene_list |uniq -d > ./for_Augustus/five_three_prime_utr_positive_gene_list


        grep -Fwf ./for_Augustus/five_three_prime_utr_positive_gene_list ./for_Augustus/MAKER_filter3.gff > ./for_Augustus/MAKER_filter4.gff

    # remove utrs from gff (optional, but I will suggest for running without errors)

    awk '!/three_prime_UTR|five_prime_UTR/ {print$0}' ./MAKER_filter4_mod.gff > ./MAKER_filter4_mod_withoutUTR.gff


    #convert to gb format (scripts are in augustus)

    ./scripts/gff2gbSmallDNA.pl ./MAKER_filter4_mod_withoutUTR.gff  /genome.fasta 10 ./MAKER_filter4_mod_withoutUTR.gb

    # split test and training set (~1:5) (this is example, modify accordingly)

    ./scripts/randomSplit.pl ./MAKER_filter4_mod_withoutUTR.gb 200
    grep -c LOCUS ./MAKER_filter4_mod_withoutUTR.gb*                                                               

    ./MAKER_filter4_mod_withoutUTR.gb:1279
    ./MAKER_filter4_mod_withoutUTR.gb.test:200
    ./MAKER_filter4_mod_withoutUTR.gb.train:1079

    # create new file for your species (of interest) 

    ./scripts/new_species.pl --species=species

    #etrain

    ./bin/etraining --species=species ./MAKER_filter4_mod_withoutUTR.gb.train

    #test
    ./bin/augustus --species=species ./MAKER_filter4_mod_withoutUTR.gb.test | tee firsttest.out

    #optimize
    ./scripts/optimize_augustus.pl --cpus=10 --species=species ./MAKER_filter4_mod_withoutUTR.gb.train --metapars=./config/species/species/species_metapars.cfg



    #busco
    ./scripts/run_BUSCO.py -c 10 --long -sp species -o species_test -i /genome.fasta -l /BUSCO/busco/data/eukaryota_odb9 -m genome

Let me know if you face any problem with running these lines.

Best

ADD REPLY

Login before adding your answer.

Traffic: 1551 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6