Forum:Eukaryotic Genome Annotation in 2016
2
22
Entering edit mode
8.8 years ago

Hi,

I will have to annotate a genome soon. In the past, I have used Maker2 for this task. Maker2 was a pain in every department (installation, documentation, usage, usability of results, you name it), so I REALLY wish I can use something else. I would also want do avoid the GENBANK annotation pipeline.

The genome is about 400Mbp long and from a diploid species with lots of similar species that will have annotation, protein sequences, RNA-Seq data, etc.

I have been looking for alternatives but there does not seem to be tons of them. I found some suggestions in A beginner's guide to eukaryotic genome annotation (http://www.marcottelab.org/users/BIO337_2014/EukGeneAnnotation.pdf) but it dates back to 2012.

My main option now is to go with EVidence Modeler (EVM: http://evidencemodeler.github.io/) and PASA (http://pasapipeline.github.io/).

I would really like your input if there are other solid options that I am missing. For now, I am considering annotation pipelines, but maybe people do it differently, chaining different tools?

software genome annotation • 18k views
ADD COMMENT
0
Entering edit mode

I ended up developing a simpler genome annotation pipeline based on suggestions from a colleague. You can find more about it in this Biostar post: GAWN - Genome Annotation Without Nightmares GAWN - Genome Annotation Without Nightmares

ADD REPLY
18
Entering edit mode
8.8 years ago

What I've done with an eukaryotic genome I am currently working on is to use EVM to generate a high confidence list of gene predictions and then feed that into Augustus for training. Then used trained Augustus + hints for the final gene predictions. The general steps are:

1) RepeatModeler + RepeatMasker to get repeat regions

2) PASA on transcriptome if you have one (I used GMAP for mapping transcripts to genome). You can alternatively try the BRAKER method (The Augustus manual I think has a section on how to do that). Or you can also try to perform a reference assembly with STAR/tophat + cufflinks.

3) Exonerate align uniprot metazoa proteins on your genome

4) SNAP/GeneMark for ab initio predictions

5) Throw results from steps 2-4 into EVM (weighing the evidence is really subjective).

6) Get a list of "high-confidence" gene predictions from EVM that contains evidence from all three sources (PASA, exonerate, ab initio). This should be a relatively short list (I had ~800).

7) Train Augustus with these "high-confidence" EVM gene models.

8) Re-use PASA, Exonerate, SNAP, RepeatMasker as hints for Augustus and run gene predictions. The weighing of the hints is very subjective.

The number and completeness of your gene predictions will depend heavily on how fragmented/heterozygous your genome assembly is.

The most annoying part of this entire process for me was having to deal with all the different format conversions (MAKER does a lot of this stuff for you). EVM was relatively simple to deal with in terms of the input GTF file that it takes, but generating hint files for Augustus was more frustrating.

The reason I used Augustus for the final gene prediction was because it allows you to specify different feature categories like introns, exons, CDS, UTR, repeat regions, etc, which I thought might be more comprehensive than EVM.

An alternate method of doing this would be to run Augustus twice, instead of EVM + Augustus. You use the initial Augustus run to gather the high confidence gene models to train for the second Augustus run. I wanted to try to utilize different gene predictions methods, so I decided to do EVM + Augustus. I can't really tell you which is method is better to be honest.

**Edit If you really want to go crazy, you can try out Augustus' protein profile (Augustus-ppx) functionalities where it will attempt to scan the assembly for known protein motifs (kinda like a HMM) and incorporate that into the gene prediction. You can even generate a bunch of profiles based on alignments of genes of closely related species and use that.

ADD COMMENT
2
Entering edit mode

Thanks a lot for your very detailed answer!

ADD REPLY
0
Entering edit mode

I'm getting to installing all the tools required by your pipeline. What version of GeneMark do you use? There seems to be 3-4 variants (S, ES, T, .hmm).

I guess though that if I chose to use SNAP, I won't need GeneMark, right?

ADD REPLY
0
Entering edit mode

I actually used genemark-et which is a modified version of es that can use RNA-seq reads to try to do better predictions. If you want to go for a completely ab initio prediction, go with genemark-es/SNAP.

Yeah I think it is safe to use just SNAP or just GeneMark.

You can also try to predict CEGMA genes as an addition to your initial high-confidence gene models.

ADD REPLY
0
Entering edit mode

We are running exonerate now and are playing with the parameters, which can have a tremendous impact on the number hits (from very few to crazy high hits).

What parameters did you use in your annotation?

ADD REPLY
0
Entering edit mode

Hi Eric,

I used the protein2genome algorithm with refinement. I only kept hits where at least 50% of the protein aligns.

ADD REPLY
0
Entering edit mode

By I only kept hits where at least 50% of the protein aligns, to you mean you used --percent 50? If yes, that is not exactly what the --percent option does. It keeps proteins for which the actual score is at least 50% of the maximum possible score, which is a bit different given how the score is calculates. For example, introns will reduce the actual score. That being said, 50% seems like a good choice.

ADD REPLY
0
Entering edit mode

Thanks for the exonerate help! We are now at the snap stage. Our species is quite close to Drosophila melanogaster, for which an .hmm file is provided. Do you think we need to train snap with our specific genome? I guess using the D.melanogaster.hmm should be OK? Did you have to train snap for your genome?

ADD REPLY
0
Entering edit mode

For the species I was working on, we already had ~50-60 available Sanger sequenced genes from previous publications. We also had an available transcriptome, from which I found some pretty obvious coding genes (complete ORF, blast hit to related species). I used those genes to train SNAP and generated my own hmm.

ADD REPLY
0
Entering edit mode

We finished step 5 and are wondering how you extracted your list of "high-confidence" gene predictions in step 6. The EVM output and gff3 output are fairly different in terms of what information they present. Eg: the evidence provenances are only found in the output but are not found in the gff3 output.

We were thinking of running steps 2-4 with stringent parameters or post-filtering and then just accept what EVM will suggest. We could then re-run steps 2-4 with slightly less stringent parameters to provide these new outputs to Augustus. Would that make sense?

ADD REPLY
0
Entering edit mode

EVM runs the annotation step by first partitioning your reference contigs into individual folders. Then it can parallel run annotation on each contig. From my EVM run, I had a 'partition' folder which contains a subfolder for each reference contig. Within each contig folder, there is an EVM.out file.

I wrote a script to then iterate through the EVM.out file of every contig folder to generate an "evidence" file which looked something like:

scaffold.1       11      5015    PASA/asmbl_152961/align_1193019,genemark_99785_t        genemark,PASA

Meaning, there is a prediction on scaffold.1 from position 11-5015 with support from genemark and PASA. Using the scaffold and coordinates, I matched it to the prediction in the .gff file. So for the above prediction, I had a corresponding gff entry:

scaffold.1    EVM   gene    11      5015    .       +       .       ID=evm.TU.scaffold.1.1;Name=EVMprediction.01

**update

Here is the script for generating the evidence file

Save as whateverName.py and run by:

python whateverName.py pathToPartitionFolder

It's a pretty hacky script, tell me if it is not working and I'll try to modify it.

ADD REPLY
0
Entering edit mode

Dear Damian Kao,

Thanks for your python script. I've ran it but end up with an error:

Traceback (most recent call last): File "evalue_evidence.py", line 7, in <module> size = os.path.getsize(fPath) File "/usr/lib64/python2.7/genericpath.py", line 57, in getsize return os.stat(filename).st_size OSError: [Errno 2] No such file or directory: 'patition//flattened_line_5630 circular/evm.out'

I checked the result, some of predictions in the evm.out were processed, but not all. I don't understand python language, can I have your support here. Thank you very much.

Sincerely,
    Du Kang
ADD REPLY
0
Entering edit mode

It looks like maybe you added files into the partition folder? What is "patition/flattened_line_5630 circular"? I don't think that is a standard evm output? The script is expecting only folders in the partition directory. Remove all non-evm generated files from the partition directory.

ADD REPLY
0
Entering edit mode

Thank you very much, it's OK now, you are the best!!

ADD REPLY
0
Entering edit mode

Hi Damian,

I just have some questions regarding your comment.

1) If I use BRAKER method, it also already included Genemark-ET abinitio prediction and steps for optimization of Augustus training, do I need to perform step 4 in your comment?

2) Could we use EVM instead of Augustus as the final combiner of all prediction sources (Augustus, Genemark, Exonerate or other evidences), and how about JIGSAW in your opinion?

3) If I have RNAseq data, do you think I should do gene prediction based on both ways: a) RNAseq read alignment to genome (GMAP, Blat) and b) build transcriptome (Trinity, Tophat-Cufflink) then use these transcripts for gene prediction. Which way is better?

4) After all, when I obtain the ultimate gff file of gene prediction for my target genome, how could I know it is better than other prediction version?

Sorry for long questions, thank you very much in advance for your suggestion! It would be extreme helpful for my current research project.

ADD REPLY
0
Entering edit mode

1) The point of the first iteration is just to get a set of high quality annotations. Using more software sources to generate your potential set of annotations will only give you more confidence about your set of predictions. So if you do both BRAKER and Genemark-ET, you'll be able to filter for genes that have evidence from both sources. Of course, this probably depends on how different the BRAKER and Genemark-ET algorithms are. It is possible that you are just biasing your results towards your RNA-seq data.

2) You can use EVM instead of Augustus. I want to stress that there is no objectively "correct" way of doing gene predictions. I chose to use Augustus as the final software to do gene predictions, because I thought that its ability to predict multiple component classes of genic features (exons,introns,CDS,UTRs...) is an advantage over EVM.

3) I used PASA to try to consolidate my transcriptome data. And I did also perform a tophat-cufflinks assembly as input into PASA. I can't tell you whether it will be better or not. In my opinion, the point of using as many different software as possible is to capture as much potential data as possible, as various software have their strengths and weaknesses. And the job of EVM or Augustus is to consolidate them in a consistent way. At the end of the day, you will still have to subjectively decide how to filter out these various sources of evidences.

4) I think gene prediction can be a pretty subjective process. If you have known gene annotations confirmed experimentally, you can attempt to judge your set of predictions by them. However, if you've incorporated the known annotations into your prediction, then of course you will have them already in your predictions.

Most bioinformaticians can probably attest to the nightmare of chasing the "perfect" data. You can easily spend months iterating on the gene prediction process. There is no perfect gene predictions. There is just how inclusive your data is of real things along with false positives vs how strictly you are defining genes along with throwing out potentially real data. The months spend on iterating your gene predictions just moves the cursor from one end of the spectrum to the other.

In terms of practical, useful data for other people to use, you should give people the most lax set of gene predictions which will probably include a lot of crap or and also a more strict set of gene predictions that you've filtered for using various biologically sound criteria. What you provide is a rough guide-line. Ultimately, whoever is using your data will have to do filtering/assessment/more experiments for their own particular use.

ADD REPLY
0
Entering edit mode

Hi Damian,

Thank you very much for your detail answer!

Regarding the alignment of protein evidence onto assembled genome, should I use proteins (Uniprot source) from its neighbor species instead of metazoa proteins? Just because I think it would be better if the evidence come from close species, so as their genome similarity would be the highest. I also used Scipio for this purpose, but it seems to annotate only "exon" feature, which is quite limited compare to Exonerate.

And just another question, do you think PASAlite give the similar output as PASA do? (because I am using computing cluster without MySQL access privilige)

Best regards

ADD REPLY
0
Entering edit mode

I think neighboring species is fine. I used metazoa just to make sure I get everything. But that's probably overkill anyways. I haven't used PSASlite, so I have no clue.

ADD REPLY
0
Entering edit mode

Thank you very much!

ADD REPLY
0
Entering edit mode

Thanks for your pipeline !

Do you have a github or log file where I can find the commands you used ??

ADD REPLY
0
Entering edit mode

Dear Damian Kao,

In the step 8, using SNAP output as a hint. I don't know which kind of resource should I assign to it? None seems right for me.

# source of extrinsic information:
# M manual anchor (required)
# P protein database hit
# E est database hit
# C combined est/protein database hit
# D Dialign
# R retroposed genes
# T transMapped refSeqs

Do you have advice? Thanks very much.

  Sincerely, 
        Du Kang
ADD REPLY
0
Entering edit mode

The extrinsic sources are defined in a extrinsic.cfg file somewhere in your augustus directory. You can manually edit their weights or add new sources if you want. You are going to have to play around with the weights. This part can be very subjective.

ADD REPLY
0
Entering edit mode

Hi, what would one do if performing denovo genome annotation without any transcriptome data or RNAseq data availability? Is it good to go with closely related species transcriptome and other data availability?

ADD REPLY
0
Entering edit mode

The more relevant evidence you can use the better. I routinely use EST clusters from closely related species in my funannotate pipeline and then use curated UniProtKB proteins as evidence. I would try to use transcripts that are actually experimental, i.e. using the annotation from a closely related species may not be the best idea if they are just inferred from the ab initio gene predictors. But mappings from de novo transcriptome or EST from a closely related species can be informative to define intron-exon boundaries of highly conserved genes.

ADD REPLY
0
Entering edit mode

Hi Jon, I am aligning UniprotKB metazoa proteins reviewed and unreviewed all and aligning on to my 17000 scaffolds of 115Mb using exonerate. Do you have any idea how long it will take to give my final output? PC configuration is 16GB 64threads. I am in a very tight schedule to finish this draft genome annotation by next weekend. If it will take longer, is there any other way to tweak my pipeline? I have completed abinitio predictions using, GeneMark, augustus, geneid, snap and glimmerhmm. Now I want to derive one consensus and high confidence gene model using evidence modeler after integrating this protein2genome alignment.

ADD REPLY
0
Entering edit mode

Can we feed blastx results into exonerate?

ADD REPLY
0
Entering edit mode

For protein evidence alignments to a genome, you want to use exonerate --model protein2genome. Exonerate does a good job, but is pretty slow. What I do in funannotate is to run tblastn to get hits and then slice the genome according to hit location and run exonerate on that section - this speeds up the process quite a bit. For evidence modeler, you will need to conform to their GFF format for the protein alignments to work properly, there is a script in the EVM distribution that converts exonerate output to EVM GFF3 format, it requires that exonerate is run like this: exonerate --model p2g --showvulgar no --showalignment no --showquerygff no --showtargetgff yes --maxintron $MAX --percent 80 --ryo "AveragePercentIdentity: %pi\n" proteins.fasta genome.fasta

ADD REPLY
0
Entering edit mode

How much time will your pipeline take to annotate a nematode genome of size 115Mb in a PC with 16GB configuration? Can I skip the process of ab-initio prediction? I have results from five prediction output as mentioned above.

ADD REPLY
0
Entering edit mode

Dear Damian,

In "8) Re-use PASA, Exonerate, SNAP, RepeatMasker as hints for Augustus and run gene predictions. The weighing of the hints is very subjective.", how can RepeatMasker output be hints for Augustus? Is there any chance that it's a typo?

  Sincerely,
        Kang
ADD REPLY
0
Entering edit mode

Repeatmasker is used here to prevent augustus from predicting genes in repeat regions.

ADD REPLY
0
Entering edit mode

Thanks for your answer Damian. I've built a whole pipeline base on your intelligence. I feed every software with repeat masked assembly sequences, so I didn't expected in the end I have to hint Augustus with RepeatMasker result again. Anyway, so I guess the RepeatMasker hint file is adapted from the .out file from RepeatMasker results, and the value in extrinsic setting for RepeatMasker hint should be lower than 1, am I right? Thanks again, you really help me a lot. Sincerely, Kang

ADD REPLY
7
Entering edit mode
8.8 years ago
Jon ▴ 360

What Damian describes is essentially very similar to what I'm doing with a package of scripts to parse all of the different data formats together, right now is specific for fungi (but I think I could make it more general if there is interest) - funannotate located here: https://github.com/nextgenusfs/funannotate.

Funannotate is composed of 3 main scripts: predict, annotate, and compare. It can streamline the whole process and results in NCBI ready WGS submission files. It does the following steps and converts all of the output from the different tools into proper format for downstream processing.

funannotate predict

1) RepeatModeler de novo detection of repeats

2) RepeatMasker soft-mask repeats

3) Aligns protein evidence to genome using tblastn/exonerate (default is uniprotkb database - but can use multiple sources)

4) Aligns transcripts to genome using GMAP (can be a variety of sources, i.e. Trinity/PASA, closely related species ESTs, etc)

5) Trains/runs Augustus (if RNA-seq BAM file passed then uses BRAKER1 to train, if PASA/transdecoder GFF passed uses that, otherwise uses BUSCO prediction to train)

6) Runs GeneMark-ET/ES (GeneMark-ET via BRAKER1, otherwise self-training GeneMark-ES)

7) Evidence Modeler constructs best gene models from all predicitons (Augustus, GeneMark, PASA (optional), protein alignments, and transcript alignments.

8) tRNAscan-SE predicts tRNAs

9) Filters out transposons and "bad" gene models (internal stops, etc)

10) Make Genbank .tbl file using Genome Annotation Generator (GAG) and convert to GenBank flat-file using tbl2asn

11) Parse tbl2asn error report, removing problematic gene models

12) Finally run back through GAG -> tbl2asn

funannotate annotate

1) Assigns functional annotation using PFAM, InterPro, UniProtKB, MEROPS proteases, CAZymes, GO ontology, BUSCO models

2) Incorporates functional annotation using GAG

3) Creates GenBank submission files using tbl2asn and other scripts (.sqn, .tbl, .agp, .contigs)

funannotate compare

1) Parses functional annotation for each genome, making graphs, plots, etc

2) Runs ProteinOrtho5 clustering tool to find orthologous groups

3) Runs RAxML to generate phylogeny from randomly chosen BUSCO single copy orthologs

4) finally outputs all the data into HTML format

ADD COMMENT
0
Entering edit mode

Looks like a great pipeline! How do you filter out transposons?

ADD REPLY
0
Entering edit mode

Doing two things, 1) if a gene model is 90% contained within a repeat region it is removed and 2) Blastp searching a repeat database (from transposon PSI and Repbase). This seems to work reasonably well.

ADD REPLY
0
Entering edit mode

Thank you Jon for a more thorough detailing of your funannotate pipeline. Twitter was a poorer fit to explain all of these details :)

ADD REPLY
0
Entering edit mode

Hi Jon, I am following a similar pipeline to yours to predict the gene structures of a fungal genome, and I am thinking about the which weight give to each of the Evidences in EVM, but I am not very sure, Which are the weight that you use in your pipeline?

ADD REPLY
0
Entering edit mode

From discussions with Brian Haas, the evidences are somewhat empirical. Meaning that it depends a lot on what the input data is. Depending on the evidence that you pass to funannotate, the scripts automatically go with recommendations close to what Brian has recommended. Which are: in silico predictors (Augustus, GeneMark) are set to 1, transcript and protein alignments are also set to 1, if PASA/transdecoder is used it is set to 10, and then I do another trick with Augustus output to find well supported gene models (have >90% of exons supported by mapping evidence) and I give those high quality models a weight of 5.

ADD REPLY

Login before adding your answer.

Traffic: 1679 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6