I'm working on GAWN https://github.com/enormandeau/gawn, a genome annotation pipeline that produces fast results based on evidence from an available transcriptome.
Popular genome annotation pipelines are usually difficult to install and use, and they take forever to run. That is, when they do not break on random scaffolds. Over the past few years, I have been looking hard for an easy way to annotate newly assembled genomes, even one that would only provide a "good-enough" annotation without ab initio gene prediction.
Last week, a colleague (see acknowledgement at the top of the README.md file in the GitHub repository) suggested we use the splice-aware GMAP aligner to generate basic gene annotations in GFF3 format. This worked beautifully. He then added that we could use cufflinks and TransDecoder to add UTR regions. I then added some Swissprot-based annotation of the transcripts, which I propagate to the annotated genes on the genome.
Using GAWN, I annotated 3 eukaryote genomes over the last few days (multiple times each, as I was developing). Depending on the genome size, annotation takes between 30 minutes and a few hours. This is orders of magnitude faster than more standard approaches and tools. The GMAP aligner does, however, require a good amount of RAM to index genomes. It took 61 GB of RAM to index a 3.7 Gb genome assembly, which is the main drawback. The other limitation is of course the requirement of having a transcriptome available for the same species or, alternatively, for a closely related species.
The output files are:
- A genome annotation GFF3 file
- A transcriptome annotation table
- A genome annotation table
Obviously, I still need to test how the annotation I am getting compares to existing annotations, especially for the UTRs. However, since the annotation is based on the gene and exon positions of existing transcripts, which are themselves annotated using Swissprot, I am fairly confident in the approach. I was thinking of trying it on the human and the Danio genomes.
I am fairly excited about finally being able to annotate genomes rapidly and without nightmares.
Version v0.2, which is currently available, is fully functional. It has been tested on Linux only. It should work on OSX as long as you have the dependencies installed.
Here are the dependencies. The version numbers are the ones that have been tested. It is suggested that you use these or more recent versions, although the pipeline will probably work just fine with some older versions.
- GNU Linux or OSX
- bash 4+
- python 2.7+ (TODO or 3.5+)
- cufflinks v2.2.1+
- wget 1.17.1
- gnu parallel 2017xxxx+
- blastplus utilities (blastx) 2.3.0+
- a local copy of the swissprot database
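Before running the pipeline, it can help to check that the command-line tools from the list above are on the PATH. Here is a minimal Python 3 sketch of such a check. It is not part of GAWN, and the executable names (`bash`, `cufflinks`, `wget`, `parallel`, `blastx`) are assumptions based on how these tools are usually installed. It does not check versions or the Swissprot database.

```python
import shutil


def find_missing(tools):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed executable names; adjust them to match your installation.
    tools = ["bash", "cufflinks", "wget", "parallel", "blastx"]
    missing = find_missing(tools)
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All listed tools were found on the PATH.")
```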
I'd be glad to have your opinion on the approach, implementation, documentation, bugs, suggestions, etc.
Please chime in!
Hey, thank you for recommending such a good tool. I have a couple of questions:
1. Which transcriptome did you use for your annotation?
2. About "a local copy of the SwissProt database" in the last step: how do you prepare it? Did you download the one prepared by NCBI from the BLAST FTP folder, or do you need to build your own according to the transcriptome you use?
3. In the "genome.annotation.table" file in the 05_result folder, how can I get the original sequence from my genome according to the ScaffoldName?
Thank you so much
Hi YNFan.
You will need to make sure the format of the IDs in the wanted.id file is exactly the same format as the one in the genome fasta file.
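To illustrate why the ID format matters, here is a minimal Python 3 sketch that pulls sequences out of a genome fasta file by exact ID match. It is only an illustration, not necessarily the tool meant above: the function names are hypothetical, and it assumes that the ID is the first whitespace-delimited token of each header line and that wanted.id holds one ID per line.

```python
def read_fasta(path):
    """Yield (sequence_id, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Assumption: the ID is the first token after ">"
                parts = line[1:].split()
                name, chunks = (parts[0] if parts else ""), []
            elif name is not None and line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def extract_wanted(fasta_path, wanted_ids):
    """Return {id: sequence} for IDs that match a FASTA ID exactly."""
    wanted = set(wanted_ids)
    return {name: seq for name, seq in read_fasta(fasta_path) if name in wanted}


def read_ids(path):
    """Read one ID per line, ignoring empty lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]
```

The match is exact and case-sensitive, so an ID written as "Scaffold2" in wanted.id will not find a sequence named "scaffold2" in the genome file. A call would look like `extract_wanted("genome.fasta", read_ids("wanted.id"))`, where `genome.fasta` is a placeholder file name.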
Thank you very much for sharing with the community your work and your insights. Do you think your pipeline would be useful if we had a diverse set of evidence data (ESTs, RNA-seq, etc)?
What I mean by this is whether we could pass multiple fasta files instead of a single transcriptome file.
You can put multiple transcriptomes in a single fasta file and pass it to GAWN. Just be aware that in the annotation you risk having multiple annotations at the same (or similar) positions on the genome.
You could also add ESTs, RNA-seq assembled transcripts, etc, but you should avoid passing it a high volume of raw reads (let's say more than 100K).
Since GAWN blasts all the transcripts on the Swissprot database, giving it a lot of sequences (transcripts, ests, reads...) will make it very slow.
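Combining several transcriptomes into a single fasta file, as suggested above, can be sketched in Python 3 as follows. This is only an illustration and not part of GAWN. The function name and the idea of prefixing each ID with a source label are assumptions, added here so that identical IDs from different files do not collide and so that each annotation can be traced back to its source.

```python
def merge_fastas(inputs, output_path):
    """Merge FASTA files into one, prefixing every ID with a source label.

    `inputs` is a list of (label, path) pairs.
    Returns the number of sequences written.
    """
    seen = set()
    with open(output_path, "w") as out:
        for label, path in inputs:
            with open(path) as handle:
                for line in handle:
                    line = line.rstrip("\r\n")
                    if not line:
                        continue  # skip blank lines
                    if line.startswith(">"):
                        header = f"{label}_{line[1:].lstrip()}"
                        seq_id = header.split()[0]
                        if seq_id in seen:
                            raise ValueError(f"duplicate sequence ID: {seq_id}")
                        seen.add(seq_id)
                        out.write(f">{header}\n")
                    else:
                        # Always end lines with a newline, even if the
                        # input file lacks a trailing one.
                        out.write(line + "\n")
    return len(seen)
```

A call would look like `merge_fastas([("speciesA", "a.fasta"), ("speciesB", "b.fasta")], "transcriptome.fasta")`, where the labels and file names are placeholders.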
Please note that GAWN is not stable currently. Specifically, it breaks with some newer versions of its dependencies. I need to fix this and update the dependencies accordingly.
I would also really like to know the answer to this question. I'm embarking on some annotation for the first time and have no transcriptome from my species of interest (or any congener), so I was hoping to hedge my bets by supplying more than one transcriptome to whichever pipeline I end up using.