
Tool: GAWN - Genome Annotation Without Nightmares


7.3 years ago

Eric Normandeau 11k

I'm working on GAWN (https://github.com/enormandeau/gawn), a genome annotation pipeline that produces results quickly, based on evidence from an available transcriptome.

Popular genome annotation pipelines are usually difficult to install and use, and they take forever to run. That is, when they do not break on random scaffolds. Over the last few years, I have been looking hard for an easy way to annotate newly assembled genomes, even one that would only provide "good-enough" annotation without ab initio gene prediction.

Last week, a colleague (see the acknowledgement at the top of the README.md file in the GitHub repository) suggested we use the splice-aware GMAP aligner to generate basic gene annotations in GFF3 format. This worked beautifully. He then added that we could use Cufflinks and TransDecoder to add UTR regions. I then added Swissprot-based annotation of the transcripts, which I propagate to the annotated genes on the genome.

Using GAWN, I annotated 3 eukaryote genomes over the last few days (multiple times each, as I was developing). Depending on the genome size, annotation takes between 30 minutes and a few hours. This is orders of magnitude faster than more standard approaches and tools. The GMAP aligner does, however, require a good amount of RAM to index genomes: it took 61 GB of RAM to index a 3.7 Gb genome assembly, which is the main drawback. The other limitation is of course the requirement of having a transcriptome available for the same species or, alternatively, for a closely related species.

The output files are:

A genome annotation GFF3 file
A transcriptome annotation table
A genome annotation table
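
As a quick sanity check on the GFF3 output, one option is to count how many features of each type it contains. The sketch below is not part of GAWN; it only relies on the standard GFF3 layout (nine tab-separated columns, feature type in the third). The example records, including the "example" source column and the IDs, are made up for illustration.

```python
from collections import Counter


def count_gff3_features(lines):
    """Count feature types (column 3) in an iterable of GFF3 lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # An embedded FASTA section, if present, ends the annotation part.
        if line.startswith("##FASTA"):
            break
        # Skip blank lines, directives and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Ignore malformed lines that do not have the nine GFF3 columns.
        if len(fields) != 9:
            continue
        counts[fields[2]] += 1
    return counts


# Made-up records, for illustration only.
example_lines = [
    "##gff-version 3",
    "scaffold1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "scaffold1\texample\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "scaffold1\texample\texon\t100\t400\t.\t+\t.\tID=exon1;Parent=mrna1",
    "scaffold1\texample\texon\t600\t900\t.\t+\t.\tID=exon2;Parent=mrna1",
    "# a comment line",
    "",
]

feature_counts = count_gff3_features(example_lines)
for feature_type, n in sorted(feature_counts.items()):
    print(feature_type, n)
```

With a real file, the same function can be given an open file handle instead of a list of strings.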

Obviously, I still need to test how the annotation I am getting compares to existing annotations, especially the UTR annotations. However, since the annotation is based on the gene and exon positions of existing transcripts, which are annotated using Swissprot, I am fairly confident in the approach. I was thinking of trying the approach on the human and Danio genomes.

I am fairly excited about finally being able to annotate genomes rapidly and without nightmares.

Version v0.2, which is currently available, is fully functional. It has been tested on Linux only. It should work on OSX as long as you have the dependencies installed.

Here are the dependencies. The version numbers are the ones that have been tested. It is suggested that you use these or more recent versions, although the pipeline will probably work just fine with some older versions.

GNU Linux or OSX
bash 4+
python 2.7+ (TODO or 3.5+)
cufflinks v2.2.1+
wget 1.17.1
gnu parallel 2017xxxx+
blastplus utilities (blastx) 2.3.0+
a local copy of the swissprot database

I'd be glad to have your opinion on the approach, implementation, documentation, bugs, suggestions, etc.

Please chime in!

genome-annotation • 5.1k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 7.3 years ago by Eric Normandeau 11k


Hey, thank you for recommending such a good tool. I have a couple of questions:

1. Which transcriptome did you use for your annotation?
2. About "a local copy of the SwissProt database" in the last step: how do you prepare it? Did you download it from the BLAST FTP folder prepared by NCBI, or do you need to build your own depending on the transcriptome you use?
3. In the "genome.annotation.table" file in the 05_result folder, how can I get the original sequence from my genome using the ScaffoldName?

Thank you so much

ADD REPLY • link 7.3 years ago by YNFan • 0


Hi YNFan.

1. Ideally, you use a transcriptome from the species whose genome you want to annotate. Alternatively, use a species that is phylogenetically close.
2. Here is a link to install the blast utilities and get the databases: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download You will need to download the NR and SWISSPROT databases to your computer.
3. In the "genome.annotation.table" file, you have the name of your query. Use these names to get the sequences from the genome fasta file. To do this, put the sequence IDs in a text file (one ID per line) and run the following command, which uses one of my scripts found here: https://github.com/enormandeau/Scripts

fasta_extract.py genome.fasta wanted.ids wanted.fasta

You will need to make sure the format of the IDs in the wanted.ids file is exactly the same as the one in the genome fasta file.
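
For readers who want to see what this kind of extraction involves, here is a minimal sketch. It is not the fasta_extract.py script mentioned above and may behave differently from it. It assumes that a sequence ID is the first whitespace-delimited word of the FASTA header and that matching is exact and case-sensitive. The function names and the example data are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_by_id(fasta_lines, wanted_ids):
    """Return the records whose ID (first word of the header) is wanted."""
    wanted = {name.strip() for name in wanted_ids if name.strip()}
    found = []
    for header, sequence in read_fasta(fasta_lines):
        parts = header.split()
        # Assumption: the ID is the first word of the header line.
        seq_id = parts[0] if parts else ""
        if seq_id in wanted:
            found.append((header, sequence))
    return found


# Made-up data standing in for genome.fasta and wanted.ids.
genome_lines = [
    ">scaffold1 length=8", "ACGT", "ACGT",
    ">scaffold2", "TTTT",
    ">scaffold3", "GGCC",
]
wanted_lines = ["scaffold1", "scaffold3"]

extracted = extract_by_id(genome_lines, wanted_lines)
for header, sequence in extracted:
    print(">" + header)
    print(sequence)
```

Because the match is exact, an ID written as "Scaffold1" would not retrieve "scaffold1", which is why the ID format has to agree between the two files.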

ADD REPLY • link 7.3 years ago by Eric Normandeau 11k


Thank you very much for sharing your work and your insights with the community. Do you think your pipeline would be useful if we had a diverse set of evidence data (ESTs, RNA-seq, etc.)?

What I mean is whether we could pass multiple fasta files instead of a single transcriptome file.

ADD REPLY • link 7.1 years ago by chefarov ▴ 170


You can put multiple transcriptomes in a single fasta file and pass it to GAWN. Just be aware that in the annotation you risk having multiple annotations at the same (or similar) positions on the genome.

You could also add ESTs, RNA-seq assembled transcripts, etc., but you should avoid passing it a high volume of raw reads (let's say more than 100K).

Since GAWN blasts all the transcripts against the Swissprot database, giving it a lot of sequences (transcripts, ESTs, reads...) will make it very slow.
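
A minimal sketch of how several transcriptomes could be combined into one fasta file before passing it to GAWN. This is not part of GAWN. Prefixing each sequence ID with a label for its source is my own assumption, meant to avoid ID collisions and to make duplicate annotations easier to trace back; the labels, the function name and the example sequences are hypothetical. The 100K threshold simply reuses the rough figure given above.

```python
# Rough guideline from the discussion above, not a hard limit in GAWN.
MAX_SEQUENCES = 100000


def merge_fasta(sources):
    """Merge FASTA inputs, prefixing each sequence ID with its source label.

    sources: list of (label, iterable_of_fasta_lines) pairs.
    Returns (merged_lines, number_of_sequences).
    """
    merged = []
    n_sequences = 0
    for label, lines in sources:
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue
            if line.startswith(">"):
                # Hypothetical naming scheme: >label_originalID ...
                merged.append(">" + label + "_" + line[1:])
                n_sequences += 1
            else:
                merged.append(line)
    return merged, n_sequences


# Made-up inputs standing in for two transcriptome fasta files.
sources = [
    ("speciesA", [">t1 some description", "ACGT", "GGCC"]),
    ("speciesB", [">t1", "TTAA", ">t2", "CCGG"]),
]

merged_lines, n_sequences = merge_fasta(sources)
print("\n".join(merged_lines))
if n_sequences > MAX_SEQUENCES:
    print("Warning: many sequences, the Swissprot blast step may be very slow")
```

Note that the two inputs both contain a sequence called t1; after merging they become speciesA_t1 and speciesB_t1, so they remain distinguishable in the annotation tables.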

Please note that GAWN is not stable currently. Specifically, it breaks with some newer versions of its dependencies. I need to fix this and update the dependencies accordingly.

ADD REPLY • link 7.0 years ago by Eric Normandeau 11k


I would also really like to know the answer to this question. I'm embarking on some annotation for the first time and have no transcriptome from my species of interest (or any congener), so I was hoping to hedge my bets by supplying more than one transcriptome to whichever pipeline I end up using.

ADD REPLY • link 7.1 years ago by maxwhjohn1988 ▴ 130