Question

How to compare two gff3 files and add the missing genes and features based on an available reference gff3?

5 months ago

Sony ▴ 20

Hi everyone,

I am annotating a rice pangenome that was built with an iterative mapping and assembly approach. The pangenome uses IRGSP-1.0 as the base reference, and the unmapped assembled sequences from other rice accessions were concatenated to this base reference to obtain the entire pangenome.

I then ran gene prediction and annotation following the MAKER annotation pipeline. First, I built a repeat library for my pangenome with RepeatModeler. As evidence, I used a cDNA data set (EST evidence) and the entire plant Swiss-Prot protein set (protein evidence), both in FASTA format. I ran the first round of MAKER with this evidence and the repeat library, kept only the genes with AED < 0.5 from the resulting gff3 file, and used these genes to train SNAP. For Augustus training, I used the embryophyta_odb10 database from BUSCO. Finally, I ran the second round of MAKER with the gff3 from round 1, snap.hmm and the Augustus parameters. However, the round 2 output contains only 20,160 predicted genes, whereas the Nipponbare IRGSP-1.0 annotation from RAP-DB has 37,861 genes.
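The AED < 0.5 filtering step can be sketched in Python as follows. This is only a minimal sketch: it assumes the MAKER gff3 stores the score as an `_AED=` attribute on `mRNA` features and that exons and CDS point to their mRNA through `Parent=`. The function names `ids_passing_aed` and `filter_gff` are hypothetical helpers, not part of MAKER.

```python
import re

def parse_attrs(column9):
    """Turn a GFF3 attribute column into a dict (key=value;key=value...)."""
    return dict(kv.split("=", 1) for kv in column9.split(";") if "=" in kv)

def ids_passing_aed(gff_lines, max_aed=0.5):
    """Collect IDs of mRNAs with _AED < max_aed, plus their parent gene IDs."""
    keep = set()
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "mRNA":
            continue
        # Assumption: the score is written as "_AED=<float>" (not "_eAED").
        m = re.search(r"(?:^|;)_AED=([0-9.]+)", cols[8])
        if m and float(m.group(1)) < max_aed:
            attrs = parse_attrs(cols[8])
            keep.add(attrs["ID"])
            keep.update(p for p in attrs.get("Parent", "").split(",") if p)
    return keep

def filter_gff(gff_lines, keep):
    """Keep genes/mRNAs whose ID passed, and sub-features whose Parent passed."""
    out = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        attrs = parse_attrs(cols[8])
        if cols[2] in ("gene", "mRNA"):
            ok = attrs.get("ID") in keep
        else:
            ok = any(p in keep for p in attrs.get("Parent", "").split(","))
        if ok:
            out.append(line.rstrip("\n"))
    return out
```

With the round 1 gff3 read into a list of lines, `filter_gff(lines, ids_passing_aed(lines))` returns the gene models to pass on to SNAP training. Evidence alignments and other features without a passing parent are dropped.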

I know that something probably needs to be optimised in MAKER for this pangenome. However, I have already spent too much time on this annotation step. Since my pangenome contains the entire IRGSP-1.0 sequence, I want to improve the annotation of the base reference part of my pangenome (= IRGSP-1.0) by comparing two gff3 files (one downloaded for IRGSP-1.0 from RAP-DB, one produced by MAKER), extracting the missing genes and their features along with their positions, and adding them to my pangenome gff.

However, because the MAKER annotation and the public annotation were produced differently, the positions of some features differ slightly between the two files. Would it be possible to apply this strategy by comparing the stop codons? Does anyone know how to do that?

Many thanks,

gff3 IRGSP1.0 MAKER • 371 views

updated 5 months ago by lieven.sterck 15k • written 5 months ago by Sony ▴ 20

score 0 · Answer 1 · 2024-11-19

Hi,

Merging gene annotations is theoretically possible, but it is a rather tricky approach.

Let me first say that your general setup on how to tackle this is fine ! (I would do nearly the same personally :) ).

There are a few things you can do or check to 'fix' the lower number of predicted genes:

- Double-check your repeat library. It is very well possible that there are some false positives in that set (== true genes falsely assigned to be TEs), especially if your "genome" is not non-redundant: you can have sequences in your pangenome that come from the same genomic location and will thus carry the same genes, which might be flagged as TEs (since they appear more often than expected in the dataset). You can check this, for instance, by screening your TE library against nr-prot and removing entries that do not look like TEs based on their functional description.
- You can tweak the parameters of SNAP/AUGUSTUS a bit to be more lenient in their prediction
- Check what the missing genes might have in common and specifically adapt your approach to compensate for this
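As a starting point for the last check, the reference genes that have no overlapping predicted gene on the same sequence and strand can be listed and summarised. This is a minimal sketch, assuming both gff3 files use identical sequence names for the IRGSP-1.0 chromosomes and the feature type `gene`. The names `read_genes`, `missing_genes` and `summarize` are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import median

def read_genes(gff_lines):
    """Map (seqid, strand) -> list of (start, end, gene_id) for 'gene' rows."""
    genes = defaultdict(list)
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        genes[(cols[0], cols[6])].append(
            (int(cols[3]), int(cols[4]), attrs.get("ID", ""))
        )
    return genes

def missing_genes(ref, pred):
    """Reference genes with no predicted gene overlapping on the same strand."""
    missing = []
    for (seqid, strand), ref_list in ref.items():
        pred_list = pred.get((seqid, strand), [])
        for start, end, gene_id in ref_list:
            # Two closed intervals overlap when each starts before the other ends.
            if not any(ps <= end and pe >= start for ps, pe, _ in pred_list):
                missing.append((seqid, start, end, strand, gene_id))
    return sorted(missing)

def summarize(missing):
    """Simple properties the missing genes might share: count, length, location."""
    lengths = [end - start + 1 for _, start, end, _, _ in missing]
    return {
        "count": len(missing),
        "median_length": median(lengths) if lengths else None,
        "per_seqid": Counter(seqid for seqid, *_ in missing),
    }
```

The pairwise overlap test is quadratic per chromosome and strand, which is slow but workable for a few tens of thousands of genes. Any overlap counts as "found" here, so a stricter criterion (for example a minimum overlap fraction) may be needed before trusting the list.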

If you go for merging the GFF files, there are some tools around that could help you with this (or rather, that will do it for you):

- The AGAT toolbox is the first one that comes to mind; see also: How can I merge GFF files together to produce a file with gene functions from both?
- Along the same lines: GFF3toolkit & Annotation Files Merger
- Or have a look at gene prediction tools that work on that principle, e.g. EVidenceModeler and many others (do a search and focus on 'combiner gene prediction tools')
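To illustrate what such a merge involves at its simplest, the sketch below pulls a set of reference genes together with all their sub-features out of a gff3 file, so they can be appended to the MAKER output. It is a naive illustration rather than a replacement for the tools above: it assumes parents are listed before their children in the file, and it does not handle ID clashes or partially overlapping models. The function name `extract_with_children` is hypothetical.

```python
def extract_with_children(gff_lines, gene_ids):
    """Return the rows for the given gene IDs and all of their descendants.

    Assumption: each parent row appears before its children, so one pass
    through the file is enough to follow the Parent= links.
    """
    wanted = set(gene_ids)
    out = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        feature_id = attrs.get("ID")
        parents = attrs.get("Parent", "").split(",")
        if feature_id in wanted or any(p in wanted for p in parents):
            out.append(line.rstrip("\n"))
            if feature_id:
                # Remember this feature so its own children are kept too.
                wanted.add(feature_id)
    return out
```

The returned rows could then be written after the MAKER rows in a new file. AGAT ships a script named `agat_sp_complement_annotations.pl` that is intended for this kind of completion; its `--help` output describes the current options.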