Hi everyone,
I am annotating a rice pangenome. The pangenome was built with an iterative mapping and assembly approach: IRGSP-1.0 serves as the base reference, and unmapped sequences assembled from other rice accessions were concatenated to it to form the full pangenome.
I then ran gene prediction and annotation following the MAKER pipeline. First, I built a repeat library for the pangenome with RepeatModeler. As evidence, I used a cDNA data set for the EST input and all plant Swiss-Prot proteins as protein evidence, both in FASTA format. I ran the first round of MAKER with this evidence and the repeat library, then kept only the genes with AED < 0.5 in the GFF3 file and used them to train SNAP. For Augustus training, I used the embryophyta_odb10 database from BUSCO. Finally, I ran the second round of MAKER with the GFF3 from round 1, snap.hmm, and the Augustus parameters. However, in the round 2 output MAKER predicts only 20,160 genes, whereas the Nipponbare IRGSP-1.0 annotation from RAP-DB has 37,861 genes.
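To make the AED filtering step concrete, a minimal Python sketch could look like the one below. It assumes the MAKER GFF3 stores the score as an `_AED=` attribute on `mRNA` lines; the function names are hypothetical and the 0.5 cutoff is simply the one mentioned above.

```python
def parse_attributes(field):
    """Turn a GFF3 column-9 string into a dict of key/value pairs."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def mrna_ids_below_aed(gff_lines, max_aed=0.5):
    """Return IDs of mRNA features whose _AED is strictly below max_aed.

    Assumption: the score is stored as an `_AED` attribute on mRNA lines.
    """
    keep = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # skip malformed lines and non-mRNA features
        attrs = parse_attributes(cols[8])
        aed = attrs.get("_AED")
        if aed is not None and float(aed) < max_aed:
            keep.append(attrs.get("ID"))
    return keep
```

Calling `mrna_ids_below_aed(open("round1.gff"))` (file name hypothetical) would give the transcript IDs that pass the filter; the matching gene and child features would still have to be pulled out by their `Parent` attributes.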
I know something probably needs to be optimised for MAKER on this pangenome, but I have already spent too much time on this annotation step. Since my pangenome contains the entire IRGSP-1.0 sequence, I would like to improve the annotation of the base reference part (= IRGSP-1.0) by comparing two GFF3 files (the IRGSP-1.0 GFF downloaded from RAP-DB and the one produced by MAKER), extracting the genes missing from the MAKER output together with their features and positions, and adding them to my pangenome GFF.
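One possible way to express "missing genes" is by coordinate overlap rather than by ID. The sketch below is only an illustration under several assumptions: both files use identical sequence names for the IRGSP-1.0 chromosomes, both contain `gene` features, and a reference gene counts as found when a MAKER gene on the same strand covers at least half of it. The function names and the `min_frac` threshold are hypothetical.

```python
from collections import defaultdict


def read_genes(gff_lines):
    """Collect (seqid, start, end, strand, attributes) for every gene line."""
    genes = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        genes.append((cols[0], int(cols[3]), int(cols[4]), cols[6], cols[8]))
    return genes


def missing_genes(ref_genes, new_genes, min_frac=0.5):
    """Return reference genes with no sufficiently overlapping new gene.

    Assumption: seqid names are the same in both annotations.
    """
    by_key = defaultdict(list)
    for seqid, start, end, strand, _ in new_genes:
        by_key[(seqid, strand)].append((start, end))

    missing = []
    for gene in ref_genes:
        seqid, start, end, strand, _ = gene
        length = end - start + 1
        covered = False
        for s, e in by_key[(seqid, strand)]:
            overlap = min(end, e) - max(start, s) + 1  # <= 0 if disjoint
            if overlap >= min_frac * length:
                covered = True
                break
        if not covered:
            missing.append(gene)
    return missing
```

This is a plain pairwise scan per chromosome and strand, which is simple but not fast; an interval tree or a dedicated tool would scale better. The child features (mRNA, exon, CDS) of each missing gene would still need to be collected through their `Parent` attributes before being appended to the pangenome GFF.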
However, because the MAKER annotation and the public annotation were produced differently, some features differ slightly in their coordinates. Would it be possible to do this comparison based on the stop codon position? Does anyone know how to do that?
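To illustrate what I mean by comparing the stop codon, here is a rough Python sketch. It takes the 3' end of the CDS of each transcript (the largest CDS end on the + strand, the smallest CDS start on the - strand) as the stop position and reports reference transcripts whose stop position has no counterpart in the MAKER file. It assumes both files use the same sequence names and the same convention on whether the stop codon is included in the CDS; the function names are hypothetical.

```python
def stop_positions(gff_lines):
    """Map transcript ID -> (seqid, strand, coordinate of the CDS 3' end)."""
    spans = {}  # transcript ID -> [seqid, strand, min CDS start, max CDS end]
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        for item in cols[8].split(";"):
            if not item.startswith("Parent="):
                continue
            # A CDS may list several parents separated by commas.
            for parent in item.split("=", 1)[1].split(","):
                if parent in spans:
                    spans[parent][2] = min(spans[parent][2], start)
                    spans[parent][3] = max(spans[parent][3], end)
                else:
                    spans[parent] = [seqid, strand, start, end]
    return {
        tid: (seqid, strand, hi if strand == "+" else lo)
        for tid, (seqid, strand, lo, hi) in spans.items()
    }


def transcripts_without_matching_stop(ref_lines, new_lines):
    """Reference transcript IDs whose stop position is absent from new_lines."""
    new_keys = set(stop_positions(new_lines).values())
    ref = stop_positions(ref_lines)
    return sorted(tid for tid, key in ref.items() if key not in new_keys)
```

An exact match on the stop coordinate is strict: a gene that MAKER predicted with a different 3' end would be reported as missing, so this may need to be combined with an overlap check.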
Many thanks,