
How to extract the missing features (gene, CDS, ...) when comparing two GFF3 files?

9 months ago

Sony ▴ 20

Hello everyone,

I have assembled several rice accessions and performed gene prediction and annotation on them (Nipponbare was also assembled and annotated as one of the accessions in my dataset). For gene prediction and annotation, I followed the MAKER annotation pipeline. As evidence, I used 50,000 rice FL-cDNAs and all plant protein sequences from the Swiss-Prot database. Repeats were identified with RepeatModeler and masked with RepeatMasker. For ab initio training, I used SNAP and trained Augustus with the embryophyta_odb10 database in BUSCO. Genes were filtered at AED < 0.5. However, for the Nipponbare sequence in my dataset, the MAKER pipeline predicted only 21,000 genes, whereas the existing Nipponbare IRGSP-1.0 annotation from RAP-DB has more than 39,000 genes.

I know that something needs to be optimized in my annotation pipeline. But for now, I just want to find which features in the IRGSP GFF3 file are missing from my Nipponbare GFF3 file, extract those missing features, and add them to my Nipponbare GFF3 file. What I tried was comparing columns 4 and 5 (start and end) between the two GFF3 files: if a start and end position in the IRGSP GFF3 file does not exist in my Nipponbare GFF3 file, it is considered a missing feature. However, this reported 30,000 missing "gene" features. I think comparing only the coordinates in columns 4 and 5 is not useful. My goal is to extract the missing features such as gene, CDS, exon, and so on.
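To make the comparison above concrete, here is a minimal Python sketch of it, not a definitive implementation. `missing_exact` reproduces the column 4/5 comparison (with the sequence ID and feature type added to the key, which is an assumption beyond what I originally did). `missing_by_overlap` is a hypothetical looser variant that only counts a reference feature as missing when no feature of the same type overlaps it. The function names, the tiny `ref` and `query` records, and their IDs are all made up for illustration.

```python
from collections import defaultdict


def parse_gff3(lines):
    """Yield (seqid, type, start, end, strand, attributes) for each feature line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 feature line
        yield cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6], cols[8]


def missing_exact(ref_lines, query_lines, feature_type="gene"):
    """Reference features whose (seqid, start, end) is absent from the query."""
    query_keys = {
        (seqid, start, end)
        for seqid, ftype, start, end, _, _ in parse_gff3(query_lines)
        if ftype == feature_type
    }
    return [
        f for f in parse_gff3(ref_lines)
        if f[1] == feature_type and (f[0], f[2], f[3]) not in query_keys
    ]


def missing_by_overlap(ref_lines, query_lines, feature_type="gene"):
    """Reference features that overlap no query feature of the same type."""
    by_seq = defaultdict(list)
    for seqid, ftype, start, end, _, _ in parse_gff3(query_lines):
        if ftype == feature_type:
            by_seq[seqid].append((start, end))
    missing = []
    for f in parse_gff3(ref_lines):
        seqid, ftype, start, end = f[0], f[1], f[2], f[3]
        if ftype != feature_type:
            continue
        # Two closed intervals overlap when each starts before the other ends.
        if not any(qs <= end and start <= qe for qs, qe in by_seq.get(seqid, ())):
            missing.append(f)
    return missing


# Hypothetical records: "ref" stands in for IRGSP, "query" for my MAKER output.
ref = [
    "##gff-version 3",
    "chr01\tref\tgene\t100\t500\t.\t+\t.\tID=geneA",
    "chr01\tref\tgene\t1000\t2000\t.\t-\t.\tID=geneB",
    "chr01\tref\tgene\t5000\t6000\t.\t+\t.\tID=geneC",
]
query = [
    "chr01\tmaker\tgene\t100\t500\t.\t+\t.\tID=g1",
    "chr01\tmaker\tgene\t1010\t1990\t.\t-\t.\tID=g2",
]

for feature in missing_exact(ref, query):
    print("exact-missing:", feature[5])
for feature in missing_by_overlap(ref, query):
    print("overlap-missing:", feature[5])
```

With real files, the same functions could be called on open file handles and with `feature_type="CDS"` or `"exon"`. Both variants assume the two GFF3 files use the same sequence names and the same coordinate system, which may not hold between an independent assembly and IRGSP-1.0.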

Does anyone know of a tool, program, or strategy to do that?

gff3 compare MAKER annotation RAP-DB
