I am annotating two assemblies: one I assembled with SPAdes and one downloaded from NCBI. I used the data from PRJNA387062 (NCBI). When I ran Prokka on both assemblies, I got large differences between the two annotations. Below is the summary for the SPAdes assembly.
organism: Genus species strain
contigs: 491
bases: 3928941
repeat_region: 1
CDS: 3629
tRNA: 79
tmRNA: 1
rRNA: 4
The following is the summary of the annotation of the assembly submitted to NCBI.
organism: Genus species strain
contigs: 16
bases: 4599140
tRNA: 80
tmRNA: 1
CDS: 4420
repeat_region: 2
rRNA: 21
There is a vast difference in the rRNA and CDS counts. Is this difference acceptable?
It’s common to get different results from different annotators. Prokka is particularly conservative in its calling of features.
“Acceptable” depends on what you want to do with the data.
Both would be acceptable submissions; there are plenty of genomes in NCBI that have been annotated with Prokka.
I’d be more concerned by the fact that your two assemblies differ in size by over half a megabase, which could reasonably account for the ~800 CDS difference.
I used SPAdes for assembly. So what difference in the number of base pairs would be considered "acceptable"?
How did you arrive at the number of CDS difference?
There isn't an 'acceptable' value. It depends what's going on with your data. An 'acceptable' difference is one you can explain, without it negatively affecting your analyses.
Check both assemblies for genome coverage, contamination, the presence of mobile elements, etc.
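As a starting point for those checks, basic contiguity statistics already say a lot: 491 contigs versus 16 suggests the SPAdes assembly is much more fragmented. The following is only a minimal sketch, not part of Prokka or SPAdes; the function names `contig_lengths` and `n50` and the file name in the comment are made up for illustration. It reads FASTA lines and reports contig lengths and N50.

```python
from typing import Iterable, List


def contig_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of each record in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Hypothetical usage (the path is a placeholder, not a real file):
# with open("spades_contigs.fasta") as handle:
#     lengths = contig_lengths(handle)
#     print(len(lengths), sum(lengths), n50(lengths))
```

Running this on both assemblies gives a quick view of how the missing bases are distributed; it does not replace proper coverage or contamination checks.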
The difference in CDS is apparent from your data. One genome has ~3600 and the other has ~4400, a difference of roughly 800 loci. For most prokaryotes, there are (very) approximately 1000 CDSs per megabase. Your assemblies differ by about 0.67 megabases (4,599,140 vs. 3,928,941 bases), so a difference of somewhere under 1000 genes is what you would expect.
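The arithmetic behind that estimate can be checked directly from the two Prokka summaries in the question. This is just a sketch: the 1000 CDS per megabase figure is the rough rule of thumb mentioned above, not a measured value, and the variable names are arbitrary.

```python
# Counts copied from the two Prokka summaries in the question.
spades = {"contigs": 491, "bases": 3928941, "CDS": 3629, "rRNA": 4}
ncbi = {"contigs": 16, "bases": 4599140, "CDS": 4420, "rRNA": 21}

CDS_PER_MB = 1000  # very rough rule of thumb for prokaryotes (assumption)

# Size difference between the assemblies, in megabases.
size_diff_mb = (ncbi["bases"] - spades["bases"]) / 1e6

# CDS difference predicted from the size difference alone.
expected_cds_diff = size_diff_mb * CDS_PER_MB

# CDS difference actually reported by Prokka.
observed_cds_diff = ncbi["CDS"] - spades["CDS"]

print(f"size difference: {size_diff_mb:.2f} Mb")
print(f"expected CDS difference: ~{expected_cds_diff:.0f}")
print(f"observed CDS difference: {observed_cds_diff}")
```

The size gap alone predicts about 670 fewer CDSs, against 791 observed, so most of the discrepancy is consistent with sequence missing from the SPAdes assembly rather than with Prokka behaving differently on the two inputs.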