I asked the NCBI help desk and here is the reply, for anyone else with similar concerns:
Dear Tim,
You ask some very shrewd questions. NCBI is continuing to refine the structure prediction algorithm within PGAP (Prokaryotic Genome Annotation Pipeline). Working as a biocurator with a deep interest in improving PGAP, and collaborating with our software developers, I have been trying to find ways to suggest improvements.
As a rule, our detection of coding region features (CDS) with frameshifts, whether from mutation or sequencing/assembly artifact, is excellent, and our labeling of features as "/pseudo" is very likely to be correct. For certain cases of programmed frameshift (release factor 2, or IS element transposases), we identify the programmed nature of the frameshift and correctly predict a full-length real protein. Our detection of "/pseudo" features with internal stop codons is also quite good.
We have greater difficulties distinguishing pseudogenes with truncated coding regions from valid real proteins that simply have novel domain architecture. We feel the algorithm works quite well for some heavily studied lineages, including E. coli, Salmonella, and other Gram-negative pathogens from the Enterobacteriaceae. PGAP has more difficulty in GC-rich taxa such as Streptomyces or Amycolatopsis. Changes over the last couple of years have shown improved preservation of long, multidomain proteins such as NRPS and PKS, but clearly some problems remain.
I will perform a more detailed analysis of our structural annotation for the Amycolatopsis, both to report to you how reliable our "partial-in-the-middle" pseudogene reports are in this species, and to document for our developers what the failure mode looks like when our assertion that a protein is "/pseudo" is most probably in error, and how we might fix it. One option we are exploring is making algorithmic changes, including setting a minimum threshold of percent identity before PGAP may judge that a difference in architecture represents a degraded gene rather than simply an alternative architecture. Another is an expansion of the set of proteins used for identification of homologs informative on gene structure.
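To illustrate the general idea of that first option, here is a minimal sketch in Python; it is only a hypothetical illustration of such a rule, not the actual PGAP implementation, and the threshold values are placeholders:

```python
# Hypothetical sketch of an identity threshold for truncation calls.
# Not PGAP code; thresholds are illustrative placeholders.

def call_pseudo_from_truncation(percent_identity: float,
                                homolog_coverage: float,
                                min_identity: float = 70.0,
                                max_coverage: float = 0.5) -> bool:
    """Decide whether a protein that is shorter than its homolog should be
    treated as a degraded (pseudo) gene rather than an alternative
    domain architecture.

    percent_identity : alignment identity to the homolog (0-100)
    homolog_coverage : fraction of the homolog covered by the alignment (0-1)
    """
    # A distant homolog is weak evidence of degradation: the shorter protein
    # may simply have a different architecture.
    if percent_identity < min_identity:
        return False
    # Even with a close homolog, require that a substantial portion of it be
    # missing before asserting the gene is degraded.
    return homolog_coverage < max_coverage


print(call_pseudo_from_truncation(85.0, 0.4))  # close homolog, large truncation -> True
print(call_pseudo_from_truncation(40.0, 0.4))  # distant homolog -> False
```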
I'd like to thank you for writing to us, as feedback on the concerns of experienced users assists us in finding and implementing ways of improving our pipeline.
EDIT - they sent a follow up email:
Once PGAP decides that a feature is a pseudogene, not a functional gene with start and stop codons in the expected locations, the obligation to run from start to stop is dropped. We try to use homology to available proteins or HMMs to estimate the size of the pseudogene feature.
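As a rough illustration of what that can mean in practice (my own sketch, not PGAP code, with a made-up input format), the extent of the pseudogene feature can simply be taken from the genomic footprint of the protein or HMM evidence rather than from a start/stop codon pair:

```python
# Illustrative sketch only: estimate a pseudogene feature's coordinates from
# the genomic regions covered by alignments to homologous proteins or HMM hits.

def pseudogene_extent(alignment_blocks):
    """alignment_blocks: list of (start, end) genomic coordinates covered by
    homology evidence (hypothetical input). Returns the overall span."""
    starts = [start for start, _ in alignment_blocks]
    ends = [end for _, end in alignment_blocks]
    return min(starts), max(ends)

# Three aligned segments of a broken coding region
print(pseudogene_extent([(1200, 1650), (1700, 2100), (2150, 2400)]))  # (1200, 2400)
```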
PGAP uses GeneMarkS-2+ for ab initio prediction, by the way, not SnapGene (a tool I do not use).
I attached a file showing my results from a very detailed review. The purpose was dual, partly to answer your questions and partly to continue work I have been doing, analyzing PGAP's performance and looking for ways to improve accuracy. We can expect accuracy to improve over time as we
- add new protein family models, especially protein profile hidden Markov models (HMMs)
- expand the set of reference proteins in our "Naming Set", currently about 14,000,000 proteins, most of which date from a collection built more than 8 years ago.
- make adjustments to the structural annotation algorithm in PGAP.
"pseudo=true" assertions made based on frameshifts and internal stop codons proved very reliable in the Amycolatopsis genome.
"pseudo=true" assertions made based on "partial-in-the-middle" findings were somewhat less reliable.
Examples seen included known problems such as
- too low a tolerance for tail-to-tail overlaps in GC-rich organisms
- partial-in-the-middle determinations based on homologs distant enough that domain architectures might be expected to differ
While we are working to address these issues, my finding is that pseudogene assignments are mostly correct, very much so for frameshifts and internal stops, but true even for most apparent truncations.
I recognize that for any one genome, it could be advantageous to see every protein translation that might be real. For NCBI, enabling that would create some costs. To instantiate every possible translation as a protein would flood databases of unique proteins with multiple broken forms for every proper form. PGAP plays a gatekeeper role, and occasionally stops the prediction of a real protein. We are very much aware this happens and are constantly looking for ways to improve our discrimination of real from pseudo.
I don't know the exact answer to either of your questions.
Yet consider this: there are 3 stop codons out of the 64 available. In the simplest statistical terms, a random reading frame would hit a stop roughly every 21 codons (64/3), so any open reading frame (ORF) longer than about 21 codons could be considered real. It is not quite that simple because codon usage is not uniform, yet ORFs considerably longer than 21 codons are unlikely to occur by chance, and the longer the ORF, the more unlikely it becomes. So they could have some kind of cut-off where any ORF longer than, say, 100 codons is automatically annotated, whether it has start and stop codons or not.
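As a rough back-of-the-envelope sketch (assuming uniform codon usage, which real genomes do not have), the chance probability of a reading frame staying open falls off quickly with length:

```python
# Probability that N consecutive random codons contain no stop codon,
# assuming uniform codon usage (61 sense codons out of 64).

def p_open_by_chance(n_codons: int) -> float:
    return (61 / 64) ** n_codons

for n in (21, 50, 100, 300):
    print(f"{n:>4} codons: P = {p_open_by_chance(n):.2e}")
```

With these numbers a 100-codon ORF has well under a 1% chance of being open by accident, which is why a length cut-off of that order would be a reasonable heuristic.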
Hi Mensur, thanks for the reply. Would the biological assumption here be that the CDS is using a start/stop codon of which we are unaware? Because I'm not sure how it could be considered real without these features.
timothy.kirkwood the best option is to send this question to the NCBI help desk. As Mensur Dlakic points out, NCBI could be using set criteria to call these in their annotation pipeline. These may be documented publicly (I could not easily find them, but that does not mean they are not there), and if not, you will get an official answer.
Thanks GenoMax, I followed your advice and added the NCBI reply to my post in case other people are interested.