When working with gene sequences from an understudied species, it can be challenging to know whether a gene prediction model is correct. Methods for determining this include comparison to other "known" gene predictions as well as to RNA-seq data, either manually or with the help of tools like GeneValidator (caveat: I am a coauthor).
Such approaches rest on having high-quality databases to compare to. Many of us know that the SwissProt database is high quality because the gene predictions it contains are manually examined and fixed (i.e., curated) by expert curators. But it contains relatively few genes from relatively few organisms.
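For instance, the curated subset can be pulled out programmatically. The following is a minimal sketch, assuming the current UniProt REST endpoint, and using Drosophila melanogaster (taxon ID 7227) purely as an illustrative organism:

```python
# Sketch: fetch only reviewed (Swiss-Prot, i.e. manually curated) entries
# for one organism via the UniProt REST API. Taxon ID 7227 (Drosophila
# melanogaster) is just an illustrative choice.
import requests

url = "https://rest.uniprot.org/uniprotkb/search"
params = {
    "query": "reviewed:true AND organism_id:7227",  # curated entries only
    "format": "fasta",
    "size": 500,
}
response = requests.get(url, params=params)
response.raise_for_status()
print(response.text[:1000])  # first few curated protein records
```

Nothing equivalent seems to exist for distinguishing curated from uncurated gene models in a new genome's official gene set, which is what the rest of this question is about.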
Additional curation occurs as part of most new genome projects: dozens to hundreds of gene predictions are similarly manually curated by PhD students, postdocs, staff scientists and professors. However, as far as I know, the knowledge of which gene predictions were curated and which are raw and uncurated is lost in the supplementary materials of every paper, because the manually curated and automatically generated gene predictions are merged into a single official gene set before submission to NCBI. This is potentially a huge loss. Or am I missing something?
In other words: is there a database that centralizes curated gene predictions? Or a "tag" by which to identify manually curated gene predictions in NCBI nr?
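For context, the closest thing I'm aware of are the INSDC /experiment and /inference feature qualifiers. If a submitter filled these in, something like the sketch below (assuming Biopython, with "annotations.gbk" as a hypothetical local GenBank file) could flag which CDS features carry explicit evidence, but that is far from a systematic "manually curated" tag:

```python
# Sketch: scan CDS features in a GenBank file for evidence qualifiers.
# "annotations.gbk" is a hypothetical local file; /experiment and /inference
# are standard INSDC qualifiers, but submitters are not required to set them.
from Bio import SeqIO

for record in SeqIO.parse("annotations.gbk", "genbank"):
    for feature in record.features:
        if feature.type != "CDS":
            continue
        evidence = {
            key: feature.qualifiers[key]
            for key in ("experiment", "inference")
            if key in feature.qualifiers
        }
        if evidence:
            locus = feature.qualifiers.get("locus_tag", ["?"])[0]
            print(locus, evidence)
```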
Thanks! Yannick
Thanks, Andrzej - this is very helpful. I didn't realize this distinction. Do you know whether the two types of submissions that people sequencing a new genome make would be similarly differentiated?