Hi all
I'm confused about the definition of "hypothetical" in PGAP RefSeq genome annotations. So far I had assumed that a hypothetical protein is supported only by evidence for an ORF, with no experimental evidence for a protein product. While scrolling through the GenBank file of E. coli K-12 substr. MG1655 (NC_000913), I stumbled over 5 entries named "hypothetical protein", and I wondered how these entries made it into the RefSeq annotation. Looking more closely at one of these 5 entries, I was even more puzzled: the gene "uraA" (2618871..2620160) has a protein product labeled "hypothetical protein", but also a protein_id (NP_416992.1) as well as references to Swiss-Prot (P0AGM7) and many other databases. So I assumed that this protein is still a predicted ORF without experimental evidence. But the Swiss-Prot entry P0AGM7 has experimental evidence at protein level (annotation score 5 of 5). How should I read this?
Thank you all in advance for helping me understand what's going on here...
Cheers, chscho
This seems rather confusing. My guess would be that NCBI's original record has not been updated with the latest annotation information that you can see via the external protein links.
True, but why would an entry starting with "U" (what are these entries anyway?) be updated in preference to a reference genome entry with an "NC" accession? Furthermore, according to UniProt, the protein structure was already solved (by X-ray crystallography) back in 2011. And the fact that this "hypothetical protein" has an "NP_" accession is really bugging me: if you perform a proteogenomic experiment to identify novel proteins (e.g. with a six-frame translation database), the definition of a "novel" protein (= not found in RefSeq) starts to fall apart...
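For anyone who wants to reproduce the list of "hypothetical protein" entries without scrolling through the file, here is a minimal sketch in plain Python. It assumes the standard GenBank flat-file layout (feature keys indented by 5 spaces, qualifiers starting with "/"), and the helper names `parse_cds` and `hypothetical_cds` are made up for illustration. It is a simplification, not a full parser: repeated qualifiers such as /db_xref keep only their last value, and multi-line values are joined with a space.

```python
import re

# A feature-key line has exactly 5 leading spaces, then the key, then the location.
FEATURE_LINE = re.compile(r"^ {5}(\S+)\s+(\S.*)$")


def parse_cds(genbank_text):
    """Return one dict per CDS feature, mapping qualifier names to values."""
    features = []
    current = None      # feature currently being filled in
    last_key = None     # last qualifier seen, for continuation lines
    in_table = False    # True while inside a FEATURES block

    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_table = True
            continue
        if not in_table:
            continue
        if line and not line.startswith(" "):
            in_table = False    # next section (ORIGIN, CONTIG, ...) begins
            continue

        match = FEATURE_LINE.match(line)
        if match:
            current = {"type": match.group(1), "location": match.group(2)}
            features.append(current)
            last_key = None
            continue

        text = line.strip()
        if current is None or not text:
            continue
        if text.startswith("/"):
            # New qualifier, e.g. /product="hypothetical protein"
            name, _, value = text[1:].partition("=")
            current[name] = value.strip('"')
            last_key = name
        elif last_key is not None:
            # Continuation of a qualifier value wrapped over several lines
            current[last_key] += " " + text.strip('"')
        # Otherwise it is a wrapped location line, which this sketch ignores.

    return [f for f in features if f["type"] == "CDS"]


def hypothetical_cds(genbank_text):
    """Return the CDS features whose /product is exactly 'hypothetical protein'."""
    return [
        f for f in parse_cds(genbank_text)
        if f.get("product", "").lower() == "hypothetical protein"
    ]
```

To use it, read the downloaded record into a string (the file name is whatever you saved NC_000913 as), pass it to `hypothetical_cds`, and print the `gene`, `location` and `protein_id` values of each hit using `dict.get`, since not every feature carries every qualifier.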