I have been tasked with computing statistics on a dataset of microbial protein sequences, most importantly the average number of proteins per genome (think 'strain' or 'isolate') at a given taxonomic level. Because some proteins in our dataset come from incomplete genomes, I decided to consider only 'complete' genomes so that the statistics are as accurate as possible. I also need the creation/submission date of each completed genome: the protein set I'm working with comes from Pfam 25.0, which means it was taken from a May 2010 snapshot of UniProtKB, so I can't count any genome completed after that date.
What I did to handle this, right or wrong, was use the NCBI E-utilities elink interface to map taxonomy ids to genome ids, and then the esummary interface to pull metadata for each genome id. If the create date was no later than the May 2010 snapshot and the base-pair count exceeded a given threshold (the only way I knew to distinguish a chromosome from a plasmid), I classified the genome as 'complete' and counted all protein sequences from that strain as valid. Otherwise, I threw out the sequences as coming from an 'incomplete' genome.
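To make the filtering rule concrete, here is a minimal sketch of the date-plus-size test described above. It deliberately leaves out the elink/esummary calls and starts from values already pulled out of the esummary records. The exact cutoff day, the 500,000 bp threshold, and the function names are my own placeholders for illustration, not values NCBI recommends.

```python
from datetime import date

# Assumed values: the snapshot is only known to be "May 2010", and the size
# threshold is a hypothetical placeholder, not an NCBI-defined boundary.
CUTOFF = date(2010, 5, 31)
MIN_CHROMOSOME_BP = 500_000


def is_complete(create_date, length_bp, cutoff=CUTOFF, min_bp=MIN_CHROMOSOME_BP):
    """Return True if the record predates the cutoff and is chromosome-sized."""
    return create_date <= cutoff and length_bp >= min_bp


def filter_complete(records, cutoff=CUTOFF, min_bp=MIN_CHROMOSOME_BP):
    """Return the taxonomy ids whose genome record passes both tests.

    `records` maps a taxonomy id to a (create_date, length_bp) tuple, standing
    in for the metadata obtained through elink + esummary.
    """
    return {
        taxid
        for taxid, (created, length) in records.items()
        if is_complete(created, length, cutoff, min_bp)
    }
```

The obvious weakness is the size threshold: a large plasmid or a small chromosome would be misclassified, which is part of why I am asking the questions below.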
Recently, however, NCBI overhauled their genome database (without warning, from what I can tell). It no longer maps 'strains' to genome identifiers directly, but only maps a species to a genome identifier. Moreover, the only E-utilities call that still works against the genome database is esummary, and the record it returns contains very little that is useful on a per-strain basis. Interestingly enough, the web interface to the genome database provides a much richer view: a breakdown of genomes by 'strain', the genome type (chromosomal vs. plasmid), and links to the nucleotide database sequences (RefSeq and/or INSDC accessions). For reasons beyond my understanding (unless I'm missing something), NCBI does not provide this same information in the esummary output.
I therefore emailed the NCBI help desk to ask how they recommended handling this, but only received a canned answer pointing me to bulletins announcing the genome database overhaul and listing the fields to expect in the esummary output. The email approach was of very little value.
I then called NCBI and reached someone who seemed to have some technical knowledge of the databases and interfaces. He told me that my approach was already wrong under the old database format, regardless of the workflow issues that have arisen from the new structure. As far as I could tell, he was saying that there really is no good way to tell whether a genome is 'complete'. I asked whether there was some other way to do this, and was told there isn't. I then asked whether he could recommend an external database I could use instead (TIGR, GOLD, KEGG), but got no help on that either. Essentially, I was told there is no reasonable way to accomplish this.
Thus, my questions, in order, are:
1) Is there a way to programmatically determine if a bacterial genome is essentially 'complete' before a given date?
2) If so, which single-point genome database would be the most comprehensive in providing this information?
3) If there are suggestions for ways to do this in NCBI, the only way I know of is using a "taxonomy" to "nuccore" link. If this is an option,
   a) How do I distinguish nucleotide entries that represent 'complete' genomes from those that are not 'complete', in a consistent way? I can't seem to find any NCBI annotation guidelines that guarantee that all GenBank entries will be annotated indicating a complete genome (although some entries seem to have the text 'complete genome' in their definition, others have 'complete sequence').
   b) How do I distinguish chromosomal DNA from plasmid DNA in a consistent way? I can't seem to find any NCBI annotation guidelines that guarantee that all GenBank entries will be marked 'plasmid' where applicable.
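To show what I mean in 3a and 3b, this is roughly the definition-line heuristic I would be reduced to without an annotation guarantee. It is only a sketch: the function name and the return labels are made up for illustration, and the string matching simply mirrors the 'complete genome', 'complete sequence', and 'plasmid' wording I have seen in some entries. Nothing here is backed by an NCBI guideline.

```python
def classify_definition(definition):
    """Guess the record type from a GenBank definition line.

    Returns "plasmid", "complete", or "unknown". This is plain substring
    matching, so any entry annotated with different wording falls through
    to "unknown".
    """
    text = definition.lower()
    if "plasmid" in text:
        return "plasmid"
    if "complete genome" in text or "complete sequence" in text:
        return "complete"
    return "unknown"
```

An entry whose definition says neither phrase, or a plasmid that is not labelled as one, would be silently misclassified, which is exactly the inconsistency I am worried about.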
Thank you,
AnDrew
Thanks for the hint on using the NCBI ftp site. Nearly all the information I needed was in lproks_1.txt. My only wish is that NCBI provided an archive (or revision history) of these files, as this would give me an idea of when each genome was marked 'complete'.
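For anyone following along, here is a rough sketch of how a tab-delimited table like this can be filtered by date. The column name and the date format are passed in as parameters because I am not asserting the file's actual layout here; check the header of your copy. The sketch also assumes the header is the first line of the text.

```python
import csv
import io
from datetime import date, datetime


def genomes_released_before(table_text, cutoff, date_column, date_format="%m/%d/%Y"):
    """Return rows of a tab-delimited table dated on or before `cutoff`.

    `date_column` and `date_format` are assumptions supplied by the caller.
    Rows with a missing or unparseable date are skipped, not guessed at.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    kept = []
    for row in reader:
        raw = (row.get(date_column) or "").strip()
        try:
            released = datetime.strptime(raw, date_format).date()
        except ValueError:
            continue  # no usable date, so completeness before the cutoff is unknown
        if released <= cutoff:
            kept.append(row)
    return kept
```

Usage would look like `genomes_released_before(open(path).read(), date(2010, 5, 31), "released")`, where `path`, the cutoff day, and the column name "released" are placeholders.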
Yeah, given the apparent lack (as far as I can tell) of guidelines for annotating complete genomes, I think the best I can do is rely on the complete microbial genome project page.