Now that NCBI has given up assigning subspecies-level taxonomy IDs (taxids), I am revisiting our method of building genomes from the *.gbk files at ftp.ncbi.nih.gov/genomes/Bacteria. In the past I hoped that taxids would eventually allow exact matching of a sequence to its source organism, but this won't happen now.
So I'm looking for a new method. The files in the directory are organized by bioproject_id (as a suffix to one organism within the project). In this directory of completed genomes, most projects refer to one organism, but some have multiple strains of the same species, and some have multiple organisms found in the same sample.
I thought I could match on the organism and/or strain fields of the source feature (the first Feature), despite the pitfalls of matching text fields. But I see examples in which these fields differ slightly, even though they refer to the same organism,
e.g.:
Haloquadratum_walsbyi_C23_uid162019/NC_017457.gbk: /strain="DSM 16854" -- (a plasmid)
Haloquadratum_walsbyi_C23_uid162019/NC_017459.gbk: /strain="DSM 16854 = C23" -- (a chromosome)
The upshot seems to be that there is no unique key to identify source organisms. Am I wrong?
I'll probably write some code to identify these situations and manually sort them out, but ugh.
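For what it's worth, a first pass at flagging directories that need a manual look could be as simple as the sketch below. It assumes the /organism and /strain qualifiers of the source feature are the first ones of that name in each file, and it reads them with a plain regex rather than a real GenBank parser. The function names are made up for illustration.

```python
import re

def source_qualifier(gbk_text, key):
    """Return the first /key="value" qualifier in a GenBank record, or None."""
    match = re.search(r'/%s="([^"]*)"' % re.escape(key), gbk_text)
    if match is None:
        return None
    # Qualifier values can wrap across lines; collapse the wrapped whitespace.
    return " ".join(match.group(1).split())

def needs_review(texts):
    """texts: dict of filename -> GenBank text for one directory.

    Returns True if the files disagree on the (organism, strain) pair,
    which is only a hint that a human should look at the directory.
    """
    pairs = {
        (source_qualifier(text, "organism"), source_qualifier(text, "strain"))
        for text in texts.values()
    }
    return len(pairs) > 1
```

This only detects disagreement; it does not decide whether two differing strain strings mean the same organism.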
Any other suggestions to group the *.gbk files in a single directory by source organism? Ideally a solution would work for the more chaotic Bacteria_DRAFT ftp directory as well.
Thanks for that thought -- I agree the /plasmid tag can be useful. By an organism's genome, I mean all the DNA carried by an organism (plasmids, chromosomes, and prophages), so I want to match all of these gbk files to the organism (host, if you prefer) from which they were sequenced. Separately, I agree that it is sometimes unclear from the source tags which record is a chromosome (or 'complete genome').
So what is the problem then? Each folder contains only the information for a single strain.
The "DSM 16854 = C23" entry under /strain is simply saying that they're equivalent names for the same strain. C23 is the name given to the strain; DSM 16854 is an identifier given to strain C23 by the DSMZ Bacteria Collection. They both point to the same thing: https://www.dsmz.de/catalogues/details/culture/DSM-16854.html
Nope, that's what I had hoped (and then it would be simple), but folders sometimes contain multiple strains, or even multiple unrelated organisms, such as:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Vibrio_parahaemolyticus_O1_K33_CDC_K4557_uid212977
And yes, in the example it is easy enough to read the differing strain info for the chromosome and plasmid and understand that they are the same, but getting code to do that is not so easy...
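For the specific "A = B" pattern in the Haloquadratum example, one possible approach is to treat "=" inside /strain as a separator between equivalent names and merge any files that share at least one alias. This is only a sketch under that assumption: the "=" convention is not a controlled vocabulary, so other spellings will slip through, and stripping whitespace could wrongly merge strains whose names differ only by a space. The function names are hypothetical.

```python
import re
from collections import defaultdict

def strain_aliases(strain):
    """Split a /strain value such as 'DSM 16854 = C23' into normalized aliases."""
    parts = re.split(r"\s*=\s*", strain.strip())
    # Drop internal whitespace and case so 'DSM16854' and 'DSM 16854' compare equal.
    return {re.sub(r"\s+", "", part).upper() for part in parts if part}

def group_by_shared_alias(records):
    """records: dict of filename -> /strain value.

    Returns a list of sets of filenames; files end up in the same set
    when they are linked, directly or transitively, by a shared alias.
    """
    parent = {name: name for name in records}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # alias -> first filename that carried it
    for name, strain in records.items():
        for alias in strain_aliases(strain):
            if alias in first_seen:
                parent[find(name)] = find(first_seen[alias])
            else:
                first_seen[alias] = name

    groups = defaultdict(set)
    for name in records:
        groups[find(name)].add(name)
    return list(groups.values())
```

With the two files from the example, both land in one group because they share the DSM 16854 alias. Anything this does not merge would still need the manual check.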
You should be able to use taxid to handle the case of multiple species in a single directory. If the taxid for a file fails to match that of the directory species, toss it. I guess you could limit to one genome per folder. Parsing by organism name is probably more work than it's worth.
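A rough sketch of that taxid check follows. It assumes the first /db_xref="taxon:..." in each file belongs to the source feature, and, since the directory itself carries no taxid, it takes the most common taxid among the files as a stand-in for "the directory species". That stand-in is my assumption, not anything NCBI defines. Mismatching files are set aside rather than deleted, so they can still be sorted out by hand.

```python
import re
from collections import Counter

TAXON_RE = re.compile(r'/db_xref="taxon:(\d+)"')

def first_taxid(gbk_text):
    """Return the first taxon db_xref in a GenBank record as a string, or None."""
    match = TAXON_RE.search(gbk_text)
    return match.group(1) if match else None

def split_by_majority_taxid(texts):
    """texts: dict of filename -> GenBank text for one directory.

    Returns (kept, set_aside): files whose taxid equals the most common
    taxid in the directory, and everything else (including files with
    no taxid at all).
    """
    taxids = {name: first_taxid(text) for name, text in texts.items()}
    counts = Counter(t for t in taxids.values() if t is not None)
    if not counts:
        return [], sorted(taxids)
    majority, _ = counts.most_common(1)[0]
    kept = sorted(n for n, t in taxids.items() if t == majority)
    set_aside = sorted(n for n, t in taxids.items() if t != majority)
    return kept, set_aside
```

In a directory that really does hold several organisms in roughly equal numbers, "majority" is not meaningful, so grouping by taxid rather than splitting would be the safer variant.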
For handling strain naming problems there's no obvious option outside of downloading from all the databases and making a lookup table. Even this might not provide full confidence; it is clear that the strain annotations are inconsistently implemented.
I'm not sure that there is a single point to filter on that will provide you with what you need. I've never had a good experience filtering .gbk files, especially when pulling them from the bacteria ftp (it was a few years ago the last time I did). I've never been fully confident of whatever filtering kludge I worked up, and I usually end up reading through the data set in the end just to be sure. Unless you're writing software you plan to distribute, or you need frequent updates, I'd wager that you will save time by just manually checking the files.
Yeah, similar history here. One kludge after another. I find it disheartening that NCBI has given up on subspecies taxids, as there is no controlled vocabulary for these. I could toss files, but my goal is to parse all available prok genomes, and that Vibrio example may be a harbinger of what is to come -- a mix of species and strains all within one bioproject directory. I'm still hoping someone will have a solution we haven't thought of ...
JGI (http://jgi.doe.gov/) has done a much better job with sequence annotation and curation; you might want to check there instead. However, they do seem to be having some database issues currently.
Thanks. I've used them in the past, but need to be more current (want to have at least 95% of completed genomes already deposited in NCBI). I'll check there again though.