Dear all, wishing you a happy new year 2018.
I am trying to parse GenBank flat files (gbff) for genomes downloaded from NCBI. I am interested in the coding sequences of one organism at a time, regardless of whether they sit on a plasmid or on one of several chromosomes, and I want a single standard identifier in the CDS FASTA headers.
Because the RefSeq accession differs between the chromosomes and plasmids of the same organism, I cannot use it as an identifier for the organism as a whole. (The gbff file now contains everything, i.e. all chromosomes and plasmids, in one file.)
I have three types of unique IDs for one bacterium, Escherichia coli strain UCD_JA03:
BioProject: PRJNA224116
Assembly: GCF_000599725.1
BioSample: SAMN02650859
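For context, as far as I can tell these three IDs are listed in the DBLINK block of every record in the gbff file, so any of them could be read while parsing. Below is a minimal sketch in plain Python of how that might look. It assumes the DBLINK layout shown in the sample string; the LOCUS names are placeholders, and dblink_ids is a hypothetical helper of my own, not part of any library.

```python
import re


def dblink_ids(gbff_text):
    """Return one dict of DBLINK cross-references per record.

    Assumes records are terminated by a line containing only '//'
    and that DBLINK continuation lines start with whitespace.
    """
    records = []
    for record in gbff_text.split("\n//"):
        if not record.strip():
            continue  # skip the empty piece after the last terminator
        ids = {}
        in_dblink = False
        for line in record.splitlines():
            if line.startswith("DBLINK"):
                in_dblink = True
                line = line[len("DBLINK"):]
            elif in_dblink and line.startswith(" "):
                pass  # continuation line of the DBLINK block
            else:
                in_dblink = False
            if in_dblink:
                match = re.match(r"\s*([^:]+):\s*(\S+)", line)
                if match:
                    ids[match.group(1)] = match.group(2)
        records.append(ids)
    return records


# Two stripped-down records (placeholder LOCUS names) for one strain.
sample = """LOCUS       CHROMOSOME_PLACEHOLDER
DBLINK      BioProject: PRJNA224116
            BioSample: SAMN02650859
            Assembly: GCF_000599725.1
KEYWORDS    .
//
LOCUS       PLASMID_PLACEHOLDER
DBLINK      BioProject: PRJNA224116
            BioSample: SAMN02650859
            Assembly: GCF_000599725.1
KEYWORDS    .
//
"""

for ids in dblink_ids(sample):
    print(ids)
```

In this sketch the chromosome and the plasmid record carry the same three values, which is why I am considering them as candidates for a per-strain identifier.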
What is the best identifier to move forward with: one that is shared by the chromosomes and plasmids of the same strain, with no chance of being duplicated across different bacterial strains?
(Although I can generate an identifier programmatically, I want to stick to a standard identifier for clarity.)
How about the NCBI Taxonomy ID?