Question

What is the unique identifier for an organism in an NCBI gbff file?

7.0 years ago

Naren ▴ 1000

Dear all, I wish you a Happy New Year 2018.

I am trying to parse NCBI gbff files for genomes downloaded from NCBI. I am especially interested in the coding sequences of one organism at a time, irrespective of plasmids or multiple chromosomes, with a single standard identifier in the CDS FASTA headers. Because the RefSeq accession differs between the chromosomes and plasmids of the same organism, I cannot use it as the identifier for one organism. (The gbff GenBank file now contains everything, i.e. all chromosomes and plasmids, in one file.)
I have 3 types of unique IDs for a bacterium, Escherichia coli strain UCD_JA03:
BioProject: PRJNA224116
Assembly: GCF_000599725.1
BioSample: SAMN02650859
Which is the best identifier to move forward with: one that has no chance of redundancy between bacterial strains and is shared by the chromosomes and plasmids of the same strain?
(Although I can generate an identifier programmatically, I want to stick to a standard identifier for clarity.)

genome sequence • 2.5k views

ADD COMMENT • link updated 7.0 years ago by Michael 55k • written 7.0 years ago by Naren ▴ 1000

How about the NCBI taxonomy id?

ADD REPLY • link 7.0 years ago by Michael 55k

score 0 · Answer 1 · 2018-01-01

7.0 years ago

Michael 55k

Astonishingly, none of the above ;) I would use the Assembly accession, but note that there is an updated version, GCF_000599725.2. You can then also put the species and strain into the FASTA header as a further reference.
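A minimal sketch of such a header, assuming a pipe-delimited layout. The function name `cds_fasta_header`, the field order, and the locus tag `LOCUS_0001` are all hypothetical choices for illustration, not an NCBI convention:

```python
def cds_fasta_header(assembly, locus_tag, organism, strain=None):
    """Build a FASTA header that starts with the assembly accession.

    The layout (accession|locus_tag, then a bracketed description) is an
    illustrative assumption, not an NCBI standard.
    """
    description = organism if strain is None else f"{organism} strain {strain}"
    # The identifier part contains no spaces, so most tools will treat
    # everything up to the first space as the sequence ID.
    return f">{assembly}|{locus_tag} [{description}]"


# LOCUS_0001 is a made-up locus tag used only for this example.
print(cds_fasta_header("GCF_000599725.1", "LOCUS_0001",
                       "Escherichia coli", "UCD_JA03"))
```

Keeping the accession first means every CDS from any chromosome or plasmid of the same assembly shares the same prefix.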

Escherichia coli strain UCD_JA03

Not found in the NCBI taxonomy.

BioProject: PRJNA224116

Is a multi-species project and won't help to distinguish species or strains.

Assembly: GCF_000599725.1

Most unique and specific, but there is an update.

BioSample: SAMN02650859

Does not indicate the assembly version.
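To pull these identifiers out of a gbff file, something like the sketch below might work. It assumes each record carries a DBLINK block listing BioProject, BioSample and Assembly, as RefSeq gbff records typically do. The helper name `dblink_ids` and the abbreviated example record (including its LOCUS line) are made up for illustration:

```python
def dblink_ids(gbff_text):
    """Return the DBLINK cross-references of the first record as a dict."""
    ids = {}
    in_dblink = False
    for line in gbff_text.splitlines():
        if line.startswith("DBLINK"):
            in_dblink = True
            entry = line[len("DBLINK"):]
        elif in_dblink and line.startswith(" "):
            entry = line  # indented continuation line of the DBLINK block
        elif in_dblink:
            break  # the next top-level keyword ends the block
        else:
            continue
        key, sep, value = entry.partition(":")
        if sep:  # lines without "key: value" (wrapped values) are skipped
            ids[key.strip()] = value.strip()
    return ids


# Abbreviated, hypothetical record header; real files contain far more.
example = """\
LOCUS       EXAMPLE_1            100 bp    DNA     circular CON 01-JAN-2018
DBLINK      BioProject: PRJNA224116
            BioSample: SAMN02650859
            Assembly: GCF_000599725.1
KEYWORDS    RefSeq.
"""
print(dblink_ids(example)["Assembly"])
```

Only the first record is read here, on the assumption that all chromosomes and plasmids in one gbff file report the same Assembly accession.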

ADD COMMENT • link 7.0 years ago by Michael 55k

Using assembly accession numbers may be the best option, though they are not single sequence records. The NCBI Assembly database, which provides stable accessioning and data tracking for genome assembly data, points back to that number. There are some other IDs that may be searchable through eUtils (IDs: 569591 [UID] 2551208 [GenBank] 2599488 [RefSeq]).

ADD REPLY • link 6.9 years ago by GenoMax 148k

Thanks for your insights. I just want an identifier that is unique to one gbff file (shared by all its chromosomes and plasmids) and not shared with a different strain of the organism of interest. Different assembly versions of the same strain may not be a problem, as I will find orthologous clusters. What would you suggest, considering this?
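If the assembly version really does not matter, one possible approach is to drop the version suffix so that GCF_000599725.1 and GCF_000599725.2 map to the same key. This is only a sketch: the function names `assembly_key` and `group_by_assembly` are hypothetical, and it assumes that versions of one assembly differ only in the part after the dot:

```python
from collections import defaultdict


def assembly_key(accession):
    """Drop the version suffix: 'GCF_000599725.2' -> 'GCF_000599725'."""
    return accession.split(".", 1)[0]


def group_by_assembly(accessions):
    """Group versioned assembly accessions that share an unversioned key."""
    groups = defaultdict(list)
    for acc in accessions:
        groups[assembly_key(acc)].append(acc)
    return dict(groups)


print(group_by_assembly(["GCF_000599725.1", "GCF_000599725.2"]))
```

Note that this groups versions of the same assembly only; a strain that was assembled more than once under unrelated accessions would still end up in separate groups.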

ADD REPLY • link 6.9 years ago by Naren ▴ 1000