What does 'complete genome' in NCBI include?
8.2 years ago

Hey,

I have a fairly basic question but cannot find an answer online. When an NCBI sequence is denoted as 'complete genome', what does it actually contain? Assuming we have a bacterial sequence, will it contain only the chromosomal sequence? Or does it contain both chromosomal and plasmid sequences, and thus the complete DNA found in the cell?

ncbi complete-genome • 2.5k views
8.2 years ago
5heikki 11k

Applies to all assemblies:

   *_genomic.fna.gz file
       FASTA format of the genomic sequence(s) in the assembly. Repetitive 
       sequences in eukaryotes are masked to lower-case (see below).
       The FASTA title is formatted as sequence accession.version plus 
       description. The genomic.fna.gz file includes all top-level sequences in
       the assembly (chromosomes, plasmids, organelles, unlocalized scaffolds,
       unplaced scaffolds, and any alternate loci or patch scaffolds). Scaffolds
       that are part of the chromosomes are not included because they are
       redundant with the chromosome sequences; sequences for these placed 
       scaffolds are provided under the assembly_structure directory.

ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/README.txt
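
To check what a downloaded "*_genomic.fna.gz" file actually contains, one option is to list its FASTA headers and separate plasmid records from the rest. The following is only a minimal sketch: it assumes that plasmid records mention the word "plasmid" in the description part of the header, which is a heuristic and not something the README guarantees. The function names and the accessions in the demo are made up for illustration.

```python
import gzip
import io


def fasta_headers(handle):
    """Yield (accession.version, description) for each FASTA record in a text handle."""
    for line in handle:
        if line.startswith(">"):
            accession, _, description = line[1:].strip().partition(" ")
            yield accession, description


def split_by_plasmid(headers):
    """Heuristic (assumption): a record is a plasmid if its description says 'plasmid'."""
    others, plasmids = [], []
    for accession, description in headers:
        target = plasmids if "plasmid" in description.lower() else others
        target.append(accession)
    return others, plasmids


def summarize_genomic_fna(path):
    """Open a gzipped *_genomic.fna.gz file and split its records as above."""
    with gzip.open(path, "rt") as handle:
        return split_by_plasmid(fasta_headers(handle))


if __name__ == "__main__":
    # Made-up records, used only to show the call pattern.
    demo = io.StringIO(
        ">CP000001.1 Examplebacter sp. chromosome, complete genome\nACGT\n"
        ">CP000002.1 Examplebacter sp. plasmid pEX1, complete sequence\nACGT\n"
    )
    print(split_by_plasmid(fasta_headers(demo)))
```

For a real assembly, call summarize_genomic_fna() with the path of the downloaded file; a non-empty second list suggests that plasmid sequences are included.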


Although the README says that, it may not be true (any longer?). See the related thread by @wanderingstefan: Download complete bacterial genomes and associated plasmid sequences from NCBI


Can you name at least one example where the above does not apply?


Since @wanderingstefan had posted this and another thread, I (wrongly) assumed that it was done after due diligence. On double-checking, it does look like the "genomic.fna.gz" file contains the associated plasmid sequences.


His problem was going through Entrez. I'm pretty sure nobody, even at NCBI, knows comprehensively how Entrez queries work. At least it's not fully documented anywhere.


I am a little confused here. Does the above answer also apply to complete genomes downloaded from the 'nucleotide' database at NCBI? My statement that plasmid sequences are not contained in the 'complete genome' files from the 'nucleotide' database was based on a BLAST search of some whole genomes against a BLAST database containing the sequences of all plasmids in NCBI RefSeq, followed by calculating sequence coverage for the plasmids. May I ask how you checked, @genomax2?

edit: I have to add that I terminated the analysis after around 100 random genomes, as I was unable to identify plasmids in any of them. I will check this again.
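
For reference, the coverage calculation described above could be sketched roughly as follows. This is not the script that was actually used. It assumes BLAST tabular output with the default 12 columns of `-outfmt 6` (subject ID in column 2, subject start and end in columns 9 and 10), with genomes as queries and plasmids as subjects, and it assumes the plasmid lengths are supplied separately as a dict. The function names are hypothetical.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total number of positions covered by a set of 1-based inclusive intervals."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # Gap before this interval: close the previous block, start a new one.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def plasmid_coverage(blast_lines, plasmid_lengths):
    """Fraction of each plasmid (subject) covered by at least one hit."""
    hits = defaultdict(list)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        subject = fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        # Minus-strand hits report sstart > send, so normalise the order.
        hits[subject].append((min(sstart, send), max(sstart, send)))
    return {
        subject: merged_length(intervals) / plasmid_lengths[subject]
        for subject, intervals in hits.items()
        if subject in plasmid_lengths
    }
```

Merging overlapping hits before summing keeps repeated regions from pushing the coverage above 1.0. Plasmids without any hit are simply absent from the result.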


It applies when you look up your assemblies of interest in this large file (do not open it in a browser!) and then download the "*_genomic.fna.gz" file found in the FTP directory given in column 20 of that file.
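
A rough sketch of that lookup in Python is given below. Several things here are assumptions rather than facts from this thread: that the large file is a tab-separated assembly summary whose comment lines start with "#", that column 12 holds the assembly level (for example "Complete Genome"), and that the FASTA file is named after the last component of the FTP directory plus "_genomic.fna.gz". Only the use of column 20 for the FTP directory comes from the answer above. The function name is hypothetical.

```python
import io

FTP_PATH_COLUMN = 20        # 1-based; the FTP directory, as stated in the thread
ASSEMBLY_LEVEL_COLUMN = 12  # 1-based; assumed position of the assembly level


def genomic_fna_urls(handle, level="Complete Genome"):
    """Return *_genomic.fna.gz URLs for rows whose assembly level matches `level`."""
    urls = []
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < FTP_PATH_COLUMN:
            continue  # malformed or truncated row
        if fields[ASSEMBLY_LEVEL_COLUMN - 1] != level:
            continue
        directory = fields[FTP_PATH_COLUMN - 1].rstrip("/")
        if directory in ("", "na"):
            continue  # no FTP directory recorded for this assembly
        # Assumed naming: <directory>/<last path component>_genomic.fna.gz
        basename = directory.rsplit("/", 1)[-1]
        urls.append(f"{directory}/{basename}_genomic.fna.gz")
    return urls


if __name__ == "__main__":
    # Made-up row, only to show the call pattern.
    row = ["x"] * 22
    row[ASSEMBLY_LEVEL_COLUMN - 1] = "Complete Genome"
    row[FTP_PATH_COLUMN - 1] = "ftp://example.org/genomes/all/GCA_000000000.1_ASMdemo"
    print(genomic_fna_urls(io.StringIO("# header\n" + "\t".join(row) + "\n")))
```

Reading the file line by line keeps memory use low, which matters because the summary file is large. The resulting URLs can then be passed to any downloader.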


Hey, thanks for the clarification. Yes, those files do contain all plasmid sequences. I found them as well and downloaded the suitable assemblies.
