Question

Number of sequences in RefSeq.

2.5 years ago

SergFly ▴ 50

Dear colleagues, I can't figure this out. When I download all the genomic sequences from the RefSeq database and count them, I get far fewer records than the release notes report (123394 organisms, https://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/RefSeq-release214.txt). What am I doing wrong?

1. wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt
2. awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary_refseq.txt > ftpdirpaths
3. awk 'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' ftpdirpaths > ftpfilepaths
4. wget -i ftpfilepaths
5. gunzip *.gz
6. cat *.fna > /media/sf_G_DRIVE/DataBase/RefSeq_All_2022/Refseq_214.fasta

grep -c '^>' Refseq_214.fasta
79560

score 3 · Accepted Answer · 2022-10-04

RefSeq classifies each genome into one of the following assembly level categories: Complete Genome, Chromosome, Scaffold, Contig. Because your code downloads only complete genomes, the number of downloaded sequences is smaller than the number provided in the RefSeq statistics. Of note, the majority of RefSeq genomes (~80%) are assembled at the scaffold and contig levels.
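To see how your own copy of the summary file splits across assembly levels, a tally along these lines may help. It is only a sketch: `count_by_level` is a hypothetical helper name, and the column numbers ($11 for version status, $12 for assembly level) are taken from the awk command in the question.

```shell
# Tally "latest" assemblies by assembly level.
# Columns as used in the question: $11 = version status, $12 = assembly level.
count_by_level() {
    awk -F '\t' '!/^#/ && $11 == "latest" { n[$12]++ }
                 END { for (level in n) print n[level] "\t" level }' "$1" |
        sort -rn    # largest category first
}

# Usage, assuming the summary file from step 1 is in the current directory:
# count_by_level assembly_summary_refseq.txt
```

The "Complete Genome" row of that output is the number of assemblies your step 2 kept; the other rows are the ones it filtered out.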

For downloading RefSeq genomes I recommend genome_updater. It is a bash script that downloads genomes from RefSeq or GenBank with many filters (e.g., by assembly level or taxonomic unit). The script tracks changes (it downloads only genomes updated since your last download), supports multithreading, and checks file integrity.
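If you would rather stay with plain awk, steps 2 and 3 from the question can be folded into one command that keeps every assembly level. This is a rough sketch, not a replacement for genome_updater: `list_genomic_urls` is a hypothetical name, the column numbers come from the question, and the `"na"` guard assumes that is how a missing FTP path is written in the summary file.

```shell
# Print one genomic FASTA URL per "latest" assembly, for every assembly level.
# Columns as used in the question: $11 = version status, $20 = FTP directory.
list_genomic_urls() {
    awk -F '\t' '!/^#/ && $11 == "latest" && $20 != "na" {
        n = split($20, part, "/")    # last path component is the assembly name
        print $20 "/" part[n] "_genomic.fna.gz"
    }' "$1"
}

# Usage:
# list_genomic_urls assembly_summary_refseq.txt > ftpfilepaths
# wget -i ftpfilepaths
```

Expect a far larger download than the complete-genome subset, since scaffold- and contig-level assemblies make up most of RefSeq.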

score 2 · Accepted Answer · 2022-10-04

The release statistics also cover RefSeq sequences that are not part of any RefSeq genome assembly, so a lower count on your side is entirely expected.

From the release notes:

2.2 Molecule Types Included

The RefSeq release includes genomic, transcript, and protein sequence data; however, these molecule types are not provided for all organisms and the sequences provided may not be complete or comprehensive for some species.

Transcript RefSeq records may represent protein-coding transcripts or non-coding RNA products; these records are currently only provided for eukaryotic species.

Genomic RefSeq records are provided when a sufficient quantity of genomic sequence data is available in DDBJ/EMBL/GenBank. Transcript and protein records may be provided for a species before genomic sequence data is available.
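Note also that the figure quoted in the question counts organisms, while `grep -c` counts FASTA records, so the two numbers measure different things. If you want a number comparable to the release statistics, counting distinct taxonomy IDs across all molecule types is closer in spirit. The sketch below assumes you have a gzipped, tab-delimited catalog of release records whose first column is the taxonomy ID; both that layout and the helper name `count_taxids` are assumptions to check against the release notes.

```shell
# Count distinct values in column 1 (assumed: taxonomy ID) of a gzipped,
# tab-delimited catalog file.
count_taxids() {
    gzip -dc "$1" | cut -f 1 | sort -u | wc -l
}

# Usage, with a hypothetical local file name:
# count_taxids catalog.gz
```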