Number of genome sequences in NCBI
5.9 years ago
misterie ▴ 110

Hi,

Could you tell me how I can check the number of organisms for which full genome sequences are available in NCBI? I would also like to check how many mammalian genomes are available in NCBI.

Thank you in advance.

genome ncbi • 1.8k views
5.9 years ago
GenoMax 148k

Parse the files to get "complete genomes" or any other criteria you are looking for.
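
As a rough sketch of that parsing step, assuming the files in question are NCBI assembly_summary files (tab-separated, header lines starting with '#', assembly_level in column 12), something like the following could count the "Complete Genome" entries. The function name is only illustrative. Note that many eukaryotic assemblies are labeled Chromosome or Scaffold rather than Complete Genome, so the criterion may need adjusting.

```shell
# Count assemblies whose assembly_level (column 12) is "Complete Genome".
# Assumption: tab-separated assembly_summary layout, '#' header lines.
count_complete_genomes() {
    awk -F '\t' '!/^#/ && $12 == "Complete Genome" { n++ } END { print n + 0 }' "$1"
}

# Example, with a file downloaded beforehand:
# count_complete_genomes assembly_summary_genbank.txt
```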


Thank you, but how can I check the number of genomes available for mammals? In this file there is no field that I can use for filtering mammals...


Go to this page. Click to add a filter for mammals. Hit search. Looks like there are 282 at the moment (Feb 19).


RefSeq assembly_summary.txt files for broad categories such as vertebrate_mammalian are present in the corresponding directories under this path: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/ For example, the assembly_summary.txt file for vertebrate_mammalian is here: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/assembly_summary.txt
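
Once that file is downloaded, a small sketch like this could report both the number of assemblies and the number of distinct species in it. It assumes column 7 holds species_taxid and that header lines start with '#'; the function name is made up for illustration.

```shell
# Print "<assemblies> <unique species>" for an assembly_summary.txt file.
# Assumption: column 7 is species_taxid; '#' lines are headers.
summarize_assemblies() {
    awk -F '\t' '
        !/^#/ { rows++; if (!($7 in seen)) { seen[$7] = 1; species++ } }
        END   { print rows + 0, species + 0 }
    ' "$1"
}

# Example, with the vertebrate_mammalian file saved locally:
# summarize_assemblies assembly_summary.txt
```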

5.9 years ago

Use the 6th column to identify the taxonomy. Then, to process IDs by taxonomy, use taxonkit and csvtk.

Get the mammalian IDs:

taxonkit list --ids 40674 --indent "" | grep . > mammalian.ids.txt

Filter the assembly file:

wget https://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt
cat assembly_summary_genbank.txt | cut -f 1,6,8 | csvtk -t grep -f 2 -P  mammalian.ids.txt > mammalian.txt

This will give you 649 genomes, but not all have unique taxids:

cat mammalian.txt | cut -f 2 | sort | uniq -c | sort -rn | wc -l

produces:

 297
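
If csvtk is not at hand, the filtering step can probably be approximated with plain awk. This is only a sketch, not the csvtk behavior itself: it does exact matches of column 6 against the ID list and skips blank lines in that list. The function name is hypothetical, and it assumes the ID file is not empty.

```shell
# Keep rows of an assembly summary whose taxid (column 6) is in an ID list.
# $1: file with one taxid per line (blank lines are ignored; must be non-empty)
# $2: tab-separated assembly summary file
# Output: accession, taxid, organism name (columns 1, 6, 8)
filter_by_taxid() {
    awk -F '\t' '
        NR == FNR { if ($1 != "") ids[$1] = 1; next }      # first file: load IDs
        !/^#/ && ($6 in ids) { print $1 "\t" $6 "\t" $8 }  # second file: filter
    ' "$1" "$2"
}

# Example:
# filter_by_taxid mammalian.ids.txt assembly_summary_genbank.txt > mammalian.txt
```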

This is one of those code examples that initially looked easy. I knew what needed to be done, but it was a lot more frustrating to accomplish, and it warrants a bug report: taxonkit list adds an empty line to the file, which in turn matches everything in grep, so one also needs to filter out the empty lines with grep . (a typical bioinformatics gotcha).
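
The effect is easy to reproduce with ordinary grep -f, used here as a stand-in for csvtk grep -P (the two helper functions are made up for the demonstration). An empty line in a pattern file is an empty pattern, and an empty pattern matches every line.

```shell
# Count lines of a data file that match any pattern in a pattern file.
# $1: pattern file, $2: data file
count_matches() {
    grep -c -f "$1" "$2"
}

# Same count, but with empty pattern lines removed first via `grep .`
count_matches_clean() {
    clean=$(mktemp)
    grep . "$1" > "$clean"
    grep -c -f "$clean" "$2"
    rm -f "$clean"
}

# Example:
# printf '40674\n\n' > patterns.txt          # one real ID plus an empty line
# printf '40674\n9606\n10090\n' > data.txt
# count_matches patterns.txt data.txt        # counts every line
# count_matches_clean patterns.txt data.txt  # counts only the real match
```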

5.9 years ago
vkkodali_ncbi ★ 3.8k

If all you are after is a count (and you are not particularly interested in downloading the assemblies), you can do this from the NCBI Assembly web portal. This way, you won't have to download and install a couple of other programs on your machine if you don't want to.

  1. Search for the following in NCBI Assembly: mammals[Organism] AND latest_genbank[Properties]. This will return 633 assemblies. You can search for other broad categories such as 'rodents', 'plants', etc., as well.

  2. There may be multiple assemblies submitted for a single species, so you need to flatten this list to a unique set of taxa. You can do this by following the link to the Taxonomy page. On the right-hand side of the Assembly results page you will notice a 'Find related data' facet with a drop-down list of databases beneath it. From that drop-down list, choose 'Taxonomy' and click the 'Find items' button. You will be directed to a Taxonomy results page. The count there, 295, is what you are looking for.

I am not entirely sure why there is a discrepancy of 2 compared to the taxonkit/csvtk method described above; I did not download the two programs and run them myself.


Curious as to why the genomes page I linked above has only 283. One more today than yesterday. It does not match what you/Istvan see.


I have yet another one: 294

This is based on eukaryotes.txt in genome_reports instead of the assembly_reports (filtered on SubGroup for mammals and counting unique organism names)

Most likely the rates at which the files are refreshed are different.
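
For anyone wanting to repeat the eukaryotes.txt count, here is a minimal sketch. It assumes the file is tab-separated, has a single header row containing a column named SubGroup, and keeps the organism name in column 1; the function name and the exact SubGroup value to pass (for example Mammals) are assumptions to check against the actual file.

```shell
# Count unique organism names in a genome_reports table for one SubGroup.
# $1: tab-separated report (e.g. eukaryotes.txt), $2: SubGroup value
# The SubGroup column is found by header name; column 1 is assumed to be
# the organism name.
count_subgroup_organisms() {
    awk -F '\t' -v want="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "SubGroup") col = i; next }
        col && $col == want && !($1 in seen) { seen[$1] = 1; n++ }
        END { print n + 0 }
    ' "$1"
}

# Example:
# count_subgroup_organisms eukaryotes.txt Mammals
```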


The problem with NCBI interfaces is that one is never quite sure what they do behind the scenes. It is always "click this", "click that", and by the end there is no indication of whether one did it right. You end up with a number, not quite sure what happened along the way.

I passionately hate the NCBI data interfaces for the reasons I list above. More than any other factor, it is the hare-brained data models and interfaces at NCBI that fuel confusion and lack of reproducibility.
