Hi,
Could you tell me how I can check the number of organisms for which full genome sequences are available at NCBI? I would also like to check how many mammalian genomes are available at NCBI.
Thank you in advance.
Parse the files to select "complete genomes" or whatever other criteria you are looking for.
Use the 6th column to identify the taxonomy id. Then, to process ids by taxonomy, use taxonkit and csvtk.
Get the mammalian ids:
taxonkit list --ids 40674 --indent "" | grep . > mammalian.ids.txt
Download and filter the assembly file:
wget https://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt
cat assembly_summary_genbank.txt | cut -f 1,6,8 | csvtk -t grep -f 2 -P mammalian.ids.txt > mammalian.txt
This will give you 649 genomes, but not all of them have unique taxids. To count the unique taxids:
cat mammalian.txt | cut -f 2 | sort | uniq -c | sort -rn | wc -l
produces:
297
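If you would rather not install csvtk, the same filter can be sketched with awk alone. This is only a minimal sketch: the two toy files below stand in for mammalian.ids.txt and assembly_summary_genbank.txt, and the accessions in them are made up. It assumes the taxid sits in column 6 and the organism name in column 8, as in the cut command above.

```shell
#!/bin/sh
# Toy stand-ins for the real files (hypothetical accessions, illustration only).
workdir=$(mktemp -d)

# Stand-in for mammalian.ids.txt: one taxid per line, including a stray empty line.
printf '9606\n10090\n\n' > "$workdir/ids.txt"

# Stand-in for assembly_summary_genbank.txt: tab-separated, '#' comment lines, taxid in column 6.
{
  printf '# comment line\n'
  printf 'GCA_A\tP1\tS1\tna\tna\t9606\t9606\tHomo sapiens\n'
  printf 'GCA_B\tP2\tS2\tna\tna\t9606\t9606\tHomo sapiens\n'
  printf 'GCA_C\tP3\tS3\tna\tna\t10090\t10090\tMus musculus\n'
  printf 'GCA_D\tP4\tS4\tna\tna\t7227\t7227\tDrosophila melanogaster\n'
} > "$workdir/summary.txt"

# First file: load the non-empty ids. Second file: skip comments and
# keep rows whose column 6 is exactly one of the loaded ids.
awk -F '\t' '
  NR == FNR   { if ($0 != "") ids[$1] = 1; next }
  /^#/        { next }
  ($6 in ids) { print $1 "\t" $6 "\t" $8 }
' "$workdir/ids.txt" "$workdir/summary.txt" > "$workdir/mammalian.txt"

assemblies=$(wc -l < "$workdir/mammalian.txt" | tr -d ' ')
unique_taxids=$(cut -f 2 "$workdir/mammalian.txt" | sort -u | wc -l | tr -d ' ')
echo "assemblies: $assemblies, unique taxids: $unique_taxids"
```

Because the lookup is an exact match on the whole field and empty ids are skipped when loading, a blank line in the ids file cannot turn the filter into a match-all.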
This is one of those code examples that looked easy at first. I knew what needed to be done, but it was a lot more frustrating to accomplish than expected, and it warrants a bug report: taxonkit list
adds an empty line to the file, which in turn matches everything in the grep step, so one also needs to filter out the empty lines with grep .
.... a typical bioinformatics gotcha
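The gotcha is easy to reproduce with plain grep, used here as a stand-in for the csvtk step (an assumption on my part that both treat an empty pattern the same way). A minimal sketch with a three-line toy input:

```shell
#!/bin/sh
workdir=$(mktemp -d)

# An ids file with a trailing empty line, and a small list of taxids to filter.
printf '9606\n\n' > "$workdir/ids_with_blank.txt"
printf '9606\n7227\n10090\n' > "$workdir/taxids.txt"

# The empty line acts as an empty pattern, which matches every input line.
unfiltered=$(grep -c -f "$workdir/ids_with_blank.txt" "$workdir/taxids.txt")

# 'grep .' keeps only lines with at least one character, dropping the empty pattern.
grep . "$workdir/ids_with_blank.txt" > "$workdir/ids_clean.txt"

# -F treats patterns as fixed strings, -x requires the whole line to match.
filtered=$(grep -c -x -F -f "$workdir/ids_clean.txt" "$workdir/taxids.txt")

echo "with blank line: $unfiltered matches, after cleaning: $filtered matches"
```

With the blank line present all three taxids pass the filter; after cleaning only 9606 does.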
If all you are after is a count (and not particularly interested in downloading the assemblies), you can do this from the NCBI Assembly web portal. This way, you won't have to bother downloading and installing a couple of other programs on your machine if you don't want to.
Search for the following in NCBI Assembly: mammals[Organism] AND latest_genbank[Properties]. This will return 633 assemblies. You can search for other broad categories such as 'rodents', 'plants', etc. as well.
There may be multiple assemblies submitted for a single species, so you need to flatten this list to a unique set of taxa. You can do this by following the link to the Taxonomy page. On the right-hand side of the Assembly results page you will notice a 'Find related data' facet with a drop-down list of databases beneath it. From that drop-down list, choose 'Taxonomy' and click the 'Find items' button. You will be directed to a Taxonomy results page. The count there, 295, is what you are looking for.
I am not entirely sure why there is a discrepancy of 2, compared to the taxonkit/csvtk method described above; I did not download the two programs and run them on my own.
I have yet another one: 294
This is based on eukaryotes.txt in genome_reports instead of the assembly_reports (filtered on SubGroup for mammals and counting unique organism names).
Most likely the files are refreshed at different rates.
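For anyone who wants to repeat the eukaryotes.txt count, here is a minimal sketch on a toy table. It assumes the header row names the columns "#Organism/Name" and "SubGroup" and that mammals are labelled "Mammals"; check the header of your copy before relying on it. Columns are looked up by name so the position does not have to be hard-coded.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Toy stand-in for eukaryotes.txt with only the columns needed here.
{
  printf '#Organism/Name\tTaxID\tGroup\tSubGroup\n'
  printf 'Homo sapiens\t9606\tAnimals\tMammals\n'
  printf 'Homo sapiens\t9606\tAnimals\tMammals\n'
  printf 'Mus musculus\t10090\tAnimals\tMammals\n'
  printf 'Danio rerio\t7955\tAnimals\tFishes\n'
} > "$workdir/eukaryotes.txt"

# Line 1: map column names to positions. Other lines: print the organism
# name when the SubGroup field equals "Mammals". Then count distinct names.
mammal_names=$(awk -F '\t' '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  $(col["SubGroup"]) == "Mammals" { print $(col["#Organism/Name"]) }
' "$workdir/eukaryotes.txt" | sort -u | wc -l | tr -d ' ')

echo "unique mammal organism names: $mammal_names"
```

Note that this counts unique organism names rather than unique taxids, which is one more reason the number can differ from the assembly summary count.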
The problem with NCBI interfaces is that one is never quite sure what they do behind the scenes. It is always "click this", "click that", and by the end there is no indication of whether one did it right. You end up with a number, not quite sure what happened along the way.
I passionately hate the NCBI data interfaces for the reasons I list above. More than any other factor, it is the hare-brained data models and interfaces at NCBI that fuel confusion and lack of reproducibility.
Thank you, but how can I check the number of genomes available for mammals? In this file there is no field that I can use to filter for mammals...
Go to this page. Click to add a filter for mammals. Hit search. Looks like there are 282 at the moment (Feb 19).
RefSeq assembly_summary.txt files for broad categories such as vertebrate_mammalian are present in the corresponding directories under this path: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/ For example, the assembly_summary.txt file for vertebrate_mammalian is here: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/assembly_summary.txt
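Once you have one of these per-category files, counting is a one-liner. Below is a minimal sketch on a toy file; it assumes the last '#' line is the header and that there is a column named assembly_level, and the accessions are made up. It counts the data rows and breaks them down by assembly level, which helps if you only want "complete" or chromosome-level genomes.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Toy stand-in for a per-category assembly_summary.txt (reduced set of columns).
{
  printf '#   See the README for a description of the columns\n'
  printf '# assembly_accession\ttaxid\torganism_name\tassembly_level\n'
  printf 'GCF_A\t9606\tHomo sapiens\tChromosome\n'
  printf 'GCF_B\t10090\tMus musculus\tChromosome\n'
  printf 'GCF_C\t9913\tBos taurus\tScaffold\n'
} > "$workdir/assembly_summary.txt"

# Total number of assemblies: every line that is not a comment.
total=$(grep -vc '^#' "$workdir/assembly_summary.txt")

# Breakdown by assembly level. Column positions are taken from the last
# comment line seen before the data, after stripping the leading "# ".
levels=$(awk -F '\t' '
  /^#/ {
    line = $0; sub(/^#[ ]*/, "", line)
    n = split(line, names, "\t")
    for (i = 1; i <= n; i++) col[names[i]] = i
    next
  }
  { count[$(col["assembly_level"])]++ }
  END { for (k in count) print k "\t" count[k] }
' "$workdir/assembly_summary.txt" | sort)

echo "total assemblies: $total"
echo "$levels"
```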