NCBI Datasets, the new set of services for downloading genome assembly and annotation data, has redesigned and reorganized web pages to make it easier to find and access the services and documentation you need.
NCBI Datasets has a fresh new homepage highlighting the types of data available through our tools. Available data include genome assemblies, genes, and SARS-CoV-2 genomic and protein data. You can easily access these from the new page or learn more with our new documentation pages.
Our new NCBI Datasets documentation will help you get answers faster. If you are new to Datasets, try our Quickstarts to get started with our web pages and tools. How-tos describe common workflows and data requests and provide multiple solutions: our web pages, command-line tools, and Python and R packages.
For example, if you need to download human genome data, including sequence, annotation and metadata, see the Download genome data How-to guide to get data using the Genomes web page, the datasets command-line tool, Python, or R.
See the full blog post on NCBI Insights.
Hi Istvan,
Thanks for your feedback.
The datasets command-line tool is a work in progress and we will carefully consider your comments as we continue to develop the tool.
One of the main goals of the NCBI Datasets project is to get feedback from the community that will help us improve our tools.
We have interviewed dozens of users throughout the course of development and many of our design decisions have been informed by the feedback that we have received.
We continue to welcome all feedback and we're happy to see the community discuss our tools on Biostars.
We also encourage users to contact us directly with any suggestions or questions by email at info@ncbi.nlm.nih.gov.
Thanks, NCBI Datasets Team
My apologies if I came across as a little antagonistic - I feel frustrated seeing a good idea take the wrong turn.
I will say this is not about interviewing a few people with various backgrounds - as NCBI you are building a tool for the entire world, and that should not work based on local opinions. It is about following the standards when it comes to command-line interfaces. There is no reason to start to invent "new" methods, especially not free text-like interfaces. Instead of interviewing people, I would recommend looking at how most bioinformatics tools work:
`bwa`, `bedtools`: each has subcommands, each one is self-documenting, and each one has well-defined named parameters rather than positional words, and it tells you how it works. There is a reason these tools look like that, long honed during usage.
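The subcommand pattern described above can be sketched with Python's standard `argparse` module. This is only an illustration of the convention, not the interface of `bwa`, `bedtools`, or `datasets`: the program name `mytool`, the subcommands `fetch` and `info`, their flags, and the accession `ACC0001` are all hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical tool that uses subcommands."""
    parser = argparse.ArgumentParser(
        prog="mytool",  # hypothetical name, not a real program
        description="Example of a subcommand-style command-line interface.",
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Each subcommand carries its own help text and its own named parameters,
    # so `mytool fetch --help` documents exactly what `fetch` accepts.
    fetch = subparsers.add_parser("fetch", help="report where a record can be found")
    fetch.add_argument("--accession", required=True, help="accession to look up")
    fetch.add_argument(
        "--format",
        default="fasta",
        choices=["fasta", "gff3"],
        help="file format (default: %(default)s)",
    )

    info = subparsers.add_parser("info", help="show metadata for a record")
    info.add_argument("--accession", required=True, help="accession to describe")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["fetch", "--accession", "ACC0001", "--format", "gff3"])
print(args.command, args.accession, args.format)  # prints: fetch ACC0001 gff3
```

Because every parameter is named, the order of the flags does not matter, and running the tool with `--help` at either level prints usage text generated from the same definitions.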
I do recognize that designing APIs is very hard, especially when it comes to such a gigantic data repository that you already have. But I strongly urge you to re-evaluate what you are doing now. You are designing for local minima, instead of a simple, logical and coherent data model.
Take for example your SARS-CoV-2 viral package. Here is how it works:
The command above will download a blob file. How is that any better than `rsync`-ing a prebuilt file like so?

It is not! Not only is it not better, your method is inferior to `rsync`. `rsync` can do differential transfer even on single files. If nothing has changed or just one file was added to the gzip, it will transfer only that.

When using `datasets` we have to download the same blob file over and over again. Gigabytes of unnecessary transfer take place each time I want to get the most up-to-date information. Even if just one more genome is added, we have to go download the ever-increasing data ... I see this as an impossible race.

`datasets` should be a tool that tells us where the file that we need is, not a tool to actually download it. There are countless efficient ways to transfer large files of various kinds; the bottleneck is that we don't know what to download.
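As a rough illustration of the "tell us what to download" idea, here is a minimal Python sketch that compares a remote manifest against a local one and reports only the files that are new or changed. The manifest format (file name mapped to a checksum string), the function name `files_to_fetch`, and the file names are all assumptions made up for this example; nothing here describes an existing NCBI service or `datasets` feature.

```python
from typing import Dict, List


def files_to_fetch(remote_manifest: Dict[str, str], local_manifest: Dict[str, str]) -> List[str]:
    """Return the names of files that are new or whose checksum changed.

    Both manifests map a file name to a checksum string. The format is
    hypothetical and chosen only to keep the example short.
    """
    return sorted(
        name
        for name, checksum in remote_manifest.items()
        if local_manifest.get(name) != checksum  # missing locally or different
    )


# Made-up data for illustration only.
local = {"genome_a.fa": "c1", "genome_b.fa": "c2"}
remote = {"genome_a.fa": "c1", "genome_b.fa": "c2-updated", "genome_c.fa": "c3"}

print(files_to_fetch(remote, local))  # prints: ['genome_b.fa', 'genome_c.fa']
```

With a list like this in hand, the transfer itself could be left to whichever tool suits the user, and an unchanged file would never be downloaded twice.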