I'd like to download multiple genome assemblies or proteomes using a set of BioSample IDs from NCBI.
I'm able to find the assemblies belonging to the BioSample IDs in a browser (via the search field at https://www.ncbi.nlm.nih.gov/), but couldn't find a command-line solution.
E.g., for BioSample SAMN09405588 the assembly id is PDT000806148.1, and from https://www.ncbi.nlm.nih.gov/assembly/GCA_014136285.1/ I can download the proteome: GCA_014136285.1_PDT000806148.1_protein.faa.gz.
Thank you for your help!
It would be a two-step process. First, look up each assembly's download URL with the NCBI E-utilities, then use that URL to fetch the genomic, protein or other assembly files.
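A minimal sketch of those two steps, assuming the Entrez Direct tools (`esearch`, `elink`, `esummary`, `xtract`) are installed. The function names and `biosamples.txt` are hypothetical, and the file naming simply follows the pattern of the example in the question. Not every GenBank assembly has a protein file, so some downloads may fail.

```shell
# Step 1: look up the GenBank FTP directory of the assembly linked to a BioSample ID.
# Assumes Entrez Direct is on PATH; needs network access when called.
biosample_to_ftp() {
    esearch -db biosample -query "${1}[Accession]" |
        elink -target assembly |
        esummary |
        xtract -pattern DocumentSummary -element FtpPath_GenBank
}

# Step 2: build the protein FASTA URL from that directory path.
# Files in the directory are named <directory name>_protein.faa.gz, as in the question.
protein_url() (
    dir=$(printf '%s\n' "$1" | sed -e 's#^ftp://#https://#' -e 's#/*$##')
    printf '%s/%s_protein.faa.gz\n' "$dir" "$(basename "$dir")"
)

# Example loop over a file with one BioSample ID per line (hypothetical file name):
# while read -r id; do
#     curl -fsSLO "$(protein_url "$(biosample_to_ftp "$id")")"
# done < biosamples.txt
```

Replacing `_protein.faa.gz` with `_genomic.fna.gz` in `protein_url` should give the genome sequence instead.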
Hi, after you retrieve the list of assembly accessions (e.g. GCA_014136285.1), you can download them using NCBI Datasets like this:
datasets download genome accession --inputfile list.txt
This command will download a zip file with metadata and genomic sequences and (if available), protein, transcript and GFF3 files. Feel free to reach out if you have any questions.
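As a rough sketch of preparing the input file: the helper below (a hypothetical name) keeps only well-formed GCA_/GCF_ accessions, one per line, which is the format `--inputfile` expects. The commented commands assume a recent version of the `datasets` CLI, where, as far as I know, protein files have to be requested explicitly with `--include protein`. The file names are placeholders.

```shell
# Extract assembly accessions (GCA_/GCF_ followed by 9 digits and a version)
# from arbitrary text on stdin, one per line, deduplicated.
clean_accessions() {
    grep -Eo 'GC[AF]_[0-9]{9}\.[0-9]+' | sort -u
}

# Uncomment to run (needs the datasets CLI and network access):
# clean_accessions < raw_accessions.txt > list.txt
# datasets download genome accession --inputfile list.txt --include protein --filename proteomes.zip
# unzip proteomes.zip -d proteomes
```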