Access genomes/proteomes by BioSample ID
2.2 years ago
bvm ▴ 20

I'd like to download multiple genome assemblies or proteomes using a set of BioSample IDs from NCBI.

I can find the assemblies belonging to the BioSample IDs in a browser (via the search field at https://www.ncbi.nlm.nih.gov/), but I couldn't find a command-line solution.

E.g., for BioSample SAMN09405588 the assembly name is PDT000806148.1, and from https://www.ncbi.nlm.nih.gov/assembly/GCA_014136285.1/ I can download the proteome: GCA_014136285.1_PDT000806148.1_protein.faa.gz

Thank you for your help!

NCBI BioSample assembly • 2.0k views
2.2 years ago
GenoMax 147k

Using EntrezDirect:

$ esearch -db biosample -query SAMN09405588  | elink -target assembly | esummary | xtract -pattern DocumentSummary -element AssemblyAccession,AssemblyName,FtpPath_GenBank
GCA_014136285.1 PDT000806148.1  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/136/285/GCA_014136285.1_PDT000806148.1
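
For a batch of BioSample IDs, the rows printed by xtract can be collected into records before downloading anything. A minimal Python sketch, assuming the three columns arrive tab-delimited in the order requested above; the names `AssemblyRecord` and `parse_xtract_rows` are made up for illustration and are not part of EntrezDirect:

```python
from typing import List, NamedTuple


class AssemblyRecord(NamedTuple):
    accession: str  # e.g. GCA_014136285.1
    name: str       # e.g. PDT000806148.1
    ftp_path: str   # FtpPath_GenBank column


def parse_xtract_rows(text: str) -> List[AssemblyRecord]:
    """Parse tab-separated xtract output, skipping blank or malformed rows."""
    records = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # not an accession/name/path row
        records.append(AssemblyRecord(*(f.strip() for f in fields)))
    return records


# The row shown above, with the columns joined by tabs (assumed delimiter).
example = (
    "GCA_014136285.1\tPDT000806148.1\t"
    "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/136/285/"
    "GCA_014136285.1_PDT000806148.1\n"
)
records = parse_xtract_rows(example)
print(records[0].accession, records[0].name)
```

The accession column is the part that the download tools mentioned in this thread take as input.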

Once you have the assembly accession, I'd suggest using NCBI Datasets or a tool like Kai Blin's "ncbi-genome-download".


Hi, after you retrieve the list of accessions, you can download them using NCBI Datasets like this:

datasets download genome accession --inputfile list.txt

This command downloads a zip file with metadata and genomic sequences and, if available, protein, transcript, and GFF3 files. Feel free to reach out if you have any questions.
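
Preparing the input file could look roughly like this in Python. This is only a sketch: it assumes the file holds one accession per line, and the helper names `write_accession_list` and `datasets_command` are hypothetical. The command itself is the one quoted above, returned as an argument list rather than executed.

```python
from pathlib import Path
from typing import Iterable, List


def write_accession_list(accessions: Iterable[str], path: str = "list.txt") -> List[str]:
    """Write unique, non-empty accessions to `path`, one per line, keeping order."""
    unique = list(dict.fromkeys(a.strip() for a in accessions if a.strip()))
    Path(path).write_text("".join(a + "\n" for a in unique))
    return unique


def datasets_command(list_path: str = "list.txt") -> List[str]:
    """Argument list for the datasets call quoted above (not executed here)."""
    return ["datasets", "download", "genome", "accession", "--inputfile", list_path]


if __name__ == "__main__":
    write_accession_list(["GCA_014136285.1"])
    print(" ".join(datasets_command()))
```

If the datasets CLI is installed, the list returned by `datasets_command()` can be passed to `subprocess.run(..., check=True)`.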

2.2 years ago
bvm ▴ 20

Not the best solution, but it still gives usable results (with Python):

import requests

bs = "SAMN09405588"  # BioSample ID to look up

# Fetch the Assembly search page for this BioSample
html = requests.get("https://www.ncbi.nlm.nih.gov/assembly/?term={}".format(bs)).text
# Scrape the GenBank accession and assembly name (fragile: depends on the page markup)
gca = html.split("GenBank assembly accession: </dt><dd>")[1].split()[0]
assembly_id = html.split("<title>")[1].strip().split()[0]
# GCA_014136285.1 -> GCA/014/136/285 is the directory prefix on the FTP site
link = "https://ftp.ncbi.nlm.nih.gov/genomes/all/{0}/{1}_{2}/{1}_{2}_protein.faa.gz".format("/".join([gca[:3],gca[4:7],gca[7:10],gca[10:13]]), gca, assembly_id)
print(link)

The proteome can then be downloaded from the resulting link.

2.2 years ago
Sej Modha 5.3k

It would be a two-step process. First, extract the download URL using the eutils and then utilise that URL to fetch genomic, protein or assembly files.

Assembly-specific URLs can be extracted using:

esearch -db assembly -query "SAMN09405588"|esummary|xtract -pattern FtpSites -sep "\n" -element FtpPath |sed -n 2p

This would output: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/136/285/GCA_014136285.1_PDT000806148.1
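
The second step, turning that directory into file URLs, might be sketched in Python as follows. It assumes the usual layout in which each file is named after the directory plus a suffix, as in the `_protein.faa.gz` example from the question; the `_genomic.fna.gz` and `_genomic.gff.gz` suffixes and the helper name `assembly_file_urls` are assumptions added here, and not every assembly ships every file.

```python
from typing import Dict

# File suffixes assumed for illustration; only _protein.faa.gz appears in this thread.
SUFFIXES = {
    "genome": "_genomic.fna.gz",
    "protein": "_protein.faa.gz",
    "gff": "_genomic.gff.gz",
}


def assembly_file_urls(ftp_path: str) -> Dict[str, str]:
    """Build per-file HTTPS URLs from an assembly FTP directory path."""
    base = ftp_path.rstrip("/")
    if base.startswith("ftp://"):
        # The same tree is reachable over HTTPS on ftp.ncbi.nlm.nih.gov.
        base = "https://" + base[len("ftp://"):]
    prefix = base.rsplit("/", 1)[-1]  # e.g. GCA_014136285.1_PDT000806148.1
    return {kind: f"{base}/{prefix}{suffix}" for kind, suffix in SUFFIXES.items()}


urls = assembly_file_urls(
    "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/136/285/"
    "GCA_014136285.1_PDT000806148.1"
)
print(urls["protein"])
```

Any of the resulting URLs can then be fetched with wget, curl, or `requests`.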
