Question

How would I use BioPython to mass download some assemblies from NCBI?

0

4.3 years ago

Tom ▴ 20

So here's the situation. I have a spreadsheet of a few genome assemblies I need to pull from NCBI. I have the accession numbers for them, like "GCF_003031525", in a row (that accession number leads to https://www.ncbi.nlm.nih.gov/assembly/GCF_003031525.1/).

I just need to download a bunch of assemblies (a few dozen), changing only the assembly accession each time, so that I end up with all of them on my drive.

I hear BioPython can access NCBI and do this. I was wondering how to set this up, or whether anyone has already automated something like this for a list of assemblies.

ncbi biopython • 2.1k views

ADD COMMENT • link updated 4.3 years ago by GenoMax 148k • written 4.3 years ago by Tom ▴ 20

score 0 · Answer 1 · 2020-09-28

0

4.3 years ago

JC 13k

Entrez tools can be used to avoid coding.

ADD COMMENT • link 4.3 years ago by JC 13k

1

It would look like this with entrez direct:

esearch -db assembly -query GCA_003031525 | elink -target nuccore | efetch -format fasta > out.fa

ADD REPLY • link 4.3 years ago by Istvan Albert 102k

0

If you specifically want to incorporate this into a (Bio)Python script, Biopython has a submodule for Entrez. The syntax is very similar.

http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec139
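As a rough illustration of what that could look like, the sketch below runs the same kind of assembly search with `Bio.Entrez`, then reads the assembly's document summary to find its FTP directory and downloads the genome FASTA from there. It assumes Biopython is installed, that the summary exposes the `FtpPath_RefSeq` and `FtpPath_GenBank` fields, and that the FASTA file is named `<directory name>_genomic.fna.gz`. The function names and the e-mail address are placeholders, not part of Biopython.

```python
import urllib.request


def genomic_fasta_url(ftp_path):
    """Build the URL of the *_genomic.fna.gz file inside an assembly directory."""
    base = ftp_path.rstrip("/")
    # The summary reports ftp:// paths; the same tree is also served over https.
    if base.startswith("ftp://"):
        base = "https://" + base[len("ftp://"):]
    name = base.rsplit("/", 1)[-1]
    return f"{base}/{name}_genomic.fna.gz"


def assembly_ftp_path(accession, email):
    """Look up the FTP directory of one assembly (hypothetical helper)."""
    # Imported here so the URL helper above works even without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to identify themselves

    handle = Entrez.esearch(db="assembly", term=accession)
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not ids:
        raise ValueError(f"no assembly found for {accession!r}")

    handle = Entrez.esummary(db="assembly", id=ids[0], report="full")
    summary = Entrez.read(handle, validate=False)
    handle.close()

    doc = summary["DocumentSummarySet"]["DocumentSummary"][0]
    # Assumption: GCF accessions live under the RefSeq path, GCA under GenBank.
    key = "FtpPath_RefSeq" if accession.startswith("GCF") else "FtpPath_GenBank"
    return str(doc[key])


def download_assemblies(accessions, email):
    """Download the genomic FASTA for each accession into the current directory."""
    for accession in accessions:
        url = genomic_fasta_url(assembly_ftp_path(accession, email))
        urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])
```

Calling `download_assemblies(["GCF_003031525"], "you@example.org")` with the accessions from the spreadsheet would then fetch one file per assembly. For a few dozen accessions, a short pause between requests is probably polite.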

ADD REPLY • link 4.3 years ago by Joe 22k

score 0 · Answer 2 · 2020-09-28

No need to use Biopython.

To mass download assemblies you can use the FTP site (note the links to the FTP in the right-hand sidebar of the assembly page) and tools such as wget or curl, from locations such as:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/031/525/GCF_003031525.1_Neophocaena_asiaeorientalis_V1/

Alternatively, there are also scripts to streamline the process:

https://github.com/kblin/ncbi-genome-download
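The FTP location shown above follows a regular layout: the accession prefix, then the nine digits split into three groups of three. A minimal sketch of that mapping, assuming the layout holds for other accessions as it does for this one (the function name and `FTP_ROOT` constant are made up for illustration):

```python
import re

FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def assembly_parent_dir(accession):
    """Return the FTP directory that contains the versioned assembly folders."""
    match = re.fullmatch(r"(GC[AF])_(\d{9})(?:\.\d+)?", accession.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, digits = match.groups()
    # 003031525 -> ["003", "031", "525"]
    triplets = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join([FTP_ROOT, prefix, *triplets]) + "/"


print(assembly_parent_dir("GCF_003031525.1"))
```

The final directory name also includes the assembly name (here `Neophocaena_asiaeorientalis_V1`), which cannot be derived from the accession alone, so one would still need to list this parent directory or look the name up.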

score 0 · Answer 3 · 2020-09-28

0

4.3 years ago

GenoMax 148k

Since non-Python solutions have been mentioned, consider NCBI datasets. It is a command-line tool for downloading genomic data from NCBI.

Note: It is limited to eukaryotic genomes via the web interface and is subject to the other constraints mentioned at the link posted by @Istvan in the comment below.

ADD COMMENT • link 4.3 years ago by GenoMax 148k

0

I have been looking at datasets. I am not too pleased with it so far; it feels like a half-baked solution that has no champion.

It is not documented properly beyond a few examples. In addition, the command line interface is a bit verbose and rudimentary. But this post also demonstrates my biggest gripe with it.

Let's check what happens for the accession number that the original poster needs:

   datasets download assembly GCF_003031525

it prints:

Some of the accessions provided ('GCF_003031525') are invalid NCBI Assembly Accessions.

See https://www.ncbi.nlm.nih.gov/datasets/docs/which-genomes-are-in-datasets/ for more information.

ok, let's go to the website. Here is the first message there:

NCBI Datasets has been designed to give scientists the data that they want--which means we are leaving out some of the data that we think most users won't need.

So NCBI thinks you should not need that accession above, so they won't even bother including it. Let's be serious now: what kind of scientist studies GCF_003031525 anyway?

Here is a command-line tool that will not give you all information because ... seemingly they don't want to bother with things that are not popular.

ADD REPLY • link 4.3 years ago by Istvan Albert 102k

0

There is an explanation of what is excluded at the link you included above. So they are not doing this without telling users. They also tell users where the excluded genomes can be found.

This is one additional tool like the others mentioned in this thread. It comes with its own limitations, a major one being access to only eukaryotic genomes via the web interface.

ADD REPLY • link 4.3 years ago by GenoMax 148k

0

What I find super irritating is that the error message says: invalid NCBI Assembly Accessions.

Are these really invalid NCBI Assembly Accessions, or are they valid and NCBI simply chose not to include them?

We don't know; one needs to search NCBI manually.

One should not need to copy-paste links from an error message in a terminal, then visit NCBI and search, just to figure out whether the accession is actually valid and whether some data was deliberately left out because "NCBI Datasets has been designed to give scientists the data that they want".

Perhaps it is the wording of that help message that ticks me off most.

ADD REPLY • link 4.3 years ago by Istvan Albert 102k

1

EDIT: Curiously using the fully qualified accession number (with version) works fine, so that error message is not appropriate (accession number per se is not invalid):

$ ./datasets download assembly GCF_003031525.1
Downloading: ncbi_dataset.zip    836kB 1.12MB/s

So someone must be doing an over-zealous/literal check for matches (perhaps the thinking here is that you will identify a specific accession and then use it for downloads, who knows).

There is a way to send feedback:

We welcome feedback from the community. Please send any questions, comments or ideas to info@ncbi.nlm.nih.gov

ADD REPLY • link 4.3 years ago by GenoMax 148k

0

Nice job tracking that down; it looks like it does work in the end.

Usually one would not add the version, to ensure they get the latest build ... I guess here it really wants it
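Since the tool apparently wants the versioned form, it may help to sort a list of accessions up front into those that already carry a version and those that do not. The sketch below does only that; the pattern for what counts as an assembly accession (GCA or GCF, an underscore, nine digits, optional `.N`) is an assumption based on the accessions in this thread, and `split_by_version` is a made-up name.

```python
import re

ACCESSION = re.compile(r"(GC[AF]_\d{9})(\.\d+)?")


def split_by_version(accessions):
    """Sort accessions into (versioned, unversioned, unrecognised) lists."""
    versioned, unversioned, unrecognised = [], [], []
    for raw in accessions:
        accession = raw.strip()
        match = ACCESSION.fullmatch(accession)
        if match is None:
            unrecognised.append(accession)
        elif match.group(2):
            versioned.append(accession)
        else:
            unversioned.append(accession)
    return versioned, unversioned, unrecognised


versioned, unversioned, unrecognised = split_by_version(
    ["GCF_003031525.1", "GCF_003031525", "not an accession"]
)
print("ready to download:", versioned)
print("need a version looked up:", unversioned)
print("check by hand:", unrecognised)
```

The unversioned ones would still need their current version looked up, for example from the assembly-descriptors output or an Entrez search.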

ADD REPLY • link 4.3 years ago by Istvan Albert 102k

0

$ ./datasets assembly-descriptors taxon "Neophocaena asiaeorientalis"
{"assemblies":[{"assembly":{"annotation_metadata":{"file":[{"estimated_size":"13363717","type":"GENOME_GFF"},{"estimated_size":"966592048","type":"GENOME_GBFF"},{"estimated_size":"24351233","type":"RNA_FASTA"},{"estimated_size":"7796429","type":"PROT_FASTA"}],"name":"NCBI Annotation Release 100","release_date":"Apr 12, 2018","release_number":"100","report_url":"https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Neophocaena_asiaeorientalis_asiaeorientalis/100/","source":"NCBI"},"assembly_accession":"GCF_003031525.1","assembly_category":"representative genome","assembly_level":"Scaffold","chromosomes":["Un","MT"],"contig_n50":86003,"display_name":"Neophocaena_asiaeorientalis_V1","estimated_size":"1672080519","org":{"assembly_counts":{"node":2,"subtree":2},"breed":"wild","common_name":"Yangtze finless porpoise","key":"1706337","parent_tax_id":"189058","rank":"SUBSPECIES","sci_name":"Neophocaena asiaeorientalis asiaeorientalis","sex":"male","tax_id":"1706337","title":"Yangtze finless porpoise"},"seq_length":"2284611699","submission_date":"2018-04-03"}},{"assembly":{"annotation_metadata":{},"assembly_accession":"GCA_003031525.1","assembly_category":"representative genome","assembly_level":"Scaffold","chromosomes":["Un"],"contig_n50":86003,"display_name":"Neophocaena_asiaeorientalis_V1","estimated_size":"659931204","org":{"assembly_counts":{"node":2,"subtree":2},"breed":"wild","common_name":"Yangtze finless porpoise","key":"1706337","parent_tax_id":"189058","rank":"SUBSPECIES","sci_name":"Neophocaena asiaeorientalis asiaeorientalis","sex":"male","tax_id":"1706337","title":"Yangtze finless porpoise"},"seq_length":"2284611699","submission_date":"2018-04-03"}}],"total_count":2}

Assembly accession is embedded in that output.

$ ./datasets assembly-descriptors taxon "Neophocaena asiaeorientalis" | jq | grep assembly_accession
        "assembly_accession": "GCF_003031525.1",
        "assembly_accession": "GCA_003031525.1",
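For anyone who wants the same extraction inside a Python script rather than with jq and grep, here is a minimal sketch. It assumes the JSON keeps the shape shown above (a top-level `assemblies` list whose entries each hold an `assembly` object); the `sample` string is a trimmed-down stand-in for the real output, and the function name is made up.

```python
import json


def assembly_accessions(descriptor_json):
    """Pull every assembly_accession out of assembly-descriptors JSON text."""
    data = json.loads(descriptor_json)
    return [
        entry["assembly"]["assembly_accession"]
        for entry in data.get("assemblies", [])
    ]


# Trimmed-down stand-in for the output of `datasets assembly-descriptors`.
sample = """
{"assemblies": [
  {"assembly": {"assembly_accession": "GCF_003031525.1"}},
  {"assembly": {"assembly_accession": "GCA_003031525.1"}}
 ],
 "total_count": 2}
"""

print(assembly_accessions(sample))
# prints ['GCF_003031525.1', 'GCA_003031525.1']
```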

ADD REPLY • link 4.3 years ago by GenoMax 148k