How to batch download SARS-CoV-2 sequence data from NCBI?
4.0 years ago
2001linana ▴ 40

Hi, I was trying to download SARS-CoV-2 sequence data from NCBI using this link: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=SARS-CoV-2,%20taxid:2697049 When I click the empty checkbox, I only get about 200 sequences each time. Is there a way to batch download all the genome sequences with one click? I think I did this before, but I cannot recall how. Many thanks.

sequence sequencing • 3.5k views

You can look up the assembly IDs and download from the FTP site, for example:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.3_ASM985889v3/


Many thanks for your kind reply. Could you be a bit more specific about the steps?


I clicked on the link you posted, clicked on the RefSeq Genome tab, then clicked on the assembly:

https://www.ncbi.nlm.nih.gov/assembly/GCF_009858895.2

Then I clicked on "FTP directory for GenBank assembly".

You can get the FASTA sequence from:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_genomic.fna.gz

And the gene annotations (GFF format) from:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_genomic.gff.gz
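
If you have several assembly IDs, the FTP paths above follow a visible pattern that can be built programmatically. The following is only a sketch in Python: the function name assembly_ftp_url is made up for illustration, and the path layout is inferred from the URLs in this thread, so check a generated URL in a browser before relying on it. It also assumes the assembly name contains no spaces or other special characters.

```python
import re

FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def assembly_ftp_url(accession, assembly_name, suffix=""):
    """Build the FTP directory URL (or a file URL) for one assembly.

    accession      e.g. "GCA_009858895.3"
    assembly_name  e.g. "ASM985889v3"
    suffix         e.g. "genomic.fna.gz"; leave empty for the directory URL
    """
    match = re.fullmatch(r"(GC[AF])_(\d{9})\.\d+", accession)
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    prefix, digits = match.groups()
    # The nine digits become three directory levels, e.g. 009/858/895.
    parts = [digits[i:i + 3] for i in range(0, 9, 3)]
    stem = f"{accession}_{assembly_name}"
    url = "/".join([FTP_ROOT, prefix, *parts, stem]) + "/"
    if suffix:
        url += f"{stem}_{suffix}"
    return url


if __name__ == "__main__":
    print(assembly_ftp_url("GCA_009858895.3", "ASM985889v3", "genomic.fna.gz"))
```

The resulting URLs can then be passed to wget, curl, or any other downloader.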


I downloaded those sequences as you described in March 2021. I am now trying to download them again, but every attempt has failed with errors. I also checked the NCBI command-line tools, Entrez, and the virus datasets. Do you have any other solution, or do you know of any other resource for SARS-CoV-2 nucleotide and amino acid sequences?
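
For downloads that fail intermittently, one common workaround is to retry with an increasing delay. The Python sketch below is only an illustration: download_with_retries is a made-up helper, the attempt count and delays are arbitrary, and it will not help if the failure is permanent (for example a wrong URL).

```python
import time
import urllib.request


def download_with_retries(url, dest, attempts=4, base_delay=2.0,
                          fetch=urllib.request.urlretrieve, sleep=time.sleep):
    """Call fetch(url, dest), retrying on OSError with exponential backoff.

    Returns the number of attempts that were needed. The fetch and sleep
    parameters exist so the behaviour can be tried out without a network.
    """
    for attempt in range(1, attempts + 1):
        try:
            fetch(url, dest)
            return attempt
        except OSError as error:  # URLError and socket timeouts are OSError subclasses
            if attempt == attempts:
                raise  # give up and surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({error}); retrying in {delay:.0f}s")
            sleep(delay)
```

A call would look like download_with_retries(url, "genomic.fna.gz"), where url is one of the FTP links posted above.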


I am downloading it right now without any issues, using:

datasets download genome taxon sars-cov-2 --filename virus.zip

There are close to 340,000 genomes for SARS as of today.

Edit: The final file was 8.8 GB.

4.0 years ago
vkkodali_ncbi ★ 3.8k

You can use NCBI Datasets for this. A dedicated page for Coronavirus Datasets is available. If you prefer, a command-line tool is also available. For example, you can use it to download SARS-CoV-2 data as shown below:

datasets download virus genome taxon sars-cov-2 --complete-only --filename virus.zip
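
Once virus.zip has been downloaded, it can be useful to check how many sequences it actually contains. The Python sketch below counts FASTA header lines in every FASTA-like file inside the archive. The helper name count_fasta_records and the list of file extensions are assumptions made for illustration; the exact layout inside the archive may vary between versions of the tool, which is why the code scans all members rather than hard-coding a path.

```python
import zipfile

# Extensions treated as FASTA; this list is an assumption, adjust as needed.
FASTA_SUFFIXES = (".fna", ".fasta", ".fa", ".faa")


def count_fasta_records(zip_path):
    """Return {member_name: number_of_sequences} for FASTA files in the archive."""
    counts = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(FASTA_SUFFIXES):
                continue
            with archive.open(name) as handle:
                # Stream line by line so multi-gigabyte files are never held in memory.
                counts[name] = sum(1 for line in handle if line.startswith(b">"))
    return counts


if __name__ == "__main__":
    for member, total in count_fasta_records("virus.zip").items():
        print(f"{member}: {total} sequences")
```

Reading directly from the zip avoids having to unpack several gigabytes just to get a count.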

It looks like NCBI has 12 genomes of the original SARS virus (SARS total minus SARS-CoV-2). Can those be separately categorized in a link on the genome page?

Update: If I change the setting to All hosts from human it now shows 30246 SARS genomes but no SARS-CoV-2. Something does not seem right.

4.0 years ago
GenoMax 147k

"When I click the empty box, I can only get like 200 sequences, each time."

Try this: do not click any boxes. Click the Download button at the top. In step 2, "Download All Records" should be selected automatically. This downloads ALL sequences. As of today that number stands at 43,676 genomes (a ~1.2 GB file).
