Is there an easy way to scrape a whole bunch of NCBI RefSeq FASTA from NCBI?
3 days ago
Mark ▴ 10

I'm looking to get all the NCBI reference genomes for organisms of the following groups:

  • The phylum Prasinodermophyta
  • The phylum Rhodophyta
  • The class Phaeophyceae
  • The phylum Cyanobacteria

I know that for downloading reads, it can be as simple as downloading a bunch of SRA accession numbers off of NCBI and then getting access to all that FASTQ data (using a nice pipeline like nf-core/fetchngs), but is there a similar way to get easy access to all the FASTA data to pull all these genomes off of NCBI?

FASTA Genome NCBI • 230 views
3 days ago

I think what you want is NCBI datasets.

https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/

datasets download genome taxon --preview prints a summary of what would be downloaded; remove the --preview flag to actually download.

For Rhodophyta,

datasets download genome taxon --preview Rhodophyta

prints

Collecting 68 genome records [================================================] 100% 68/68
{"resource_updated_on":"2024-10-19T03:30:15Z","record_count":68,"estimated_file_size_mb":181,"included_data_files":{"all_genomic_fasta":{"file_count":54,"size_mb":2177.4822}}}

So it would download 68 genomes.
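
The preview is a single line of JSON, so the counts can be pulled out in a script before committing to a download. This is a minimal sketch, assuming the output format shown above. sed is used only to avoid extra dependencies; a JSON-aware tool such as jq would be more robust. The names json_int, record_count, and fasta_files are illustrative, not part of the datasets tool.

```shell
# Preview line copied from the Rhodophyta example above.
preview='{"resource_updated_on":"2024-10-19T03:30:15Z","record_count":68,"estimated_file_size_mb":181,"included_data_files":{"all_genomic_fasta":{"file_count":54,"size_mb":2177.4822}}}'

# Print the integer that follows a given JSON key.
#   $1: key name (hypothetical helper, works only for integer values)
#   $2: the JSON text
json_int() {
  printf '%s\n' "$2" | sed -n "s/.*\"$1\":\([0-9][0-9]*\).*/\1/p"
}

record_count=$(json_int record_count "$preview")
fasta_files=$(json_int file_count "$preview")
echo "records: $record_count, genomic FASTA files: $fasta_files"
```

In real use the preview variable would be filled from the command's own output rather than pasted in.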

Of these, 3 are RefSeq:

datasets download genome taxon --preview Rhodophyta --assembly-source RefSeq

Check the help (datasets download genome taxon --help); you can also add protein, RNA, or other data types to the download. Any taxonomic level works, as long as it is in NCBI Taxonomy.
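
For the four groups in the question, the same command can be repeated per taxon. The sketch below only prints the commands, so they can be checked first and then piped to sh to run. Some assumptions: the taxon names are copied from the question, and a name that does not resolve (Cyanobacteria may be listed under a newer name) can be replaced with its numeric NCBI taxid. The --include and --filename flags are as I recall them from the datasets help, so it is worth confirming them for the installed version. The output file names and the function names are illustrative.

```shell
# Print the preview (dry-run) command for one taxon.
#   $1: taxon name or NCBI taxid
preview_cmd() {
  echo "datasets download genome taxon $1 --assembly-source RefSeq --preview"
}

# Print the download command for one taxon.
#   $1: taxon name or NCBI taxid
#   $2: comma-separated data types for --include (defaults to genome)
download_cmd() {
  echo "datasets download genome taxon $1 --assembly-source RefSeq --include ${2:-genome} --filename ${1}_refseq.zip"
}

# Emit a preview and a download command for each group in the question.
for taxon in Prasinodermophyta Rhodophyta Phaeophyceae Cyanobacteria; do
  preview_cmd "$taxon"
  download_cmd "$taxon" genome,protein
done
```

Each download produces a zip archive; giving every taxon its own file name keeps the four packages from overwriting one another.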


It looks great; I should give it a try. I usually use genome_updater (https://github.com/pirovc/genome_updater).
