Hi dear community !
PS : I did search for this first and came across two Biostars posts ("How to choose NCBI viral database?" and "How to create a Blast database of viruses ?"), but I still need some clarification.
For a metagenomic analysis, I'd like to download all bacterial, fungal & viral genomes locally. I am therefore targeting NCBI GenBank (and not RefSeq).
I am following these recipes : ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf, https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#protocols.
Short description of the process :
In the NCBI GenBank directory ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/, we can see : bacteria/, fungi/, viral/. Applying the recipes to the bacteria/ & fungi/ directories was pretty straightforward :
- Locate the assembly_summary.txt file : ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/fungi/assembly_summary.txt
- Retrieve it with curl, get the "ftp_path" column (column 20) with awk & use sed to create downloadable urls : curl 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/fungi/assembly_summary.txt' | awk -F '\t' '!/^#/ {print $20}' | sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/.+/)(GCA_.+)|\1\2/\2_genomic.fna.gz|' > genbank_list_fungus.txt;
- Adapt those urls for rsync (note "-i -e", since "-ie" makes GNU sed write a backup file with an "e" suffix) : sed -i -e 's|^ftp://|rsync://|' genbank_list_fungus.txt;
- Get all (2387) fungal "genomic.fna.gz" GenBank genomes : while IFS= read -r line; do rsync --quiet --times "$line" .; done < genbank_list_fungus.txt;
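For convenience, the three steps above can be folded into one awk pass. This is only a sketch : the function name summary_to_rsync_urls is made up for illustration, and it assumes (as in the recipe) that column 20 of assembly_summary.txt holds the ftp_path. Skipping empty or "na" paths and rewriting whatever URL scheme is present to rsync are my own additions, not part of the NCBI recipe.

```shell
# Read an assembly_summary.txt on stdin, print one rsync URL per assembly.
# Assumption: column 20 is "ftp_path", as used in the steps above.
summary_to_rsync_urls() {
    awk -F '\t' '
        /^#/ { next }                      # skip header/comment lines
        $20 == "" || $20 == "na" { next }  # skip records without a path
        {
            n = split($20, parts, "/")     # last component = assembly directory name
            url = $20 "/" parts[n] "_genomic.fna.gz"
            sub(/^[a-z]+:/, "rsync:", url) # swap the URL scheme for rsync
            print url
        }'
}

# Usage (needs network access, so left commented out here):
# curl -s 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/fungi/assembly_summary.txt' \
#     | summary_to_rsync_urls > genbank_list_fungus.txt
# while IFS= read -r url; do
#     rsync --quiet --times "$url" . || echo "failed: $url" >&2
# done < genbank_list_fungus.txt
```

Reporting failed URLs on stderr makes it easier to re-run only the downloads that did not complete.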
Things get more complicated for the ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/viral/ directory :
- It does have an assembly_summary.txt file, but it only contains 3 records (for uncultured human fecal virus). There is nothing else relevant in this directory.
- If you browse the FTP site, you will find : ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/. It seems to be a legacy directory, but it does contain a lot of things, so let's try our luck. There is no assembly_summary.txt in here, but there is an all.fna.tar.gz file, which looks like what we are looking for.
- This archive contains 4374 directories (each corresponding to a different virus). Inside those directories there is a total of 5840 FNA files (some viruses have more than 1 associated sequence).
- Retrieve sequences : wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/all.fna.tar.gz; tar -zxvf all.fna.tar.gz; find . -name '*.fna' -exec cat {} \; > ncbi_genome_viruses.fasta;
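The directory and file counts quoted above can be re-checked with something like the sketch below. It assumes the archive was extracted into its own directory so that each top-level subdirectory is one virus; the function name summarise_fna_tree is hypothetical, and -mindepth/-maxdepth are find extensions available in GNU and BSD find.

```shell
# Summarise an extracted all.fna.tar.gz tree.
# $1: directory into which the archive was extracted (assumed dedicated to it)
summarise_fna_tree() {
    printf 'directories=%s files=%s sequences=%s\n' \
        "$(find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')" \
        "$(find "$1" -type f -name '*.fna' | wc -l | tr -d ' ')" \
        "$(find "$1" -type f -name '*.fna' -exec cat {} + | grep -c '^>')"
}

# Usage, after e.g. "mkdir viruses && tar -zxf all.fna.tar.gz -C viruses":
# summarise_fna_tree viruses
```

The sequences figure counts FASTA header lines, so it is the number to compare against other FASTA files when a single .fna file holds more than one record.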
Let's compare this ncbi_genome_viruses.fasta file with the RefSeq virus :
- Access RefSeq for viruses : ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/
- When you cat the viral.1.1.genomic.fna & viral.2.1.genomic.fna files, you obtain a file containing 9334 sequences.
- Comparing this "RefSeq" file with the "genome" file : 9334 vs 5840 sequences, 5719 vs 4220 complete genome sequences. The "genome" file was supposed to contain more sequences than RefSeq, not fewer. So there is an issue here.
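A comparison like the one above can be reproduced with a small helper. This is a sketch under one loud assumption : "complete genome" sequences are counted by looking for that phrase (case-insensitive) in the FASTA header lines, which may not be exactly how the figures above were obtained. The function name fasta_counts and the RefSeq file name in the usage lines are hypothetical.

```shell
# Print total and "complete genome" sequence counts for one FASTA file.
# $1: FASTA file
# Assumption: a complete genome is recognised by its header text.
fasta_counts() {
    printf '%s\ttotal=%s\tcomplete_genome=%s\n' "$1" \
        "$(grep -c '^>' "$1")" \
        "$(grep '^>' "$1" | grep -ci 'complete genome')"
}

# Usage:
# fasta_counts refseq_viral.fasta
# fasta_counts ncbi_genome_viruses.fasta
```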
Last resource available to my knowledge : https://www.ncbi.nlm.nih.gov/genome/viruses/
- 3 items in the "Download Viral Genome Data" section :
  - "Complete RefSeq release of viral and viroid sequences" <=> the link we previously used for RefSeq sequences (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/)
  - "Accession list of all viroid genomes" (not interested)
  - "Accession list of all viral genomes", which points to a file containing 114949 entries (accession numbers).
Final questions / options :
- How can I retrieve all viral genomes (not simply RefSeq genomes) ?
- What went wrong with my search of the NCBI FTP site for GenBank viral genomes ?
- Should I use the list of accession numbers available via https://www.ncbi.nlm.nih.gov/genome/viruses/ => "Accession list of all viral genomes" to retrieve all associated sequences via Entrez ?
- Is this an option : https://www.ncbi.nlm.nih.gov/nuccore/?term=Viruses%5BOrganism%5D+AND+srcdb_genbank%5BPROP%5D ?
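If the accession-list route is taken, one way it could look is to group the accessions into batches and hand each batch to efetch from NCBI's EDirect tools. This is a sketch only : it assumes EDirect is installed, that the accession is the first whitespace-separated field of each line, and that 200 IDs per request is a reasonable batch size (an arbitrary choice on my part). The function name batch_accessions and the file names are hypothetical.

```shell
# Read accessions on stdin, print comma-joined batches of at most $1 IDs.
# Assumption: the accession is the first field; "#" lines are comments.
batch_accessions() {
    awk -v size="${1:-200}" '
        /^#/ || NF == 0 { next }                      # skip comments and blank lines
        { ids = (n == 0 ? $1 : ids "," $1); n++ }     # append to the current batch
        n == size { print ids; n = 0; ids = "" }      # batch full: emit and reset
        END { if (n > 0) print ids }                  # emit the last partial batch
    '
}

# Usage (needs EDirect and network access, so left commented out here):
# batch_accessions 200 < viral_accessions.txt | while IFS= read -r ids; do
#     efetch -db nuccore -id "$ids" -format fasta < /dev/null
# done > viral_genbank.fasta
```

Redirecting efetch's stdin from /dev/null keeps it from consuming the batches that the while loop is still reading.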
Best regards
viral.1.1 & viral.2.1 contain entries such as:
"Accession list of all viral genomes" has that many entries, but it's a neighbours file, so the same genome appears on many lines. When you sort -u on the first column you're left with 9,096 entries. Meanwhile EBI lists 4,026 complete virus genomes.
I think you should be perfectly fine with ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/all.fna.tar.gz. It's not a legacy dir : that file was last updated today.
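For anyone who wants to repeat the sort -u check on the neighbours file mentioned above, here is a minimal sketch. It assumes the file is tab-separated with "#" comment lines at the top; the function name count_unique_first_column and the file name in the usage line are hypothetical.

```shell
# Count distinct values in the first tab-separated column of a file.
# $1: accession list file (assumed tab-separated, "#" lines are comments)
count_unique_first_column() {
    grep -v '^#' "$1" | cut -f1 | sort -u | wc -l | tr -d ' '
}

# Usage:
# count_unique_first_column viral_accession_list.tsv
```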