Question

Retrieve genbank viral genomes


8.1 years ago

erwan.scaon ▴ 960

Hi dear community !

PS : I did google this question first and came across two Biostars posts, but I still need some clarification : How to choose NCBI viral database?, How to create a Blast database of viruses ?.

For a metagenomic analysis, I'd like to download all bacterial, fungal & viral genomes locally. Thus I am targeting NCBI GenBank (and not RefSeq).

I am following those recipes : ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf, https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#protocols.

Short description of the process :

In the ncbi genbank directory : ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/, we can see : bacteria/, fungi/, viral/. Applying the recipes to the bacteria/ & fungi/ directories was pretty straightforward :

Locate the assembly_summary.txt file : ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/fungi/assembly_summary.txt
Retrieve it with curl, get the "ftp_path" column (column 20) with awk & use sed to create downloadable urls : curl 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/fungi/assembly_summary.txt' | awk -F '\t' '!/^#/ {print $20}' | sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/.+/)(GCA_.+)|\1\2/\2_genomic.fna.gz|' > genbank_list_fungus.txt;
Adapt those urls for rsync : sed -i -e 's|ftp://|rsync://|g' genbank_list_fungus.txt;
Get all (2387) fungal "genomic.fna.gz" genbank genomes : while IFS= read -r line; do rsync --quiet --times "$line" .; done < genbank_list_fungus.txt;
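
The three steps above can also be folded into a single awk pass. This is only a minimal sketch: the function name summary_to_rsync_urls is made up for illustration, and it assumes the "ftp_path" value sits in column 20 (as in the commands above) and starts with ftp:// or https://.

```shell
# Read an assembly_summary.txt stream on stdin and print one rsync URL per
# assembly, pointing at its *_genomic.fna.gz file.
# Assumptions: tab-separated input, ftp_path in column 20.
summary_to_rsync_urls() {
    awk -F'\t' '
        /^#/ { next }                        # skip comment and header lines
        $20 == "" || $20 == "na" { next }    # skip records without a usable path
        {
            path = $20
            sub(/^(ftp|https?):\/\//, "rsync://", path)   # switch the protocol to rsync
            n = split(path, parts, "/")                    # last component is the assembly directory name
            print path "/" parts[n] "_genomic.fna.gz"
        }'
}

# Possible usage (needs network access, so it is left commented out):
# curl -s 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/fungi/assembly_summary.txt' \
#     | summary_to_rsync_urls > genbank_list_fungus.txt
# while IFS= read -r url; do rsync --quiet --times "$url" .; done < genbank_list_fungus.txt
```

Building the file name from the last path component avoids hard-coding the GCA_ prefix, so the same function should also work on a RefSeq (GCF_) summary file.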

Things get more complicated for the ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/viral/ directory :

It does have an assembly_summary.txt file, but it only contains 3 records (for uncultured human fecal virus). There is nothing else relevant in this directory.
If you browse the ftp, you will find : ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/. It seems to be a legacy directory, but it does contain a lot of things, so let's try our luck. There is no assembly_summary.txt in here. But there is an all.fna.tar.gz file, which looks like what we are looking for.
This file contains 4374 directories (each corresponding to a different virus); inside those directories there is a total of 5840 FNA files (some viruses have more than one associated sequence).
Retrieve sequences : wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/all.fna.tar.gz; tar -zxvf all.fna.tar.gz; find . -name '*.fna' -exec cat {} \; > ncbi_genome_viruses.fasta;

Let's compare this ncbi_genome_viruses.fasta file with the RefSeq virus :

Access RefSeq for viruses : ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/
When you cat viral.1.1 & viral.2.1 genomic.fna files, you obtain a file containing 9334 sequences.
Comparing this "RefSeq" file with the "genome" file : 9334 vs 5840 sequences, 5719 vs 4220 complete genome sequences. The "genome" file was supposed to contain more sequences than the RefSeq one, not fewer. So there is an issue here.
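
For anyone wanting to redo this kind of count on their own files, here is a minimal sketch. The function name fasta_counts is hypothetical, and matching "complete genome" in the header line is an assumption about how a complete genome is recognised; headers that say "complete sequence" (segmented viruses, for instance) are not counted by it.

```shell
# Print: file name, number of FASTA records, number of records whose header
# mentions "complete genome" (case-insensitive), separated by tabs.
fasta_counts() {
    local fasta="$1"
    local total complete
    # grep -c exits non-zero when the count is 0, hence the "|| true"
    total=$(grep -c '^>' "$fasta" || true)
    complete=$(grep '^>' "$fasta" | grep -c -i 'complete genome' || true)
    printf '%s\t%s\t%s\n' "$fasta" "$total" "$complete"
}

# Possible usage:
# fasta_counts ncbi_genome_viruses.fasta
```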

Last resource available, to my knowledge : https://www.ncbi.nlm.nih.gov/genome/viruses/

3 items in the "Download Viral Genome Data" section :
"Complete RefSeq release of viral and viroid sequences" <=> the link we previously used for RefSeq sequences (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/)
"Accession list of all viroid genomes" (not interested)
"Accession list of all viral genomes", which points to a file containing 114949 entries (accession numbers).

Final questions / options :

How to retrieve all viral genomes (not simply RefSeq genomes) ?
What went wrong with my search of the NCBI ftp for genbank viral genomes ?
Should I use the list of accession numbers available via https://www.ncbi.nlm.nih.gov/genome/viruses/ => "Accession list of all viral genomes" to retrieve all associated sequences via entrez ?
Is this an option : https://www.ncbi.nlm.nih.gov/nuccore/?term=Viruses%5BOrganism%5D+AND+srcdb_genbank%5BPROP%5D ?
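
For the accession-list option, a minimal sketch of how the download could be batched through the E-utilities efetch endpoint. The function name efetch_urls and the batch size of 200 are arbitrary choices for illustration, not something taken from the NCBI recipes; the function only builds the request URLs, the actual download is left as a commented usage example.

```shell
# Read accession numbers on stdin (one per line) and print one efetch URL per
# batch of N accessions (default 200). No request is sent by this function.
efetch_urls() {
    local batch="${1:-200}"
    awk -v n="$batch" \
        -v base='https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=' \
        -v suffix='&rettype=fasta&retmode=text' '
        NF == 0 { next }                                          # ignore blank lines
        { ids = (ids == "" ? $1 : ids "," $1); count++ }          # append to the current batch
        count == n { print base ids suffix; ids = ""; count = 0 } # batch is full: emit its URL
        END { if (ids != "") print base ids suffix }              # emit the last, partial batch
    '
}

# Possible usage (needs network access, so it is left commented out):
# efetch_urls 200 < accessions.txt \
#     | while IFS= read -r url; do curl -s "$url"; sleep 1; done > viral_genomes.fasta
```

The sleep between requests is there to stay under the NCBI limit of 3 requests per second for clients without an API key.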

Best regards

ncbi genbank viral-genome • 5.1k views

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 8.1 years ago by erwan.scaon ▴ 960


viral.1.1 & viral.2.1 contain entries such as:

>ref|NC_021094.1| White clover cryptic virus 2 isolate IPP_Lirepa segment RNA 1, complete sequence
>ref|NC_021095.1| White clover cryptic virus 2 isolate IPP_Lirepa segment RNA 2, complete sequence
>ref|NC_021096.1| Red clover cryptic virus 2 isolate IPP_Nemaro segment RNA 1, complete sequence
>ref|NC_021097.1| Red clover cryptic virus 2 isolate IPP_Nemaro segment RNA 2, complete sequence
>ref|NC_021098.1| Hop trefoil cryptic virus 2 isolate IPP_GelbSK segment RNA 1, complete sequence
>ref|NC_021099.1| Hop trefoil cryptic virus 2 isolate IPP_GelbSK segment RNA 2, complete sequence

"Accession list of all viral genomes" has that many entries, but it's a neighbours file. When you sort -u on the first column you're left with 9,096 entries. Meanwhile EBI lists 4,026 complete virus genomes.

I think you should be perfectly fine with ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/all.fna.tar.gz. It's not a legacy dir: that file was last updated today.

ADD REPLY • link 8.1 years ago by 5heikki 11k

lakhujanivijay · Answer 1 · 2017-06-27

When you go to https://www.ncbi.nlm.nih.gov/genome/viruses => "Accession list of all viral genomes" => taxid10239.nbr, this is indeed a neighbours file. I think it contains a little more than 9096 entries, because some lines have multiple accession numbers :

awk -F "\t" '!/^#/ {print $1}' taxid10239.nbr > ncbi_genome_viruses_allhost.txt;
sed -i 's/,/\n/g' ncbi_genome_viruses_allhost.txt;
sort -u ncbi_genome_viruses_allhost.txt > ncbi_genome_viruses_allhost_AN.txt;
sed -i '/^$/d' ncbi_genome_viruses_allhost_AN.txt;
wc -l ncbi_genome_viruses_allhost_AN.txt; => 9216

Regarding the all.fna.tar.gz file, I still have some doubts, especially after comparing it to the RefSeq file :

refseq_file <=> ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral (*genomic.fna.gz)
genome_file <=> ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses (all.fna.tar.gz)
After retrieving all accession numbers (grep -Po '\|((NC|AC).*)\|') in each of those files (+ sort | uniq), use the comm command :
- comm refseq_file genome_file > results.tsv
Results : 3494 accessions are unique to refseq_file, 0 unique to genome_file & 5840 common to both.
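
A minimal sketch of how the three counts can be obtained directly, instead of reading them off the three-column comm output. The function name compare_accessions is hypothetical; it takes two files of accession numbers (one per line) and sorts them itself, since comm only gives correct results on sorted input.

```shell
# Compare two accession lists and print how many accessions are unique to
# each list and how many are shared.
compare_accessions() {
    local first="$1" second="$2" sorted_first sorted_second
    sorted_first=$(mktemp)
    sorted_second=$(mktemp)
    # Same locale for sort and comm, so both agree on the ordering
    LC_ALL=C sort -u "$first" > "$sorted_first"
    LC_ALL=C sort -u "$second" > "$sorted_second"
    # -23 keeps column 1 only, -13 keeps column 2 only, -12 keeps column 3 only
    printf 'only_in_first\t%s\n'  "$(LC_ALL=C comm -23 "$sorted_first" "$sorted_second" | wc -l | tr -d ' ')"
    printf 'only_in_second\t%s\n' "$(LC_ALL=C comm -13 "$sorted_first" "$sorted_second" | wc -l | tr -d ' ')"
    printf 'common\t%s\n'         "$(LC_ALL=C comm -12 "$sorted_first" "$sorted_second" | wc -l | tr -d ' ')"
    rm -f "$sorted_first" "$sorted_second"
}

# Possible usage:
# compare_accessions refseq_file genome_file
```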

Thus I plan to use the refseq_file (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral (*genomic.fna.gz)), since it seems to contain all entries in the genome_file. But this still doesn't look right to me, since I was hoping for "a true" genome/genbank file, i.e. a file with significantly more sequences than the RefSeq file.