Having issues downloading reference bacteria genome from NCBI FTP website
2.9 years ago
krastegar0 • 0

Hi everyone, I am new to bioinformatics and am working on my thesis project, which requires me to download the reference bacterial genomes listed in ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt. I am super green in this field, so I really don't know what I am doing. Here is the code that I was given to download the genome FASTA files.

    wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
    grep 'Complete Genome' assembly_summary.txt \
    > assembly_summary_complete_latest_reference_genomes.txt
    awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary.txt \
    > assembly_summary_complete_latest_reference_genomes_paths.txt
    mkdir BacterialGenomes
    for i in $(cat assembly_summary_complete_latest_reference_genomes_paths.txt)
    do
    wget -P BacterialGenomes ${i}/*genomic.fna.gz
    done

When I run this script, I get stuck in an infinite loop with the same error messages (posted below). I am using Ubuntu Linux, in case anyone is wondering.

    Warning: wildcards not supported in HTTP.
    --2021-12-25 21:14:23-- https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/157/365/GCA_002157365.2_ASM215736v2/*genomic.fna.gz
    Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.230, 130.14.250.10, 2607:f220:41f:250::230, ...
    Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|165.112.9.230|:443... connected.
    HTTP request sent, awaiting response... 404 Not Found
    2021-12-25 21:14:23 ERROR 404: Not Found.

Thank you for any help you may be able to provide!

wget troubleshooting Linux • 2.2k views

I also tried doing this in R using

biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "genome")

but I get an error saying

The FTP link: 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/128/725/GCF_900128725.1_BCifornacula_v1.0/GCF_900128725.1_BCifornacula_v1.0_genomic.fna.gz' seems not to be available at the moment. This might be due to an instable internet connection, a firewall issue, or wrong organism name. Could you please try to re-run the function to see whether it works now?
The FTP link: 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/128/725/GCF_900128725.1_BCifornacula_v1.0/md5checksums.txt' seems not to be available at the moment. This might be due to an instable internet connection, a firewall issue, or wrong organism name. Could you please try to re-run the function to see whether it works now?
Genome download of Buchnera_aphidicola is completed!
The download session seems to have timed out at the FTP site 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/128/725/GCF_900128725.1_BCifornacula_v1.0/GCF_900128725.1_BCifornacula_v1.0_genomic.fna.gz'. This could be due to an overload of queries to the databases. Please restart this function to continue the data retrieval process or wait for a while before restarting this function in case your IP address was logged due to an query overload on the server side.
Error: Please provide a valid file path to your genome assembly file.
In addition: There were 11 warnings (use warnings() to see them)
2.9 years ago
MirianT_NCBI ▴ 760

Hi, based on your question, I assume you're trying to download all bacterial reference genomes from NCBI, right? You can use the NCBI datasets command line tool (https://www.ncbi.nlm.nih.gov/datasets/docs/v1/quickstarts/command-line-tools/) for that. The tool's GitHub page has more info if that's helpful.

After you download the program (which can also be installed using conda), here are the steps:

  1. Download a dehydrated data package that contains metadata and the paths to all reference bacterial genomes (by "reference", I'm assuming you mean all bacterial genomes with GCF accession numbers):

    datasets download genome taxon bacteria --assembly-source refseq --dehydrated --filename bacteria_refseq.zip

  2. Unzip the file:

    unzip bacteria_refseq.zip -d bacteria_refseq

  3. Rehydrate the file:

    datasets rehydrate --directory bacteria_refseq/
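
Once the rehydrate step finishes, it can be worth checking that the genome sequences actually arrived. This is only a minimal sketch: count_fasta is a hypothetical helper name, and it assumes the genomic FASTA files sit somewhere under the unzipped directory with names of the form GCF_..._genomic.fna (or GCA_...), so that CDS and RNA FASTA files are not counted.

```shell
#!/usr/bin/env bash
# count_fasta (hypothetical helper): count genomic FASTA files below a directory.
# Assumption: genomic files are named GCF_*_genomic.fna or GCA_*_genomic.fna;
# the pattern deliberately skips names such as cds_from_genomic.fna.
count_fasta() {
    find "$1" -type f -name 'GC[AF]_*_genomic.fna' | wc -l | tr -d '[:space:]'
}
```

Running count_fasta bacteria_refseq prints a single number, which you can compare with the number of assemblies you expected to receive.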

I'm recommending the dehydrated option because it's actually faster and more reliable, despite the additional steps. By default, the data package includes genomic, transcript, protein and cds sequences, in addition to gff3. If you only need the genomic fasta sequences, you can use this command instead:

    datasets download genome taxon bacteria --assembly-source refseq \
    --dehydrated --exclude-protein --exclude-genomic-cds \
    --exclude-rna --exclude-gff3 --filename bacteria_refseq_fasta.zip

After that, follow steps 2 and 3 in the same way, using bacteria_refseq_fasta.zip as the file name.
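
If you would rather keep the wget loop from the question, the 404 appears to come from the wildcard: the requests in your log go to https:// URLs, and, as the warning says, wget cannot expand * over HTTP. Judging by the URLs in the biomartr messages, each genome file is named after its directory plus _genomic.fna.gz, so the exact URL can be built without a wildcard. Here is a minimal sketch under that assumption; genome_url and download_all are hypothetical helper names, and the paths file is the one your awk command writes.

```shell
#!/usr/bin/env bash
# genome_url (hypothetical helper): turn a directory URL from column 20 of
# assembly_summary.txt into the exact URL of its genomic FASTA file.
# Assumption: the file is named <directory name>_genomic.fna.gz.
genome_url() {
    local dir="${1%/}"        # drop a trailing slash, if any
    local name="${dir##*/}"   # last path component, e.g. GCF_..._ASM...
    printf '%s/%s_genomic.fna.gz\n' "$dir" "$name"
}

# download_all (hypothetical helper): fetch one genome per line of a paths file.
download_all() {
    local paths_file="$1" out_dir="$2"
    mkdir -p "$out_dir"
    while IFS= read -r path; do
        [ -n "$path" ] || continue   # skip empty lines
        wget -nv -P "$out_dir" "$(genome_url "$path")"
    done < "$paths_file"
}
```

Calling download_all assembly_summary_complete_latest_reference_genomes_paths.txt BacterialGenomes would then stand in for the for loop. Spelling out the file name also avoids picking up the other files in each directory whose names end in genomic.fna.gz.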

Let me know if that works or if you have any other questions. :)
