Hey all,
I am trying to download many runs from a large SRA data set. In the past I have collected the FTP link for each read set in a file and then run a wget loop in a shell script to download every link. I was wondering whether there is a better way. I need to download a very large set (terabytes) with many paired-end reads. Is there a way to do this in Run Selector? I know it lets you download a JWT, but I'm not sure how that works.
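One way to avoid collecting FTP links by hand is to ask the ENA Portal API for a file report on the whole study. The sketch below assumes the data set is mirrored at ENA and that the `filereport` endpoint returns a tab-separated table with a `fastq_ftp` column (paired files separated by `;`). The function names and the accession in the usage note are placeholders, not part of the original workflow.

```python
import csv
import io
import urllib.parse
import urllib.request

# ENA Portal API endpoint that lists the files belonging to a study or run.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession):
    """Build the file report URL for a study or run accession."""
    query = urllib.parse.urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp,fastq_md5",
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_fastq_urls(tsv_text):
    """Extract one FTP URL per FASTQ file from the TSV report text."""
    urls = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Paired-end runs list both files in one cell, separated by ';'.
        for path in (row.get("fastq_ftp") or "").split(";"):
            if path:
                urls.append("ftp://" + path)
    return urls


def fetch_fastq_urls(accession):
    """Download the report for an accession and return its FASTQ URLs."""
    with urllib.request.urlopen(filereport_url(accession)) as response:
        return parse_fastq_urls(response.read().decode("utf-8"))
```

Calling `fetch_fastq_urls("PRJNA000000")` (a made-up accession) and writing the result one URL per line to a file gives a list that `wget -i` can read directly, which replaces the manual link-gathering step.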
I found that downloading from ENA is much faster than from SRA, but still rather slow. What really improved speed for me was using this script (based on Aspera). Download speed increased about 60-fold.
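The linked script is not reproduced here, so the following is only a rough sketch of what an Aspera-based ENA download usually looks like. It assumes the commonly used ENA settings (user `era-fasp`, host `fasp.sra.ebi.ac.uk`, port 33001) and that `ascp` and its key file are already installed. The function name and default rate are illustrative.

```python
import subprocess


def ascp_command(fastq_ftp_path, key_file, out_dir=".", max_rate="300m"):
    """Turn an ENA fastq_ftp path into an ascp command (argument list)."""
    prefix = "ftp.sra.ebi.ac.uk/"
    if not fastq_ftp_path.startswith(prefix):
        raise ValueError("expected a path starting with " + prefix)
    # Aspera serves the same directory tree under a different host and user.
    remote = "era-fasp@fasp.sra.ebi.ac.uk:" + fastq_ftp_path[len(prefix):]
    return [
        "ascp",
        "-QT",            # fair transfer policy, no encryption
        "-l", max_rate,   # maximum transfer rate
        "-P", "33001",    # TCP port used for the SSH session
        "-i", key_file,   # private key shipped with Aspera Connect
        remote,
        out_dir,
    ]


def download(fastq_ftp_path, key_file, out_dir="."):
    """Run ascp for one file and raise if it exits with an error."""
    subprocess.run(ascp_command(fastq_ftp_path, key_file, out_dir), check=True)
```

Passing the command as a list rather than a shell string sidesteps quoting problems with the key path.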
Here is a tutorial that covers efficient download from both ENA and NCBI: "Fast download of FASTQ files from the European Nucleotide Archive (ENA)".
For large data sets, downloading the FASTQ files directly from ENA is probably the fastest way.
If the data is access-restricted and you have to download from NCBI, then use prefetch and parallel-fastq-dump, both covered in the tutorial.
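For the NCBI route, the two steps can be chained per run. This is a minimal sketch assuming `prefetch` (from sra-tools) and `parallel-fastq-dump` are on the PATH and that the run is readable with your current credentials. Any extra options needed for access-restricted data are not shown, and the helper names are made up for illustration.

```python
import subprocess


def sra_commands(run_id, out_dir, threads=4):
    """Return the prefetch and parallel-fastq-dump commands for one run."""
    return [
        # Step 1: fetch the .sra file into the configured SRA cache.
        ["prefetch", run_id],
        # Step 2: convert to gzipped FASTQ, one file per mate.
        [
            "parallel-fastq-dump",
            "--sra-id", run_id,
            "--threads", str(threads),
            "--outdir", out_dir,
            "--split-files",
            "--gzip",
        ],
    ]


def download_run(run_id, out_dir, threads=4):
    """Run both steps in order, stopping at the first failure."""
    for cmd in sra_commands(run_id, out_dir, threads):
        subprocess.run(cmd, check=True)
```

Looping `download_run` over a list of run accessions gives the same effect as the wget loop, but with the conversion to FASTQ included.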
It also contains a link to sra-explorer, a handy tool that can provide download links for NCBI fastq files and is great to query NCBI for data.
Awesome, thanks! I have it mostly working, but I am encountering an error when I run the for loop. Any idea where this may come from? I'll also show the command being used below.
I'm sorry, I'm not exactly sure what is meant by content. What I've been doing is running a single line from the download.txt file to see if it works before running the full download for loop.
I actually managed to fix that issue. Now, when iterating over the download files, I receive the following error:
Session Stop (Error: Private key file not found at path /global/home/users/user_name/.ssh/$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh and path /global/home/users/user_name/.ssh/$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh)
So I followed your code exactly. The only difference I can think of is that I have download.txt in my scratch folder (where I have a lot of space on the HPC), not in my home directory.
I've managed to troubleshoot that problem: $HOME was not being resolved in the key path, so I replaced it with ~. Now I get the following error:
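The path in that error still contains a literal `$HOME` appended to `~/.ssh`, which suggests (this is a guess from the message alone) that the variable was never expanded before the path reached ascp. A small sketch of expanding and checking the key path up front, with hypothetical helper names:

```python
import os


def resolve_key_path(raw_path):
    """Expand $VARS and a leading ~ and return an absolute path."""
    expanded = os.path.expanduser(os.path.expandvars(raw_path))
    return os.path.abspath(expanded)


def check_key(raw_path):
    """Return the resolved key path, or fail early if the file is missing."""
    path = resolve_key_path(raw_path)
    if not os.path.isfile(path):
        raise FileNotFoundError("Aspera key not found: " + path)
    return path
```

Passing the output of `check_key` to `-i` means a wrong key location shows up before any transfer starts, regardless of which directory the loop is run from.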
ascp: Failed to open TCP connection for SSH, exiting.
Session Stop (Error: Failed to open TCP connection for SSH)
Is this an issue with connecting to ENA, or with the server I am running the command from?
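One way to narrow this down is to test whether a plain TCP connection to the Aspera endpoint can be opened from the machine in question. The sketch below assumes the usual ENA Aspera host and port (`fasp.sra.ebi.ac.uk`, 33001). The function name is illustrative.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

If `can_connect("fasp.sra.ebi.ac.uk", 33001)` is False on the HPC node but True from another network, a local firewall is the likelier cause. If it fails everywhere, the problem is more likely on the ENA side. Note that this only checks the TCP port used for the SSH session, so a blocked UDP data channel would not be detected.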
Yes, ENA is currently moving its data center and has announced that services will be impaired or unavailable over the next week(s). The virus that is spreading around will do its part in slowing things down as well. For now it is probably best to download from SRA with prefetch and then convert with fastq-dump. My tutorial covers this as well.
There are literally hundreds of answers on Biostars to this question. Maybe start with this, and there are many other links on the same page where it says "Similar posts."
Yes, good point. The script you link is based on the tutorial I linked in my answer.