Tutorial: How to get FASTQ reads from the Sequence Read Archive (SRA)
3.1 years ago

There are many ways to get FASTQ data out of the short-read archive. I have spent a few hours today investigating the various approaches. (Edit: see also the many alternatives posted as follow-ups.)

TLDR

  1. if you want a subset of the reads, say 1000 reads, use fastq-dump -X 1000 SRR14575325
  2. if you want the entire file, use fasterq-dump SRR14575325
  3. if you want to be in full control, find the URLs, then use wget or curl to get the data
  4. if you feel lucky, use prefetch SRR14575325

We will be downloading the file SRR14575325

  • SRR14575325.sra is 577 MB
  • SRR14575325.fastq is 3.3 GB

Note that some methods cache files and thus store both the SRA and the FASTQ file. For those tools, subsequent FASTQ conversions will be faster. I am cleaning the cache in my examples only to ensure that I measure the performance correctly.
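If you want to see what the cache currently takes up, a quick check is below. This is a sketch; ~/ncbi/public/sra is the default cache location on my system and may differ on yours if you changed it with vdb-config.

```shell
# Report the size of the SRA cache directory, or say so if it is absent.
CACHE="$HOME/ncbi/public/sra"
du -sh "$CACHE" 2>/dev/null || echo "no cache at $CACHE"
```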

Tools

Some examples require tools from the SRA Toolkit; to install them use:

# Currently installs version 2.9
conda install -c bioconda sra-tools

or visit the SRA Toolkit webpage and download the binaries.

Use fastq-dump

Use fastq-dump directly:

# Clean the cache
rm -f ~/ncbi/public/sra/SRR14575325*

# Convert ten reads
time fastq-dump -X 10 SRR14575325
# 1 second

# Convert all reads
time fastq-dump SRR14575325
# 5 minutes

Total time 5 minutes. fastq-dump stores the SRA file in a cache folder; on my system it is located at

~/ncbi/public/sra/SRR14575325.sra

Subsequent fastq-dump runs on the same accession will take 1 minute. The principal advantage of fastq-dump over all other methods is that it supports partial download of the data.
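After downloading a subset with -X, a quick sanity check is that a FASTQ record is exactly 4 lines. A sketch, using a mocked two-read file in place of real fastq-dump output so it can be shown offline:

```shell
# Create a tiny mock FASTQ (2 reads, 4 lines each) standing in
# for real fastq-dump output.
printf '@r1\nACGT\n+\nFFFF\n@r2\nTTGG\n+\nFFFF\n' > mock.fastq

# A FASTQ record is 4 lines, so reads = lines / 4.
READS=$(( $(wc -l < mock.fastq) / 4 ))
echo "$READS reads"
```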

Use fasterq-dump

fasterq-dump is the future replacement for fastq-dump. According to the documentation, it may require up to 10x as much disk space as the final file. In addition, it does not yet support downloading a subset of the data the way fastq-dump does:

# Clean your cache file
rm -f ~/ncbi/public/sra/SRR14575325*

# Convert all reads
time fasterq-dump -f SRR14575325
# 1.1 minutes

Total time 1 minute. fasterq-dump also stores the data in the cache as:

~/ncbi/public/sra/SRR14575325.sra.cache

Subsequent runs take 30 seconds.

Download the SRA file directly

The challenge here is finding the proper URLs. For example, the SRA file URL is in the 10th column of the output that you get with:

# Find the URL
efetch -db sra -id SRR14575325 -format runinfo | cut -f 10 -d ,

prints:

download_path
https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/SRR14575325/SRR14575325.1
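To capture just the URL, drop the download_path header line with tail -n +2. A sketch where the efetch pipeline is mocked with a function printing the two lines above, so the parsing step can be verified offline; replace mock_runinfo with the real efetch | cut pipeline when online:

```shell
# Mock of the `efetch ... | cut -f 10 -d ,` output shown above.
mock_runinfo() {
  echo "download_path"
  echo "https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/SRR14575325/SRR14575325.1"
}

# tail -n +2 drops the CSV header, leaving only the URL.
URL=$(mock_runinfo | tail -n +2)
echo "$URL"
```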

Download an SRA file locally and use that:

# Clean your cache file
rm -f ~/ncbi/public/sra/SRR14575325*

URL1=https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/SRR14575325/SRR14575325.1

time wget $URL1
# 52 seconds

fastq-dump SRR14575325.1
# 1 minute

Total time 2 minutes. As before, we end up with both the SRA and the FASTQ file.

Using prefetch

The sratools prefetch command downloads an SRA file and stores it in a cache directory. The behavior of prefetch has changed: versions before 2.10 download files into the cache directory, while versions 2.10 and above download them into the current working directory.

The new versions of prefetch no longer operate seamlessly with fastq-dump. For versions under 2.10, the two commands:

prefetch SRR14575325
fastq-dump SRR14575325

would both make use of the same files. Alas with the new version, you would need to run them like so:

prefetch SRR14575325
fastq-dump SRR14575325/SRR14575325.sra
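Incidentally, a small wrapper can pick the right fastq-dump argument for either layout. A sketch where the prefetch result is simulated with an empty file (the 2.10+ accession-subdirectory layout is assumed for the simulation):

```shell
ACC=SRR14575325
mkdir -p "$ACC" && touch "$ACC/$ACC.sra"   # simulate a 2.10+ prefetch download

# 2.10+ puts the .sra inside an accession subdirectory;
# older versions resolve the accession through the cache.
if [ -f "$ACC/$ACC.sra" ]; then
    TARGET="$ACC/$ACC.sra"
else
    TARGET="$ACC"
fi
echo "would run: fastq-dump $TARGET"
```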

... ¯\_(ツ)_/¯ ... all in the name of progress, I guess. Just remember that commands and examples in training materials may not work correctly anymore. Some people claim that prefetch can download FASTQ files with

prefetch --type fastq SRR14575325

but when I tried it I got:

2021-10-14T18:01:00 prefetch.2.11.2 err: name not found while resolving query within virtual file system module - failed to resolve accession 'SRR1972739' - no data ( 404 )

Getting these weird errors with sratools is not uncommon. Various fixes exist (Google for them), yet no solution seems reliable; see: https://github.com/ncbi/sra-tools/issues/35

If you get this error, try some fixes or just pick a different method from the list.

But let's continue the journey; we ran the commands below with version 2.9:

# Clean the cache
rm -f ~/ncbi/public/sra/SRR14575325*

time prefetch SRR14575325
# 57 seconds

# Convert ten reads
time fastq-dump -X 10 SRR14575325
# 0 seconds

# Convert all reads
time fastq-dump SRR14575325
# 1 minute

Total time 2 minutes. This stores the SRA file in the cache under the same name as before (~/ncbi/public/sra/SRR14575325.sra).

Subsequent conversions with fastq-dump will take 1 minute since they use the cache file.

Download from EBI

To find the EBI link to an SRA file, use:

curl -X GET "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=SRR14575325&fields=fastq_ftp&result=read_run"

prints:

run_accession   fastq_ftp
SRR14575325 ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/025/SRR14575325/SRR14575325.fastq.gz
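Note that the fastq_ftp field comes back without a scheme; prepend https:// (or ftp://) before handing it to wget:

```shell
# Field value as returned by the filereport API (no scheme).
FASTQ_FTP="ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/025/SRR14575325/SRR14575325.fastq.gz"

# wget needs a full URL, so add the scheme.
URL="https://${FASTQ_FTP}"
echo "$URL"
```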

Let's use the EBI link:

URL=https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/025/SRR14575325/SRR14575325.fastq.gz
wget $URL

The download was slow, with an estimated time of 15 minutes; I did not wait for it to finish. The next day I tried again and the download seemed much faster, under a minute. Your mileage may vary.

fastq sra • 16k views
ADD COMMENT

Question: I just need to make sure of something. You clean the cache here (rm -f ~/ncbi/public/sra/SRR14575325*) even though you have not downloaded the file before, just in case? Does it really affect the download time significantly?

ADD REPLY

It is generally a good idea to clean the cache from time to time. Often we use a large partition with a much bigger space quota to download data for analysis, but with the default settings the cached files end up in our home directories anyway and clutter them up. Either configure the toolkit to use a different directory, or check regularly.

ADD REPLY

Quick questions: is it possible that, when downloading 10+ gigabytes of RNA-seq data from the NCBI SRA archive with fastq-dump or fasterq-dump over a WiFi connection, you might not acquire all of the data? Is there a command-line tool to check the integrity of the data? If we previously downloaded the same data, is it a smart move to clean the cache after we deleted those files?

ADD REPLY

A WiFi connection for that much data sounds like a bad idea, but if you have great upstream connectivity (e.g. fiber) and a new WiFi 6 router it may be OK ... as long as you are patient.

The vdb-validate program included in the SRA Toolkit will allow you to validate the downloaded data.

ADD REPLY

thanks!!!!!

ADD REPLY
3.1 years ago

The SRA Explorer offers more user-friendly navigation of SRA and also lists the URLs to the data.

ADD COMMENT

Great tutorial!

Just to add, there is also a great video from Babraham Bioinformatics that covers some of the topics mentioned in this tutorial. This is how I first learned.

ADD REPLY
3.1 years ago

The nf-core fetchngs Nextflow pipeline is also quite handy for downloading from SRA, ENA, GEO, etc. You just need a list of IDs to whack into:

nextflow run nf-core/fetchngs -r 1.3 --input ids.txt -profile singularity

Easily parallelized in an HPC environment and also snags the metadata, optionally formatting it into samplesheets for downstream nf-core pipelines if wanted.
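The input file mentioned above is just plain text with one accession per line; for example, for the run used in this tutorial:

```shell
# One SRA/ENA/GEO accession per line; the pipeline figures out the rest.
echo "SRR14575325" > ids.txt
cat ids.txt
```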

ADD COMMENT
3.1 years ago

My personal method is to download from ENA using ascp.

Instructions can be found here.

You can download ascp as part of aspera here:

The default installation installs aspera into your home directory, so no root access is needed.

The file mentioned by @IstvanAlbert above can then be downloaded with

time ascp -QT -l 300m -P33001 \
              -i ~/.aspera/cli/etc/asperaweb_id_dsa.openssh \
              era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR145/025/SRR14575325/SRR14575325.fastq.gz .

Here, ~/.aspera/cli is the location aspera is installed to.

The URLs to download from are easily converted from ftp to ascp by:

  1. Changing ftp.sra.ebi.ac.uk to fasp.sra.ebi.ac.uk.
  2. Adding the era-fasp user to the front.
  3. Changing the first / to a :
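The three steps can be scripted with sed; a sketch, using the EBI FTP path from above:

```shell
FTP_URL="ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/025/SRR14575325/SRR14575325.fastq.gz"

# Steps 1+2: swap the host and prepend the era-fasp user;
# step 3:    turn the first / (now the one after the host) into a :
ASCP_URL=$(echo "$FTP_URL" \
  | sed -e 's|^ftp\.sra\.ebi\.ac\.uk|era-fasp@fasp.sra.ebi.ac.uk|' -e 's|/|:|')
echo "$ASCP_URL"
```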

The -l 300m in the command limits the transfer rate to 300 Mbit/s, which is sometimes necessary to prevent you from saturating your institution's internet connection.

Using this method, the file that took @IstvanAlbert 15 minutes to download took me 33 seconds.

ADD COMMENT

Also, if I were just downloading a single file, I would usually grab the URL from the European Nucleotide Archive (ENA) website: just go to https://www.ebi.ac.uk/ena/browser/home and put the SRA accession into the search box.

ADD REPLY
2.8 years ago
liorglic ★ 1.4k

Personally, I prefer to avoid SRA altogether, as I find it hard to use and slow. Instead I use ENA, which mirrors all the SRA data with a friendlier interface and better performance. For small downloads, one can just use wget to fetch FASTQ files from the FTP site. For larger downloads, I use the tool Kingfisher (previously ena-fast-download), which uses the Aspera file transfer protocol. For me it increased download speed ~50x!

ADD COMMENT
2.8 years ago
Michael 55k

Interestingly, fastq-dump can also fetch assemblies. I am not sure whether that feature is documented anywhere; I hope it is not going away soon.

The following script tries to find all assemblies (TSA and WGS) for a given taxon and then downloads them. Use a low taxonomic rank, e.g. the family level ("Caligidae" is safe to try); otherwise the download will be huge.

#!/bin/sh
### usage: fetchAllAssembliesByTaxid.sh <taxon name>
### e.g.:  sh fetchAllAssembliesByTaxid.sh Caligidae
TAX=$1

## For a numeric taxid, use the query string 'txid'${TAX}'[Organism:exp]' instead.
RESULT=$(esearch -db nuccore -query '(('${TAX}'[Organism:exp]) AND ( "tsa master"[Properties] OR "wgs master"[Properties] ))' | \
  efetch -format xml | tee ${TAX}.esearch.xml)

echo "running xtract..."
ID=$(echo "$RESULT" | xtract -pattern Seq-entry -element Textseq-id_name)

COUNT=$(echo $ID | wc -w)
echo "processing $COUNT entries"

for I in $ID; do
  echo "Downloading $I ..."
  if [ -e "$I.fasta" ]; then
    echo " skipping because file exists."
    continue  # skip if the file has been downloaded already
  fi
  fastq-dump --fasta 80 -F "$I"
done

echo "finished downloading"
ADD COMMENT
