This is a brief tutorial about methods of downloading sra, sam and fastq files, mainly focusing on Aspera Connect.
NCBI-SRA and EBI-ENA databases
SRA (Sequence Read Archive): Run by NCBI (National Center for Biotechnology Information), it is a database that stores raw high-throughput sequencing (HTS) data, alignment information and metadata. Authors are asked to upload almost all HTS data from published papers here, where it is stored in the compressed .sra file format.
ENA (European Nucleotide Archive): Run by EBI (European Bioinformatics Institute). Although it has the same function as SRA, its richer annotations and friendlier website make it preferable. What's more, you can download fastq.gz files directly from it.
File Downloading
Mostly, we download sra files for the purpose of getting corresponding fastq or sam files, so as to use them in our own pipeline for downstream analysis.
- Places: Search the ENA database first with the SRR (SRA Run) accession number to check whether the run is there. If not, go to the SRA database.
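The "check ENA first" step can also be scripted. Below is a minimal sketch, assuming the ENA Portal API filereport endpoint and its fastq_aspera and fastq_md5 fields; SRR1234567 is only a placeholder accession, and the function names are my own. It builds the query URL and parses the tab-separated answer, so an empty result means you should fall back to SRA.

```python
import csv
import io
from urllib.parse import urlencode

# Assumption: ENA Portal API endpoint that reports the files behind a run.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession):
    """Build the filereport URL for one run accession (e.g. an SRR number)."""
    query = urlencode(
        {
            "accession": accession,
            "result": "read_run",
            "fields": "run_accession,fastq_aspera,fastq_md5",
        },
        safe=",",  # keep the commas in the field list readable
    )
    return f"{ENA_FILEREPORT}?{query}"


def parse_filereport(tsv_text):
    """Return (aspera_path, md5) pairs; an empty list means ENA has no fastq."""
    files = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Paired-end runs list several files separated by semicolons.
        paths = [p for p in row["fastq_aspera"].split(";") if p]
        sums = [s for s in row["fastq_md5"].split(";") if s]
        files.extend(zip(paths, sums))
    return files


# Usage (needs network access, so it is left commented out):
# from urllib.request import urlopen
# text = urlopen(filereport_url("SRR1234567")).read().decode()
# print(parse_filereport(text))
```

If the list comes back empty, the run is probably not mirrored as fastq at ENA and the SRA route below applies.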
Methods:
- First Choice -- Aspera Connect. It is commercial high-speed file transfer software produced by IBM. Since IBM has a contract with NCBI and EBI, we can use it to download data from those two databases for free. Many sites can transfer data at 200-500 Mbps, and nearly all sites can transfer at faster than 10 Mbps.
If Aspera Connect doesn't work, I would recommend the prefetch command in sratoolkit.
As a last resort, try fastq-dump and sam-dump in sratoolkit. If the connection of fastq-dump is unstable, I would suggest the wonderdump script from the Biostar Handbook.
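The order of preference above can be written down as command lines. This is only a sketch under some assumptions: the key file path is where Aspera Connect usually puts its public key on Linux (newer versions may differ), the remote path in the usage lines is a hypothetical example, and the helper names are my own. The ascp flags used are -Q and -T (fair policy, no encryption), -l (maximum rate), -P (port) and -i (key file).

```python
import os
import shlex

# Assumption: default key location of an Aspera Connect install on Linux.
ASPERA_KEY = os.path.expanduser("~/.aspera/connect/etc/asperaweb_id_dsa.openssh")


def ascp_command(remote, outdir=".", key=ASPERA_KEY, max_rate="300m"):
    """First choice: Aspera transfer of one remote file into outdir."""
    return ["ascp", "-QT", "-l", max_rate, "-P", "33001", "-i", key, remote, outdir]


def sratoolkit_commands(accession, outdir="."):
    """Fallbacks: prefetch the .sra file, then convert it with fastq-dump."""
    return [
        ["prefetch", accession],
        ["fastq-dump", "--split-files", "--gzip", "--outdir", outdir, accession],
    ]


# Usage: print the commands, then run them by hand or with subprocess.run().
# The remote path below is hypothetical; take the real one from ENA.
# remote = "era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR123/007/SRR1234567/SRR1234567_1.fastq.gz"
# print(shlex.join(ascp_command(remote, outdir="data")))
# for cmd in sratoolkit_commands("SRR1234567", outdir="data"):
#     print(shlex.join(cmd))
```

Building the commands as lists rather than one long string avoids quoting problems if you later pass them to subprocess.run().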
Warning: Try not to use wget or curl to download; they may leave the downloaded sra files incomplete.
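Whichever tool you use, one way to catch an incomplete download is to compare the file's MD5 checksum with a published one. The sketch below assumes you already have the expected checksum (for example, the one ENA lists for each fastq.gz file); the function names are my own.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 of a file in 1 MiB chunks, so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_complete(path, expected_md5):
    """True if the local file matches the published checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For .sra files, sratoolkit also ships a vdb-validate command that checks the integrity of a downloaded run.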
You can also speed up fastq-dump by using the parallel version: https://github.com/rvalieris/parallel-fastq-dump
I would try it, thx!
Thanks Wenhu_Cao,
for a related post with a few more details, see also: Fast download of FASTQ files from the European Nucleotide Archive (ENA)