You can easily check for alignment information in the SRA Run Browser.
I had a conversation with the SRA team once where they explained to me that they really optimized SRA for generating FASTQ and running BLAST queries and not for generating SAM. I’ve noticed that SAM dump is usually slower than I’d like, but if you're truly getting alignment info I’m sure you’re saving time over aligning the FASTQ.
Nice, I'd never noticed the alignment window there before (it's unfortunate that this dataset used NCBI "chromosome names"). I guess I've never been trusting enough of what other people did to want to actually use their alignments...
It's ridiculously fast (the example command has a bandwidth request of 100Mb/s, but I've used 400Mb/s before; it depends on your local setup). You can then dump the FASTQ from the downloaded .sra file using the toolkit's fastq-dump --split-3.
Do you have more than 100Mb/s available? Aspera will happily use all the bandwidth it can lay its hands on (up to 10 Gbps) as long as the source supports it (NCBI does).
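To put those bandwidth figures in perspective, here is a rough back-of-the-envelope sketch. The 44 GB archive size is only an illustrative input, the function name is made up for this example, and the estimate assumes decimal units and a link that actually sustains the requested rate, which real transfers rarely do.

```python
def transfer_seconds(size_gb, rate_mbps):
    """Idealized transfer time in seconds.

    size_gb   -- file size in gigabytes (decimal, 1 GB = 8000 megabits)
    rate_mbps -- sustained rate in megabits per second (assumed constant)
    """
    return size_gb * 8 * 1000 / rate_mbps


if __name__ == "__main__":
    for rate in (100, 400):
        minutes = transfer_seconds(44, rate) / 60
        print(f"44 GB at {rate} Mb/s: about {minutes:.0f} min")
```

Even under these optimistic assumptions the download itself is a matter of minutes to an hour, so for large runs the conversion step is likely to dominate.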
There's nothing you can do about it. If you have access to an SSD, it will speed things up, but fastq-dump will always be slow, especially on GPFS, where the random access slows the system down a lot.
If your file system is already slow in the first place, you will have a hard time. See if the data are mirrored at the European Nucleotide Archive (ENA), which also supports Aspera download of FASTQ instead of SRA.
I don't think random access is the problem. The SRA format is a column-oriented database, so there should be very little seeking when you're dumping FASTQ. I think the slowdown you're seeing when dumping SAM is because the SAM fields (alignment information) are not retrieved as efficiently as the sequence and quality scores.
The solution is to use ENA rather than SRA - everything apart from the controlled-access data is mirrored across, and ENA stores the raw FASTQ, which can be downloaded directly with ascp.
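One way to find where ENA keeps the FASTQ for a given run is its Portal API file report. The sketch below only builds the request URL and does not contact the server. The endpoint and the field names (fastq_ftp, fastq_aspera) are what I understand the ENA Portal API to accept, so treat them as assumptions and check them against the ENA documentation. The helper name is made up for this example.

```python
from urllib.parse import urlencode

# Assumed ENA Portal API endpoint for per-run file listings.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def ena_filereport_url(accession):
    """Build a file-report URL asking for the FTP and Aspera FASTQ paths of a run."""
    params = {
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp,fastq_aspera",
        "format": "tsv",
    }
    return f"{ENA_FILEREPORT}?{urlencode(params)}"


if __name__ == "__main__":
    # Open the printed URL in a browser or fetch it with curl/wget.
    print(ena_filereport_url("SRR2095320"))
```

If the run is mirrored, the returned table should list one path per FASTQ file, which can then be passed to ascp or an FTP client.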
You can dump “spots” 1 through n using one process, and n+1 through k using another process. Basically, run fastq-dump several times on the same SRA archive, with each process exporting a different chunk of the FASTQ file. This will scale until you run out of disk IO or CPU threads.
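A minimal sketch of that chunking idea follows. It only builds the command lines and does not run them. It assumes you already know the total number of spots in the run (for example from the run's page at SRA), and it relies on fastq-dump's -N/--minSpotId, -X/--maxSpotId and -O/--outdir options. The function names, the chunk_NNN directory naming and the spot count in the usage example are made up for illustration.

```python
def spot_ranges(total_spots, n_chunks):
    """Split spots 1..total_spots into contiguous, non-overlapping (first, last) ranges."""
    base, extra = divmod(total_spots, n_chunks)
    ranges = []
    start = 1
    for i in range(n_chunks):
        # The first `extra` chunks take one additional spot each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def fastq_dump_commands(sra_path, total_spots, n_chunks):
    """One fastq-dump argument list per chunk, each writing to its own directory."""
    commands = []
    for i, (first, last) in enumerate(spot_ranges(total_spots, n_chunks)):
        commands.append([
            "fastq-dump", "--split-files",
            "-N", str(first), "-X", str(last),
            "-O", f"chunk_{i:03d}",  # separate outdir so chunks do not overwrite each other
            sra_path,
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical spot count, for illustration only.
    for cmd in fastq_dump_commands("./SRR2095320.sra", 1_000_000, 4):
        print(" ".join(cmd))
```

Each printed command can be started as its own background job or cluster task. Afterwards, concatenate the per-chunk files in chunk order, separately for each read file, so that paired reads stay in sync.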
I would like to share my twists and turns with SRA download. It took several interactions with NCBI staff to figure this out. Below are the steps:
Prerequisite:
sra-toolkit and the Aspera plugin installed. These instructions are specific to a Linux environment.
Steps:
Configure workspace
The workspace for downloading the SRA data must be cached. Although sra-toolkit is installed centrally, this needs to be set manually for every user. Please follow this link and navigate to the section Configuring the Toolkit.
Assuming sra-toolkit is installed or loaded, run the following command and complete setup as mentioned in the link.
vdb-config -i
Download SRA file
prefetch -X 200G SRR2095320 -a "/depot/bioinfo/apps/apps/aspera-connect-3.1.1.70545/bin/ascp|/depot/bioinfo/apps/apps/aspera-connect-3.1.1.70545/etc/asperaweb_id_dsa.putty"
where "-a" specifies the paths to the Aspera binary and the private key file, separated by "|". Prefetch will download the SRA data as well as all needed references to your local cache. This prevents sending multiple requests to the NCBI servers and saves substantial time.
Demultiplex with sra-toolkit
fastq-dump -I --split-files ./SRR2095320.sra
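For anyone who wants to script the two commands above, here is a small sketch that assembles them as argument lists and, only if you call run_pipeline yourself, runs them in order. The function names and the /path/to/... placeholders are hypothetical. The flags are the same ones used in the commands above.

```python
import subprocess


def build_prefetch_cmd(accession, ascp_bin, key_file, max_size="200G"):
    """prefetch with a size cap (-X) and the Aspera binary and key joined by '|' (-a)."""
    return ["prefetch", "-X", max_size, accession, "-a", f"{ascp_bin}|{key_file}"]


def build_fastq_dump_cmd(sra_file):
    """fastq-dump with read IDs appended to spot names (-I) and one file per read."""
    return ["fastq-dump", "-I", "--split-files", sra_file]


def run_pipeline(accession, ascp_bin, key_file, sra_file):
    """Run prefetch, then fastq-dump; stops with an exception if either step fails."""
    subprocess.run(build_prefetch_cmd(accession, ascp_bin, key_file), check=True)
    subprocess.run(build_fastq_dump_cmd(sra_file), check=True)


if __name__ == "__main__":
    # Placeholder paths: substitute the locations of your own Aspera install.
    print(build_prefetch_cmd("SRR2095320", "/path/to/ascp", "/path/to/asperaweb_id_dsa.putty"))
    print(build_fastq_dump_cmd("./SRR2095320.sra"))
```

Where prefetch puts the .sra file depends on the workspace configured with vdb-config, so the path given to fastq-dump may need adjusting.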
I found the above approach extremely fast. Using it, the 44 GB SRR2095320.sra file was downloaded with prefetch and converted to FASTQ (151 GB R1 + 151 GB R2) in about 25 hours using 10 processors. By contrast, standard fastq-dump alone had converted only 80 GB R1 + 80 GB R2 after more than 70 hours, and the job then failed because of a connection issue.
You can even choose to download as FASTA (without quality scores) with the --fasta option to reduce the download volume; however, that's probably not recommended! :-)
Are you able to use FASTQ files (which you can then align yourself)? If so, get them from EBI-ENA (example).
It's quite likely that what you're getting is an unaligned BAM file, which is largely useless.
Maybe you can try aspera.