Downloading paired-end FASTQ data from SRA
3 months ago
Matheus • 0

Hello,

I am trying to download the single-cell data for project PRJNA971535 using the SRA Run Selector and the prefetch command. After downloading the file, I use the command fastq-dump --split-files, but I am getting only a single output file (e.g., SRR24503270_1.fastq).

I have already tried both fastq-dump and fasterq-dump with every split option I could find (e.g., --split-files), but I still get only one FASTQ file.

The library is definitely paired-end, as mentioned in the paper. Does anyone know how to properly split these samples? I have emailed the authors but have not received a response.

SRA single-cell RNA-Seq • 1.0k views

Did you try the -3 option? Also, it might be relevant that this is single-cell sequencing and the original data was submitted in 10x Genomics BAM format. Just some wild guesses... Could you check whether the reads in the single file you got are interleaved?
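One way to run that interleaving check is sketched below. It is a heuristic, not part of sra-tools: the function name `check_interleaved` is made up for illustration, and it assumes an uncompressed FASTQ with four lines per record where mates share a read name (optionally ending in /1 or /2).

```shell
# Heuristic: in an interleaved FASTQ, records 1+2, 3+4, ... share a read name.
# Looks at the first N records (default 1000) and reports how many
# consecutive record pairs have matching names.
check_interleaved() {
    local fastq="$1"
    local n="${2:-1000}"
    head -n "$((n * 4))" "$fastq" | awk '
        NR % 4 == 1 {                 # header line of each 4-line record
            name = $1
            sub(/^@/, "", name)
            sub(/\/[12]$/, "", name)  # drop a trailing /1 or /2 if present
            rec++
            if (rec % 2 == 1) { first = name }
            else { pairs++; if (name == first) same++ }
        }
        END {
            if (pairs == 0) { print "too few records"; exit 1 }
            printf "%d of %d consecutive pairs share a name\n", same, pairs
        }'
}

# Usage:
#   check_interleaved SRR24503270_1.fastq
```

If nearly all pairs match, the file is probably interleaved. If none match, the mates are either missing or were concatenated into single records, in which case the read lengths are worth a look.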


Thanks for all the answers!

I downloaded the BAM files and converted them with bamtofastq, and it worked! I now have the paired-end files.

Thanks again! =]

3 months ago
GenoMax 147k

The landscape of 10x data in SRA is all over the place. Because of the complexity of the data (and the vintage of SRA), data submission is complicated for both parties.

Fortunately, the submitters provided the original 10x BAM file, which is listed in the Data Access tab: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR24503270&display=data-access

You can download that file and then use the bamtofastq utility that 10x provides to convert the data back to FASTQ format: https://github.com/10XGenomics/bamtofastq/releases
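A minimal sketch of those two steps follows. The helper names `run` and `bam_to_fastq` are made up for illustration, and the BAM URL is a placeholder that has to be copied from the Data Access tab. It assumes curl and the bamtofastq binary are on your PATH. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Download the submitter's original 10x BAM and convert it back to FASTQ.
#   $1  BAM URL (placeholder: copy the real link from the Data Access tab)
#   $2  output directory for the FASTQ files
#   $3  number of threads (optional, default 4)
bam_to_fastq() {
    local bam_url="$1"
    local out_dir="$2"
    local threads="${3:-4}"
    local bam="${bam_url##*/}"        # local file name = last URL component
    run curl -L -o "$bam" "$bam_url" || return 1   # -L follows redirects
    run bamtofastq --nthreads="$threads" "$bam" "$out_dir"
}

# Usage (the URL here is hypothetical):
#   DRY_RUN=1 bam_to_fastq "https://example.org/path/sample.bam" fastq_out 8
```

The resulting FASTQ files can then be passed to Cell Ranger as usual.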

3 months ago
jaro.slamecka ▴ 270

This is a notorious problem we run into all the time with SRA. Sometimes it can be fixed with additional flags, as described by others. The next thing to check is whether you can get the data from the European Nucleotide Archive (ENA), and it seems that you can for at least some samples:

https://www.ebi.ac.uk/ena/browser/view/PRJNA971535
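To see from the command line which runs have split FASTQ files at ENA, the ENA Portal API file report can be queried. This is a sketch: the function names are made up for illustration, and it assumes the `fastq_ftp` column holds semicolon-separated paths without a URL scheme, which is how ENA has usually reported them.

```shell
# Build the ENA file report URL for a project or run accession.
ena_filereport_url() {
    local acc="$1"
    printf 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=%s&result=read_run&fields=run_accession,fastq_ftp,submitted_ftp&format=tsv\n' "$acc"
}

# Fetch the report as TSV (needs network access and curl).
ena_list_fastq() {
    curl -fsS "$(ena_filereport_url "$1")"
}

# Read a file report on stdin and print one download URL per FASTQ file.
# The fastq_ftp column is located by its header name, not by position.
ena_fastq_urls() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "fastq_ftp") col = i; next }
        col && $col != "" {
            n = split($col, f, ";")
            for (i = 1; i <= n; i++) print "ftp://" f[i]
        }'
}

# Usage:
#   ena_list_fastq PRJNA971535 | ena_fastq_urls > urls.txt
#   wget -i urls.txt
```

Runs with an empty `fastq_ftp` field are the ones ENA cannot help with.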

In cases where ENA can't help either, my last resort is to have SRA deliver the FASTQ files into a Google Cloud (or AWS) bucket. On the SRA Run Selector site, you mark your samples, then select "Deliver Data". It also involves modifying the bucket permissions as per the SRA instructions, but then you just wait, usually about a day, before the data is ready to move to your cluster. This is not free, though: most recently, Google charged me around $83 for "network transfer" and "download", in addition to $0.35 for "storage". The volume of data was around 600 GB (10x scRNA-seq FASTQ). The fees may be region-dependent, as previously I only paid for storage in a different region (us-east). SRA itself does not charge for this service.
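For a rough idea of what a different dataset size might cost, the single data point above (about $83 for about 600 GB) can be scaled linearly. This is only back-of-the-envelope arithmetic: the function name is made up, the rate is anecdotal, and actual charges depend on the provider, region, and current pricing.

```shell
# Scale the anecdotal figure from this post (~$83 for ~600 GB) to another size.
# Not a price list: check your cloud provider's current egress pricing.
estimate_egress_usd() {
    local size_gb="$1"
    awk -v gb="$size_gb" 'BEGIN { printf "%.2f\n", gb * 83 / 600 }'
}

# Usage:
#   estimate_egress_usd 250
```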

The good thing about Cloud Delivery is that you get the original author-deposited split FASTQ files, not chewed up by SRA, and they likely follow the Illumina naming convention, which comes in handy for running Cell Ranger.


Thanks for adding this perspective. Information about setting up your own cloud instance to receive data is detailed by NCBI here: https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud/

As with "all things cloud", costs can quickly spiral out of hand if one is not careful.
