Question

Downloading paired end fastq data from SRA

0

3 months ago

Matheus • 0

Hello,

I am trying to download the single-cell data for project PRJNA971535 using the SRA Run Selector and the prefetch command. After downloading the file, I use the command fastq-dump --split-files, but I am getting only a single output file (e.g., SRR24503270_1.fastq).

I have already tried both fastq-dump and fasterq-dump with all possible split parameters (e.g., --split-files), but despite using these parameters, I only receive one FASTQ file.

The library is definitely paired-end, as mentioned in the paper. Does anyone know how to properly split these samples? I have emailed the authors but have not received a response.

SRA single-cell RNA-Seq • 1.0k views

ADD COMMENT • link 3 months ago by Matheus • 0

1

Did you try the -3 (--split-3) option? Also, it might be relevant that this is single-cell sequencing and the original data was in 10x Genomics BAM format. Just some wild guesses... Could you check whether the reads in the single file you got are interleaved?
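One quick way to test for interleaving is to compare the names of consecutive records. The following is a minimal sketch, not a validated tool: it assumes mates share the same read name (ignoring an optional /1 or /2 suffix and anything after the first whitespace), and the function names are made up for illustration.

```python
import gzip
import itertools


def read_names(path, limit=1000):
    """Yield read names from the first `limit` FASTQ records of `path`."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # A FASTQ record spans 4 lines; the header is the first of each group.
        for header in itertools.islice(handle, 0, limit * 4, 4):
            fields = header[1:].split()
            if not fields:
                break
            name = fields[0]
            # Assumption: mates may be tagged with a trailing /1 or /2.
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def looks_interleaved(path, limit=1000):
    """Return True if every consecutive pair of records shares a read name."""
    names = list(read_names(path, limit))
    pairs = list(zip(names[0::2], names[1::2]))
    return bool(pairs) and all(first == second for first, second in pairs)
```

Calling `looks_interleaved("SRR24503270_1.fastq")` on the file from the question would only look at the first 1000 records, so treat the result as a hint rather than proof.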

ADD REPLY • link 3 months ago by Michael 55k

1

https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR24503270&display=metadata

"This run has 1 read per spot:"

ADD REPLY • link 3 months ago by Pierre Lindenbaum 164k

0

Thanks for all the answers!

I tried downloading the BAM files and converting them with bam2fastq, and it worked! I now have the paired-end files.

Thanks again! =]

ADD REPLY • link 3 months ago by Matheus • 0

score 5 · Accepted Answer · 2024-08-21

The landscape of 10x data in SRA is all over the place. Because of the complexity of the data (and the vintage of SRA), data submission is complicated for both parties.

Fortunately, the submitters provided the original 10x BAM file in the Data Access tab: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR24503270&display=data-access

You can download that and then use the bamtofastq utility that 10x provides to convert the data back to FASTQ format: https://github.com/10XGenomics/bamtofastq/releases
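As a rough sketch of the conversion step, the command can be assembled and run from Python. The assumptions here: the downloaded binary is on your PATH under the name `bamtofastq` (release binaries may carry a platform suffix), the BAM filename is a placeholder, and the output directory does not exist yet. The helper names are hypothetical.

```python
import subprocess


def bamtofastq_command(bam_path, out_dir, threads=4, executable="bamtofastq"):
    """Build the argument list: bamtofastq [options] <bam> <output-path>."""
    return [executable, f"--nthreads={threads}", bam_path, out_dir]


def run_bamtofastq(bam_path, out_dir, threads=4, executable="bamtofastq"):
    """Run the conversion and raise CalledProcessError if the tool fails."""
    cmd = bamtofastq_command(bam_path, out_dir, threads, executable)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder paths; substitute the BAM you downloaded from Data Access.
    print(" ".join(bamtofastq_command("sample.bam", "sample_fastq", threads=8)))
```

Building the command separately from running it makes it easy to print and inspect the exact call before committing to a long conversion.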

score 3 · Accepted Answer · 2024-08-22

This is a notorious problem we run into all the time with SRA. Sometimes it can be fixed with additional flags, as described by others, but the next step is to check whether you can get the data from the European Nucleotide Archive, and it seems you can for at least some samples:

https://www.ebi.ac.uk/ena/browser/view/PRJNA971535
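To see which runs ENA actually offers as FASTQ, the ENA Portal API file report can be queried instead of clicking through the browser. This is a sketch under assumptions: the field list is one reasonable choice, the helper names are made up, and the parser only expects a tab-separated report with `run_accession` and `fastq_ftp` columns.

```python
import csv
import io
from urllib.parse import urlencode

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession, fields=("run_accession", "fastq_ftp", "submitted_ftp")):
    """Build a file report URL for a study or run accession."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_filereport(tsv_text):
    """Map each run accession to its list of FASTQ paths (possibly empty)."""
    runs = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # ENA separates multiple files for one run with semicolons.
        files = [f for f in (row.get("fastq_ftp") or "").split(";") if f]
        runs[row["run_accession"]] = files
    return runs
```

Fetch the report with, for example, `urllib.request.urlopen(filereport_url("PRJNA971535")).read().decode()` and pass the text to `parse_filereport`. Runs with two or more listed files are the ones worth downloading from ENA; runs with an empty list need one of the other routes.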

When ENA can't help either, my last resort is to have SRA deliver the FASTQ files to a Google Cloud (or AWS) bucket. On the SRA Run Selector site, mark your samples, then select "Deliver Data". You also need to modify the bucket permissions as per the SRA instructions; then you wait, usually about a day, before the data is ready to move to your cluster. This is not free, though: most recently, Google charged me around $83 for "network transfer" and "download", in addition to $0.35 for "storage". The data volume was around 600 GB (10x scRNA-seq FASTQ). The fees may be region-dependent, since previously I only paid for storage in a different region (us-east). SRA itself does not charge for this service.

The good thing about Cloud Delivery is that you get the original author-deposited split FASTQ files, not chewed up by SRA, and they likely follow the Illumina naming convention, which comes in handy for running Cell Ranger.
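Before pointing Cell Ranger at delivered files, it can help to check that the names match the Illumina-style pattern (sample, `S` number, lane, read type, `001`). The sketch below assumes the common lane-containing form `Sample_S1_L001_R1_001.fastq.gz`; the regular expression and helper names are illustrative, and the example filenames are made up.

```python
import re
from collections import defaultdict

# Assumed pattern: <sample>_S<n>_L<lane>_<R1|R2|I1|I2>_001.fastq.gz
ILLUMINA_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<number>\d+)_L(?P<lane>\d{3})_(?P<read>[RI][12])_001\.fastq\.gz$"
)


def group_by_sample(filenames):
    """Return {sample: set of read types}, skipping names that do not match."""
    samples = defaultdict(set)
    for name in filenames:
        match = ILLUMINA_NAME.match(name)
        if match:
            samples[match.group("sample")].add(match.group("read"))
    return dict(samples)


def paired_samples(filenames):
    """List the samples that have both an R1 and an R2 file."""
    grouped = group_by_sample(filenames)
    return sorted(s for s, reads in grouped.items() if {"R1", "R2"} <= reads)
```

For instance, `paired_samples(["tumor_S1_L001_R1_001.fastq.gz", "tumor_S1_L001_R2_001.fastq.gz"])` lists `tumor`, while files named like `SRR24503270_1.fastq` are skipped and would need renaming first.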