Question

Downloading paired end fastq from SRA

4

5.0 years ago

t.t ▴ 40

Hi everyone,

I would really like to download the raw data of a specific public single-cell RNA-Seq experiment (ENA, GEO). As the BCL files do not seem to be available, the most "raw" format would probably be paired-end fastq files. Currently I am unable to download the reads split into separate files, and I would really appreciate your help.

For simplicity just focus on one sample: Donor1_scRNA-seq_rep1 (GSM3052917, Experiment: SRX3815586, Run: SRR6860519)

I already tried fastq-dump and fasterq-dump with all possible split parameters (--split-files etc.), but regardless of the parameter I receive just one fastq file.

fastq-dump --split-files SRR6860519
fasterq-dump -S SRR6860519

The library type is definitely paired and at ENA one can see two submitted MD5-sums per sample.

Does anyone know how to split these samples correctly? And does it make a difference if I provide the experiment accession or the run accession to fastq-dump/fasterq-dump?

Thanks in advance!

RNA-Seq singlecell SRA • 11k views

ADD COMMENT • link updated 3.7 years ago by Brunox13 ▴ 50 • written 5.0 years ago by t.t ▴ 40

1

Although the sample is described as paired-end, I am sure the run contains only one read per spot. The SRA run browser shows the note "This run has 1 read per spot"; see https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR6860519

ADD REPLY • link 5.0 years ago by zhangdengwei ▴ 210

1

Yes it does. Not the first time there is something missing on NCBI. Contacting the authors is probably your best choice.

ADD REPLY • link 5.0 years ago by ATpoint 81k

2

I think the authors only uploaded the R2 fastq files, and not the R1 file containing the UMI sequence. Here you can read in Extraction protocol and Data processing that R1 is 26 nt and R2 is 100 nt long. If you look in the fastq file, you see only 100 (101) nt long reads. If you want the UMI as well, I am afraid you'll have to ask the authors (as ATpoint is suggesting).
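The read-length check described above can be sketched as follows. This is only a minimal illustration: the function name `fastq_length_hist` and the demo file are made up for this example, and it assumes a standard four-line-per-record FASTQ.

```shell
# Tabulate read lengths in a FASTQ file (plain or gzipped).
# In a four-line FASTQ record the sequence is line 2, 6, 10, ...
fastq_length_hist() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'NR % 4 == 2 { n[length($0)]++ }
                END { for (len in n) print len "\t" n[len] }' | sort -n
}

# Tiny made-up example: two 5-nt reads and one 3-nt read.
demo=$(mktemp)
printf '@r1\nACGTA\n+\nIIIII\n@r2\nTTTTT\n+\nIIIII\n@r3\nACG\n+\nIII\n' > "$demo"
fastq_length_hist "$demo"   # prints length<TAB>count, one line per length
rm -f "$demo"
```

Run on the file dumped from SRR6860519, one would expect only lengths around 100 nt and no 26-nt reads, if the description above is right.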

ADD REPLY • link 5.0 years ago by Benn 8.3k

0

Can we analyze single-cell sequencing data without a .fastq file containing the UMI information?

ADD REPLY • link 4.7 years ago by singlecell • 0

0

Thanks for pointing that out.

What I am still curious about are the two MD5 checksums that are available per sample (at ENA). Wouldn't that mean that the authors indeed uploaded two files per sample?

Edit: Found the answer myself for the two checksums: At ENA there were two files submitted per sample: a BAM file and a related index (.BAI).
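One way to see which files were actually submitted for a run is to ask ENA for its file report. The sketch below only builds the request URL. The helper name `ena_filereport_url` is made up, and the field names (`submitted_format`, `submitted_ftp`, `submitted_md5`, `fastq_ftp`) are written as I understand the ENA Portal API, so they should be checked against its documentation.

```shell
# Build an ENA Portal API "filereport" URL for one run accession.
# Assumption: these field names are valid for result=read_run.
ena_filereport_url() {
    printf 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=%s&result=read_run&fields=%s\n' \
        "$1" "run_accession,submitted_format,submitted_ftp,submitted_md5,fastq_ftp"
}

ena_filereport_url SRR6860519

# To fetch the tab-separated report (needs network access):
#   curl -s "$(ena_filereport_url SRR6860519)"
```

If the submitted files turn out to be a BAM plus its index, two MD5 sums per sample do not imply two FASTQ files.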

ADD REPLY • link 5.0 years ago by t.t ▴ 40

0

I know this is old but I came across this thread because I had problems downloading paired-end data from SRA as well (getting only one file even though protocol says it was paired). Only after reading the description I realized it is the exact same data set I tried to access. I'll just use this opportunity to blow off some steam. Does anybody really want me to believe submitting only 1 of 2 FASTQ files was NOT intentional?!

Please don't ever do this because I've wasted hours trying to get the data. "Fun" fact, a reviewer would like me to look at this data set. Well, I'd love to do that... If this reviewing process was in a journal of the Nature publishing group (in which the data set was published) that would be the true full circle face palm.

ADD REPLY • link 4.6 years ago by Roman Hillje ▴ 90

1

Actually, no FASTQ files were deposited. See previous comment:

Found the answer myself for the two checksums: At ENA there were two files submitted per sample: a BAM file and a related index (.BAI).

The caveat is that this is a 10x Genomics single-cell data submission. For those, GEO suggests submitting the BAM files. From GEO:

for single-cell sequencing studies (e.g. 10x Genomics, Drop-Seq, InDrops), we can support the submission of multiplexed files in cases where these files are required for reanalysis in your pipeline, or when demultiplexing would create an unmanageable number of files

This is one of the reasons to use ENA. Same file there where it is more obvious what is happening: https://www.ebi.ac.uk/ena/data/view/SRS3065426

ADD REPLY • link 4.6 years ago by igor 13k

1

You can find the R2 (100 bp) in the SRA and can download it from there (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR6860519). Which is what I did, and then I found out that's all there is... And the BAM file (output of Cell Ranger) is pretty useless if they don't also upload the transcript count matrix to GEO, which they could easily do; I've seen that before. And while I agree that GEO suggesting the BAM file upload for 10x Genomics single-cell data is a problem, the bioinformaticians/computational biologists involved surely know how useless those files are for validating, confirming, or reproducing the results.

ADD REPLY • link 4.6 years ago by Roman Hillje ▴ 90

3

Yes, they convert the BAM to SRA, but without the additional info, it is not very useful. The BAM still retains the barcodes.

10x provides a tool to generate the proper FASTQs from the BAM: https://support.10xgenomics.com/docs/bamtofastq
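A minimal sketch of how the tool is invoked, assuming `bamtofastq` is on the PATH. The BAM and output names below are placeholders rather than the real file names from this submission, and the `run` wrapper and `DRY_RUN` switch are made up for the example so the command can be inspected before it is executed.

```shell
# Placeholder paths: point BAM at the downloaded submitted BAM file.
BAM="${BAM:-sample.bam}"
OUT="${OUT:-sample_fastq}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = actually run it

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage pattern: bamtofastq [options] <bam> <output-path>
run bamtofastq --nthreads=4 "$BAM" "$OUT"
```

The output path should then contain the regenerated R1/R2/I1 files (possibly in subdirectories), which can be passed to Cell Ranger through its --fastqs option.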

ADD REPLY • link 4.6 years ago by igor 13k

0

Thank you for the link. I didn't know about that tool. I will test it with the BAM files provided by the authors and, if I remember, report back if I was successful.

ADD REPLY • link 4.6 years ago by Roman Hillje ▴ 90

0

While the results I got indicated serious issues in some samples according to Cell Ranger (few reads in cells), it did work as expected. That is, you download the BAM, re-generate the FASTQ files from it using the bamtofastq tool posted by Igor, and use those as input to Cell Ranger. I couldn't find info on which chemistry version of the library preparation kit the authors used, so I don't know whether Cell Ranger correctly identified it as v2. If Cell Ranger assessed the cell counts correctly, we see on average ~2,000-3,000 cells per sample (10 samples), leaving a big jump to the final data set of ~6,500 cells shown in the publication. Moreover, in the Materials & Methods part of the paper they say the cellular barcode is 10 bp and the UMI 16 bp, which would be new to me.

I have a different understanding of reproducibility.

ADD REPLY • link 4.6 years ago by Roman Hillje ▴ 90

0

The recommended sequencing protocol for 10x v2 is 26+8+98, so those numbers make sense. It's supposed to be a 16 bp cell barcode and a 10 bp UMI (16 + 10 = 26, the length of R1). The paper probably just swapped the two numbers by mistake.
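To make the 26 = 16 + 10 layout concrete, here is a small sketch that splits each R1 sequence into its cell barcode and UMI. The function name `split_r1` and the demo read are made up, and the positions assume the v2 layout described above (barcode first, then UMI).

```shell
# Split each R1 sequence into cell barcode (bases 1-16) and UMI (bases 17-26).
# Assumes 10x v2 chemistry and a four-line-per-record FASTQ.
split_r1() {
    awk 'NR % 4 == 2 { print substr($0, 1, 16) "\t" substr($0, 17, 10) }' "$1"
}

# Made-up 26-nt R1 read for illustration.
demo=$(mktemp)
printf '@r1\nAAAACCCCGGGGTTTTACGTACGTAC\n+\nIIIIIIIIIIIIIIIIIIIIIIIIII\n' > "$demo"
split_r1 "$demo"   # prints AAAACCCCGGGGTTTT<TAB>ACGTACGTAC
rm -f "$demo"
```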

I've never had Cell Ranger misidentify chemistry, so that probably should not be a major concern. Based on the date, it was before v3. Based on the sequencing protocol, it was not v1.

They also provide the raw counts matrix on GEO, so you can also compare what you get to that one.

ADD REPLY • link 4.6 years ago by igor 13k

0

Yeah, most reviewers don't even check the GEO submission; it's sad but true. Did anyone contact the authors of the paper about this? On the GEO page there is an email address. If they don't answer, it might be worth raising this with the editor of the journal (or starting a discussion on Twitter).

And maybe contact GEO as well, they should check the submissions.

ADD REPLY • link 4.6 years ago by Benn 8.3k

score 1 · Answer 1 · 2020-08-15

I had a somewhat similar problem - I was using fasterq-dump (and I tried all possible --split settings and even absence thereof) to download scRNA-seq data (for example, run SRR9169172) and was only getting a single fastq file every time.

What helped was the following:

Upgrading sra-tools to the newest version (currently 2.10.8)
Using prefetch --type fastq SRR9169172 as described here, which directly downloads the data originally deposited by the authors

I was lucky that the authors had deposited the original fastq files, and so prefetch enabled me to download the data in a usable format (unlike fasterq-dump).
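As a rough sketch of the steps above: the prefetch call is shown as a comment because it needs network access, and the helper `count_fastq_files` is made up for this example to check how many FASTQ files actually arrived. It assumes prefetch writes into a directory named after the run accession, which is what recent sra-tools versions usually do.

```shell
# Download the originally submitted files (needs network and sra-tools >= 2.10.8):
#   prefetch --type fastq SRR9169172

# Count FASTQ files under a download directory. Two or more suggests that
# separate read files (e.g. R1 and R2) were deposited by the authors.
count_fastq_files() {
    find "$1" -type f \( -name '*.fastq' -o -name '*.fastq.gz' \
        -o -name '*.fq' -o -name '*.fq.gz' \) | wc -l | tr -d ' '
}

# After the download:
#   count_fastq_files SRR9169172
```

If the count is zero, the authors most likely did not deposit FASTQ files for that run, as in the SRR6860519 case discussed in this thread.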