Hi everyone,
I would really like to download the raw data of a specific public single-cell RNA-Seq experiment (ENA, GEO). As the BCL files do not seem to be available the most "raw" format would probably be paired end fastq files. Currently I am unable to download the files in a split way and I would really appreciate your help.
For simplicity just focus on one sample: Donor1_scRNA-seq_rep1 (GSM3052917, Experiment: SRX3815586, Run: SRR6860519)
I already tried fastq-dump and fasterq-dump with all possible split parameters (--split-files etc.), but regardless of the parameter I just receive one fastq file:
fastq-dump --split-files SRR6860519
fasterq-dump -S SRR6860519
The library type is definitely paired, and at ENA one can see two submitted MD5 sums per sample.
Does anyone know how to split these samples correctly? And does it make a difference if I provide the experiment accession or the run accession to fastq-dump/fasterq-dump?
Thanks in advance!
Although the sample is described as paired-end, I am sure the run only contains one read. The SRA run page carries the note "This run has 1 read per spot": https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR6860519
Yes it does. Not the first time there is something missing on NCBI. Contacting the authors is probably your best choice.
I think the authors only uploaded the R2 fastq files, and not the R1 file containing the UMI sequence. Here you can read in "Extraction protocol" and "Data processing" that R1 is 26 nt and R2 is 100 nt long. If you look in the fastq file, you see only 100 (101) nt long reads. If you want the UMI as well, I am afraid you'll have to ask the authors (as ATpoint is suggesting).
Can we analyze Single-cell Sequencing data without a .fastq file containing information related to UMI?
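The read-length check mentioned above is quick to do with awk. This is only a minimal sketch: the function name is made up for illustration, and it assumes standard 4-line FASTQ records (which is what fastq-dump writes by default).

```shell
# Tally read lengths in a FASTQ file (plain or gzipped).
# Assumes 4-line records; prints "<length> <count>" per line, sorted by length.
# The function name is hypothetical, chosen for this example only.
fastq_length_histogram() {
    fastq="$1"
    case "$fastq" in
        *.gz) gzip -dc "$fastq" ;;
        *)    cat "$fastq" ;;
    esac |
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) print len, n[len] }' |
    sort -n
}
```

For example, `fastq_length_histogram SRR6860519_1.fastq` (the file name is an assumption about what fastq-dump produced). If the 26 nt R1 were in there, you would expect a separate file or a second length class of around 26 nt.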
Thanks for pointing that out.
What I am still curious about are the two MD5 checksums that are available per sample (at ENA). Wouldn't that mean that the authors indeed uploaded two files per sample?
Edit: Found the answer myself for the two checksums: At ENA there were two files submitted per sample: a BAM file and a related index (.BAI).
I know this is old but I came across this thread because I had problems downloading paired-end data from SRA as well (getting only one file even though protocol says it was paired). Only after reading the description I realized it is the exact same data set I tried to access. I'll just use this opportunity to blow off some steam. Does anybody really want me to believe submitting only 1 of 2 FASTQ files was NOT intentional?!
Please don't ever do this because I've wasted hours trying to get the data. "Fun" fact, a reviewer would like me to look at this data set. Well, I'd love to do that... If this reviewing process was in a journal of the Nature publishing group (in which the data set was published) that would be the true full circle face palm.
Actually, no FASTQ files were deposited. See previous comment:
The caveat is that this is a 10x Genomics single-cell data submission. For those, GEO suggests submitting the BAM files. From GEO:
This is one of the reasons to use ENA. Same file there where it is more obvious what is happening: https://www.ebi.ac.uk/ena/data/view/SRS3065426
You can find the R2 (100bp) in the SRA and can download it from there (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR6860519). Which is what I did, and then I found out that's all there is... And the BAM file (output of Cell Ranger) is pretty useless if they don't also upload the transcript count matrix to GEO, which they could easily do; I've seen that before. And while I agree that suggesting the BAM file upload for 10x Genomics single-cell data on GEO is a problem, the bioinformaticians/computational biologists involved surely know how useless those files are for validating, confirming, or reproducing the results.
Yes, they convert the BAM to SRA, but without the additional info, it is not very useful. The BAM still retains the barcodes.
10x provides a tool to generate the proper FASTQs from the BAM: https://support.10xgenomics.com/docs/bamtofastq
Thank you for the link. I didn't know about that tool. I will test it with the BAM files provided by the authors and, if I remember, report back if I was successful.
While the results I got indicated serious issues in some samples according to Cell Ranger (few reads in cells), it did work as expected. That is, you download the BAM, re-generate the FASTQ files from it using the bamtofastq tool posted by Igor, and use those as input to Cell Ranger. I couldn't find any info on which chemistry version of the library preparation kit the authors used, so I don't know whether Cell Ranger correctly identified it as v2. If Cell Ranger assessed the cell counts correctly, we see on average ~2,000-3,000 cells per sample (10 samples), leaving a big jump to the final data set of ~6,500 cells shown in the publication. Moreover, in the Materials & Methods part of the paper they say the cellular barcode is 10 bp and the UMI 16 bp, which would be new to me.
I have a different understanding of reproducibility.
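For anyone landing here later, the BAM -> FASTQ -> Cell Ranger route described above could be sketched roughly like this. Treat it as a sketch under assumptions, not the exact commands I ran: the function name, thread count, paths, and run id are all placeholders, and it assumes bamtofastq writes the FASTQs into subdirectories of the output folder, which are then passed to `--fastqs` as a comma-separated list. Depending on the Cell Ranger version you may need extra arguments, so anything after the fourth argument is passed through to `cellranger count`.

```shell
# Regenerate FASTQs from a Cell Ranger BAM, then rerun cellranger count.
# All names and paths are placeholders (assumptions for illustration).
regenerate_and_count() {
    bam="$1"            # BAM downloaded from the archive
    fastq_dir="$2"      # output directory for bamtofastq (assumed not to exist yet)
    transcriptome="$3"  # Cell Ranger reference directory
    run_id="$4"         # name for the Cell Ranger output folder
    shift 4             # remaining arguments go to cellranger count

    bamtofastq --nthreads=4 "$bam" "$fastq_dir" || return 1

    # Collect the subdirectories written by bamtofastq as a comma-separated list.
    fastq_paths=$(find "$fastq_dir" -mindepth 1 -maxdepth 1 -type d | sort | paste -s -d, -)

    cellranger count --id="$run_id" \
        --transcriptome="$transcriptome" \
        --fastqs="$fastq_paths" "$@"
}
```

Usage would be something like `regenerate_and_count sample.bam fastq_out /path/to/reference donor1_rep1`, with every one of those names being hypothetical.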
The 10x v2 recommended sequencing protocol is 26+8+98, so those numbers make sense. It's supposed to be a 16bp cell barcode and a 10bp UMI. That is probably a typo where they swapped the numbers.
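To make the 16 + 10 layout concrete, here is a small sketch that cuts each 26 nt R1 read into cell barcode and UMI. The function name is made up, and it assumes 4-line FASTQ records with the barcode in positions 1-16 and the UMI in positions 17-26, as described above.

```shell
# Split 10x v2 R1 reads into cell barcode (first 16 nt) and UMI (next 10 nt).
# Assumes 4-line FASTQ records; prints "<read name> <barcode> <UMI>", tab-separated.
# The function name is hypothetical, chosen for this example only.
split_r1_barcode_umi() {
    awk 'NR % 4 == 1 { name = substr($1, 2) }
         NR % 4 == 2 { print name "\t" substr($0, 1, 16) "\t" substr($0, 17, 10) }' "$1"
}
```

Running it on an R1 file regenerated from the BAM should give one line per read, which makes it easy to eyeball whether the barcodes look like 16-mers.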
I've never had Cell Ranger misidentify chemistry, so that probably should not be a major concern. Based on the date, it was before v3. Based on the sequencing protocol, it was not v1.
They also provide the raw counts matrix on GEO, so you can also compare what you get to that one.
Yeah, most reviewers don't even check the GEO submission; it's sad but true. Did anyone contact the authors of the paper about this? There is an email address on the GEO page. If he/she doesn't answer, it might be worth confronting the editor of the journal about this (or starting a discussion on Twitter).
And maybe contact GEO as well, they should check the submissions.