I am downloading reads from SRA to run in CellRanger.
prefetch -p -r yes --max-size 40G -O . SRR10419617
fasterq-dump -O . --threads 4 --mem "26G" --split-3 --skip-technical --print-read-nr --progress SRR10419617
This produces two fastq files:
zcat SRR10419617_1.fastq.gz | head
@SRR10419617.1/1 1 length=8
NGTGGAAC
+SRR10419617.1/1 1 length=8
#AAAFJFF
@SRR10419617.2/1 2 length=8
NGTGGAAC
+SRR10419617.2/1 2 length=8
#AAFFJJJ
zcat SRR10419617_2.fastq.gz | head
@SRR10419617.1/2 1 length=76
NNNGCCTAGTTAACGCATTTACTAAACGCAGACGAAAATGGAAAGATTAATTGGGAGTGGTAGGATGAAACAATTT
+SRR10419617.1/2 1 length=76
###-<<FJFFJJJJJJJ<JJJJJJJJJJJJJJJJFJFJJJJJFJJJJJ<JJJJFJ<JAAJAFFJJFJFJFJJFJFJ
@SRR10419617.2/2 2 length=76
NNNACAGCTATTTCATTATGTGCAATGTGTTACACCCTTTCAAATGTAATAAACTCACAACAAAATTGAAACATAA
+SRR10419617.2/2 2 length=76
###<<FJJJJJJJJJJJJJJFJJJJJJJJJJJJJFAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
I renamed them to fit with CellRanger.
SRR10419617_S1_L001_R1_001.fastq.gz
SRR10419617_S1_L001_R2_001.fastq.gz
Then I run CellRanger:
cellranger count \
--nosecondary \
--id "SRR10419617" \
--transcriptome "${CELLRANGER_DATA}/refdata-gex-GRCh38-2020-A/" \
--fastqs "../../raw/reads/PRJNA588461/SRR10419617" \
--sample "SRR10419617" \
--localcores 4 \
--localmem 25
And it fails with this error:
[error] Pipestance failed. Error log at:
SRR10419617/SC_RNA_COUNTER_CS/SC_MULTI_CORE/MULTI_CHEMISTRY_DETECTOR/_GEM_WELL_CHEMISTRY_DETECTOR/DETECT_COUNT_CHEMISTRY/fork0/chnk0-u1e9cf4fd5a/_errors
Log message:
The read lengths are incompatible with all the chemistries for Sample SRR10419617 in "/raw/reads/PRJNA588461/SRR10419617".
read1 median length = 8
read2 median length = 76
index1 median length = 0
The minimum read length for different chemistries are:
SC5P-R2 - read1: 26, read2: 25, index1: 0
SC5P-PE - read1: 81, read2: 25, index1: 0
SC3Pv1 - read1: 25, read2: 10, index1: 14
SC3Pv2 - read1: 26, read2: 25, index1: 0
SC3Pv3 - read1: 26, read2: 15, index1: 0
SC3Pv3LT - read1: 26, read2: 25, index1: 0
We expect that at least 50% of the reads exceed the minimum length.
I have also tried changing the fastq file names (R1 and R2, R2 as L1, etc.), but I get the same error.
Does anyone know what could be the issue? Incorrect fastq names? Should it be R1, R2 and L1? Is a file missing? Are two fastq files with 76 and 8 nucleotides an expected output for 10X?
sratools/2.10.9
EDirect/15.1
cellranger/6.0.2
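One quick way to see which read each file holds is to tally the read lengths directly. The following is only a minimal sketch: the function name `read_length_tally` is made up for illustration, and it assumes standard 4-line FASTQ records (plain or gzipped). An 8-nt file would be consistent with a sample index rather than the barcode read, which the error message above says should be at least 26 nt for the 3' v2/v3 chemistries.

```shell
# Hypothetical helper: tally read lengths for the first N reads of a FASTQ file.
# Assumes 4-line FASTQ records; works on plain or gzipped input.
read_length_tally() {
    local fastq="$1" n="${2:-10000}"
    # gzip -cdf decompresses gzipped input and passes plain text through unchanged
    gzip -cdf "$fastq" \
        | awk -v n="$n" '
            NR % 4 == 2 { count[length($0)]++; if (++seen >= n) exit }
            END { for (len in count) print len, count[len] }' \
        | sort -n
}
```

Calling `read_length_tally SRR10419617_1.fastq.gz` prints one line per observed length, in the form "length count".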
Thanks for the reply. This seems to be a systemic problem. I think I have looked at 4 different studies and all the 10X SRA files seem to be like this. Does anyone know an SRR ID with 10X data that actually works? Just to test my workflow/script.
firestar, there are plenty of good examples. Here is one: SRR17102621.
fastq-dump will produce three files: 1 = I1, 2 = R1, 3 = R2.
Additional samples: https://www.ncbi.nlm.nih.gov/sra/SRX13290059[accn]
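If the three dumped files really do follow that order (file 1 is the index, file 2 is R1, file 3 is R2), renaming them to the Cell Ranger pattern used earlier in this thread could be sketched as below. The function name `rename_for_cellranger` is hypothetical, the files are assumed to be gzipped already, and the order should be checked for each run because it depends on how the data were submitted.

```shell
# Hypothetical helper: rename <SRR>_1/_2/_3.fastq.gz to the Cell Ranger naming pattern.
# ASSUMPTION: file 1 = I1, file 2 = R1, file 3 = R2; verify read lengths first.
rename_for_cellranger() {
    local srr="$1" dir="${2:-.}"
    local i=0 tag
    for tag in I1 R1 R2; do
        i=$((i + 1))
        mv -- "${dir}/${srr}_${i}.fastq.gz" \
              "${dir}/${srr}_S1_L001_${tag}_001.fastq.gz" || return 1
    done
}
```

For example, `rename_for_cellranger SRR17102621 .` would rename the three files in the current directory and stop at the first file that is missing.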
Your sample (SRR17102621) creates 3 files with this code, while all these variations of fasterq-dump produce just 1 file.

Now for my example (SRR10419617), both tools produce 2 fastq files while I should get 3 (I think). I wonder if there might be more to it than an incorrectly uploaded SRA file.
It seems clear that fasterq-dump should not be used with 10x data at all, since others have reported similar issues. You could try emailing the SRA help desk and asking them about your specific accession. Tell them that the "Data Access" tab shows the three correct files, so the submitters probably did the right upload. You can enumerate the problems with the *-dump programs and mention that you can't download the original files without paying.

I contacted sra-tools and I finally have a solution. This seems to work for the "good" 10x SRA files.
Including --split-files and --include-technical seems to be critical. It doesn't work if --split-3 is used. Not exactly sure what that does anyway. For this sample, using prefetch followed by fasterq-dump with 18 cores produced 3 fastq files (59GB total) in 1.5 hours.

I don't have a solution for the "bad" files yet. 12 of the 13 experiments that I have looked at seem to be "bad" 10x SRA files (PRJNA330719, PRJNA400576, PRJNA548726, PRJNA558893, PRJNA588461, PRJNA593249, PRJNA625951, PRJNA647809, PRJNA661274, PRJNA682432, PRJNA700854, PRJNA700856). I only checked the first SRR for each experiment.
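For the "good" runs, the prefetch plus fasterq-dump combination described above might look roughly like the sketch below. It is a reconstruction from the flags named in this thread, not the exact command that was posted; the `run` wrapper, the `DRY_RUN` variable and the function name `dump_10x_run` are made up for illustration.

```shell
# Hypothetical wrapper: with DRY_RUN set, print the command instead of running it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Sketch: fetch one run, then dump each read (technical reads included) to its own file.
# Deliberately avoids --split-3 and --skip-technical, which were reported not to work here.
dump_10x_run() {
    local srr="$1" threads="${2:-4}"
    run prefetch "$srr" || return 1
    run fasterq-dump --split-files --include-technical --threads "$threads" "$srr"
}
```

Running `DRY_RUN=1 dump_10x_run SRR17102621 18` only prints the two commands, which makes it easy to check them before committing to a long download.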
You should be able to get good fastqs from 10xGenomics.
I mean SRA files.
It is variable. Some runs are fine. People at times will also submit BAM files from cellranger that can be used to reconstitute the fastqs properly.
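When a Cell Ranger BAM is what was submitted, 10x Genomics' bamtofastq tool is one way to rebuild the fastqs. A minimal sketch, assuming the BAM is already on disk and bamtofastq is on the PATH; the `run` wrapper, `DRY_RUN` and the function name `bam_to_fastqs` are hypothetical, and the file names in the usage line are placeholders.

```shell
# Hypothetical wrapper: with DRY_RUN set, print the command instead of running it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Sketch: convert a Cell Ranger BAM back into FASTQ files in the given output directory.
bam_to_fastqs() {
    local bam="$1" outdir="$2" threads="${3:-4}"
    run bamtofastq --nthreads="$threads" "$bam" "$outdir"
}
```

Usage would be along the lines of `bam_to_fastqs possorted_genome_bam.bam fastq_out 8`.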
Can't they be obtained from the ENA instead (https://www.ebi.ac.uk/ena/browser/view/SRX7117651?show=reads)?
sra-explorer gives the following download script to fetch from ENA:
The problem is that file 1 from ENA is the same Illumina index (at least at the beginning):
ahhh .... the level to which single-cell RNA-seq data is often rendered useless during upload to these archives is truly astounding.