I have downloaded the data corresponding to SRR8712342 by doing:
prefetch SRR8712342
I then tried to get the fastq files with fastq-dump. Because it is a 10x scRNA-seq data set, I used the following options:
fastq-dump --split-files --outdir fastq --gzip --readids --read-filter pass --dumpbase --clip --defline-seq '@$ac.$si.$sg/$ri' --defline-qual '+' SRR8712342/SRR8712342.sra -I
I get two fastq files (SRR8712342_pass_2.fastq.gz and SRR8712342_pass_3.fastq.gz). File 2 corresponds to what I think is usually called R1. Here are the first reads:
@SRR8712342.1.CACGCCTT/2
NTTTTGGGCCCCTACTCTATTCCTTTTATGCAAACCTCACAGAATTTTAACCAGAAAGGCCAGGCAGGATGGCTCACGCCTGTAATCACAGCGCTTTG
+
#AAAAEE<</<EEEEEEEEEEE/A<EEAEEA<///EEE/E/E/AEEEE<///E////A<E/<E<E/EE/EEAEE</A/A/E/<<//E/A/AE//////
@SRR8712342.2.CACGCCTT/2
NAAGAGGAACTGCTGGCCACGAGTACGGGGTGTGGCCATGAATCCTGTGGAGCATCCTTTTGGAGGTGGCAACCACCAGCACATCGGCAAGCCCTCCA
+
#AAAAEEEEEEEEEEEEEEEE<EEEEAEEEEEEEAEEEEE</EEEE<EEEAEEEEEEE<AEE//EEEEAE/<EEAEEEAE6EEEAAAEA/6EEEEEE/
@SRR8712342.3.CACGCCTT/2
NTGAAGATCATGCTGCCCTGGGACCCAACTGGTAAGATTGGCCCTAAGAAGCCCCTGCCTGACCACGTGAGCATTGTGGAACCCAAAGATGAGATACT
However, I think file 3 contains the sample indices (what I believe is called the I1 file) instead of the barcodes and UMIs (the R2 file). Here are the first reads:
@SRR8712342.sra.1 NB500934:132:HNVKHBGX2:1:11101:1446:1079 length=8
CACGCCTT
+SRR8712342.sra.1 NB500934:132:HNVKHBGX2:1:11101:1446:1079 length=8
AAAAAEAA
@SRR8712342.sra.2 NB500934:132:HNVKHBGX2:1:11101:7122:1079 length=8
CACGCCTT
+SRR8712342.sra.2 NB500934:132:HNVKHBGX2:1:11101:7122:1079 length=8
AAAAAEEE
@SRR8712342.sra.3 NB500934:132:HNVKHBGX2:1:11101:5641:1080 length=8
CACGCCTT
+SRR8712342.sra.3 NB500934:132:HNVKHBGX2:1:11101:5641:1080 length=8
AAAAAEEE
@SRR8712342.sra.4 NB500934:132:HNVKHBGX2:1:11101:17715:1080 length=8
CACGCCTT
+SRR8712342.sra.4 NB500934:132:HNVKHBGX2:1:11101:17715:1080 length=8
A//AA///
@SRR8712342.sra.5 NB500934:132:HNVKHBGX2:1:11101:22159:1080 length=8
CACGCCTT
+SRR8712342.sra.5 NB500934:132:HNVKHBGX2:1:11101:22159:1080 length=8
AAAAAEEE
I know the R1 reads are in there (see https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8712342, ticking both technical and biological reads). How can I tell fastq-dump to retrieve the correct files?
Oh, interesting, thanks a lot! Do you happen to know which of my fastq-dump options is the culprit for getting only 2 files?
I got curious and looked up SRR8712342 on the ENA archive. It looks like there is only 1 fastq file there, even though the run is paired-end. All reads are 98 bp long and all read names end with "/2", which (sometimes) means read 2. Odd... I don't know if it has anything to do with scRNA-seq, which I'm not familiar with...
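A quick way to check which file holds what is to tally the read lengths in each one. The sketch below is only an illustration, not part of the original exchange: the function name fastq_read_lengths is made up, and it assumes plain four-line FASTQ records and that gzip, awk and sort are available. In the files shown above, 8-base reads are the sample index and 98-base reads are the cDNA. 10x barcode+UMI reads are typically 26 or 28 bases depending on the chemistry, so a file of those would stand out immediately.

```shell
# Tally read lengths in a gzipped FASTQ file.
# Assumes standard 4-line records, so sequences sit on lines 2, 6, 10, ...
# Prints "<length> <count>" pairs, sorted by length.
fastq_read_lengths() {
    gzip -dc "$1" \
        | awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' \
        | sort -n
}

# Hypothetical usage on the two files from the question:
# fastq_read_lengths fastq/SRR8712342_pass_2.fastq.gz
# fastq_read_lengths fastq/SRR8712342_pass_3.fastq.gz
```

If neither file shows a 26 or 28 base peak, the barcode+UMI read was not dumped at all, rather than being mislabelled.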
10x data is starting to look like a dumpster fire on both NCBI and ENA. If the original BAMs are available, then that is the safest way to go.
Note: ENA seems to be using the default /2 designation of old, likely to designate that it is R2, which is the actual read in 10x.