fastq-dump returns I1 and R1 files instead of R1 and R2
4.1 years ago
C_sinensis ▴ 30

I have downloaded the data corresponding to SRR8712342 by doing:

prefetch SRR8712342

I then tried to extract the FASTQ files with fastq-dump. Because this is a 10x scRNA-seq data set, I used the following options:

fastq-dump --split-files --outdir fastq --gzip --readids --read-filter pass --dumpbase --clip --defline-seq '@$ac.$si.$sg/$ri' --defline-qual '+' SRR8712342/SRR8712342.sra -I

I get two FASTQ files (SRR8712342_pass_2.fastq.gz and SRR8712342_pass_3.fastq.gz). File 2 corresponds to what I think is usually called R1. Here are the first reads:

@SRR8712342.1.CACGCCTT/2
NTTTTGGGCCCCTACTCTATTCCTTTTATGCAAACCTCACAGAATTTTAACCAGAAAGGCCAGGCAGGATGGCTCACGCCTGTAATCACAGCGCTTTG
+
#AAAAEE<</<EEEEEEEEEEE/A<EEAEEA<///EEE/E/E/AEEEE<///E////A<E/<E<E/EE/EEAEE</A/A/E/<<//E/A/AE//////
@SRR8712342.2.CACGCCTT/2
NAAGAGGAACTGCTGGCCACGAGTACGGGGTGTGGCCATGAATCCTGTGGAGCATCCTTTTGGAGGTGGCAACCACCAGCACATCGGCAAGCCCTCCA
+
#AAAAEEEEEEEEEEEEEEEE<EEEEAEEEEEEEAEEEEE</EEEE<EEEAEEEEEEE<AEE//EEEEAE/<EEAEEEAE6EEEAAAEA/6EEEEEE/
@SRR8712342.3.CACGCCTT/2
NTGAAGATCATGCTGCCCTGGGACCCAACTGGTAAGATTGGCCCTAAGAAGCCCCTGCCTGACCACGTGAGCATTGTGGAACCCAAAGATGAGATACT

However, I think file 3 contains the sample indices (what I believe is called the I1 file) instead of the barcodes and UMIs (the R2 file). Here are the first reads:

@SRR8712342.sra.1 NB500934:132:HNVKHBGX2:1:11101:1446:1079 length=8
CACGCCTT
+SRR8712342.sra.1 NB500934:132:HNVKHBGX2:1:11101:1446:1079 length=8
AAAAAEAA
@SRR8712342.sra.2 NB500934:132:HNVKHBGX2:1:11101:7122:1079 length=8
CACGCCTT
+SRR8712342.sra.2 NB500934:132:HNVKHBGX2:1:11101:7122:1079 length=8
AAAAAEEE
@SRR8712342.sra.3 NB500934:132:HNVKHBGX2:1:11101:5641:1080 length=8
CACGCCTT
+SRR8712342.sra.3 NB500934:132:HNVKHBGX2:1:11101:5641:1080 length=8
AAAAAEEE
@SRR8712342.sra.4 NB500934:132:HNVKHBGX2:1:11101:17715:1080 length=8
CACGCCTT
+SRR8712342.sra.4 NB500934:132:HNVKHBGX2:1:11101:17715:1080 length=8
A//AA///
@SRR8712342.sra.5 NB500934:132:HNVKHBGX2:1:11101:22159:1080 length=8
CACGCCTT
+SRR8712342.sra.5 NB500934:132:HNVKHBGX2:1:11101:22159:1080 length=8
AAAAAEEE

I know the R1 reads are in there (see https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8712342, ticking both technical and biological reads). How can I tell fastq-dump to retrieve the correct files?

4.1 years ago
GenoMax 147k

I am able to get three files with fastq-dump, using sra-toolkit v2.10.5. The first record from each file is shown below the command:

fastq-dump -F --split-files SRR8712342.sra

@NB500934:132:HNVKHBGX2:1:11101:1446:1079
CGAGCNCGTAAGGATTTTTCAGAATG
+NB500934:132:HNVKHBGX2:1:11101:1446:1079
AAAAA#AEEEEEEEEEEEEEEEEEEE

@NB500934:132:HNVKHBGX2:1:11101:1446:1079
NTTTTGGGCCCCTACTCTATTCCTTTTATGCAAACCTCACAGAATTTTAACCAGAAAGGCCAGGCAGGATGGCTCACGCCTGTAATCACAGCGCTTTG
+NB500934:132:HNVKHBGX2:1:11101:1446:1079
#AAAAEE<</<EEEEEEEEEEE/A<EEAEEA<///EEE/E/E/AEEEE<///E////A<E/<E<E/EE/EEAEE</A/A/E/<<//E/A/AE//////

@NB500934:132:HNVKHBGX2:1:11101:1446:1079
CACGCCTT
+NB500934:132:HNVKHBGX2:1:11101:1446:1079
AAAAAEAA
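
The three records above have sequence lengths of 26, 98 and 8 bases, so a quick length tally per output file is one way to tell the files apart. The sketch below assumes plain four-line FASTQ records (optionally gzipped); `read_lengths` is a hypothetical helper name, not part of sra-toolkit. For 10x data one would typically expect the 26-base file to hold the cell barcode plus UMI, the 98-base file the cDNA read and the 8-base file the sample index, but that mapping is an assumption to check against the library prep.

```shell
# Tally sequence lengths in the first N records of a FASTQ file.
# Usage: read_lengths FILE [N]     (N defaults to 1000 records)
# Example: read_lengths SRR8712342_1.fastq
read_lengths() {
    file=$1
    n=${2:-1000}
    # Decompress on the fly when the file name ends in .gz
    case "$file" in
        *.gz) gzip -dc "$file" ;;
        *)    cat "$file" ;;
    esac |
    head -n "$((n * 4))" |
    awk '
        NR % 4 == 2 { count[length($0)]++ }   # line 2 of each record is the sequence
        END { for (len in count) print len, count[len] }
    ' |
    sort -n
}
```

Each output line is a sequence length followed by the number of reads with that length, so a file that prints a single line starting with 8 holds only 8-base reads.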

Oh, interesting, thanks a lot! Do you happen to know which of my fastq-dump options is the culprit for getting only 2 files?
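
One way to look for the culprit empirically is to re-run fastq-dump on a handful of spots with one candidate option at a time and count the files written. This is only a sketch under stated assumptions: `try_options` is a hypothetical helper, the option list is copied from the command in the question, `-X` (maximum spot id) keeps each run short, and the `FASTQ_DUMP` variable is an invented override for dry runs. It does not claim which option is responsible.

```shell
# Run fastq-dump once per candidate option and report how many files each run writes.
# Usage: try_options FILE.sra [SPOTS]     (SPOTS defaults to 1000)
# Assumes a POSIX sh or bash shell, where unquoted variables are word-split.
try_options() {
    sra=$1
    spots=${2:-1000}
    for opt in "--readids" "--read-filter pass" "--dumpbase" "--clip"; do
        out=$(mktemp -d)
        # $opt is deliberately unquoted so "--read-filter pass" becomes two arguments
        ${FASTQ_DUMP:-fastq-dump} --split-files -X "$spots" --outdir "$out" $opt "$sra" >/dev/null 2>&1
        printf '%s\t%s file(s)\n' "$opt" "$(ls "$out" | wc -l | tr -d ' ')"
        rm -rf "$out"
    done
}
```

Calling `try_options SRR8712342/SRR8712342.sra` prints one line per option. An option whose run writes fewer files than the plain `--split-files` run is a candidate to test further, possibly in combination with the others.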


I got curious and looked up SRR8712342 on the ENA archive. It looks like there is only one FASTQ file there, even though the run is paired-end. All reads are 98 bp long, and the read names all end with "/2", which (sometimes) means read 2. Odd... I don't know whether this has anything to do with scRNA-seq, which I'm not familiar with.


10x data is starting to look like a dumpster fire on both NCBI and ENA. If the original BAMs are available, then that is the safest way to go.

Note: ENA seems to be using the old default /2 designation, likely to indicate that this is R2, which is the actual (cDNA) read in 10x data.
