fastq-dump returns I1 and R1 files instead of R1 and R2

4.5 years ago

C_sinensis ▴ 30

I have downloaded the data corresponding to SRR8712342 by doing:

prefetch SRR8712342

I then tried to get the FASTQ files with fastq-dump. Because it is a 10x scRNA-seq data set, I used the following options:

fastq-dump --split-files --outdir fastq --gzip --readids --read-filter pass --dumpbase --clip --defline-seq '@$ac.$si.$sg/$ri' --defline-qual '+' SRR8712342/SRR8712342.sra -I

I get two FASTQ files (SRR8712342_pass_2.fastq.gz and SRR8712342_pass_3.fastq.gz). I think file number 2 corresponds to what is usually called R1. Here are its first reads:

@SRR8712342.1.CACGCCTT/2
NTTTTGGGCCCCTACTCTATTCCTTTTATGCAAACCTCACAGAATTTTAACCAGAAAGGCCAGGCAGGATGGCTCACGCCTGTAATCACAGCGCTTTG
+
#AAAAEE<</<EEEEEEEEEEE/A<EEAEEA<///EEE/E/E/AEEEE<///E////A<E/<E<E/EE/EEAEE</A/A/E/<<//E/A/AE//////
@SRR8712342.2.CACGCCTT/2
NAAGAGGAACTGCTGGCCACGAGTACGGGGTGTGGCCATGAATCCTGTGGAGCATCCTTTTGGAGGTGGCAACCACCAGCACATCGGCAAGCCCTCCA
+
#AAAAEEEEEEEEEEEEEEEE<EEEEAEEEEEEEAEEEEE</EEEE<EEEAEEEEEEE<AEE//EEEEAE/<EEAEEEAE6EEEAAAEA/6EEEEEE/
@SRR8712342.3.CACGCCTT/2
NTGAAGATCATGCTGCCCTGGGACCCAACTGGTAAGATTGGCCCTAAGAAGCCCCTGCCTGACCACGTGAGCATTGTGGAACCCAAAGATGAGATACT

However, I think file number 3 contains the sample indices (what I believe is called the I1 file) instead of the barcodes and UMIs (the R2 file). Here are its first reads:

@SRR8712342.sra.1 NB500934:132:HNVKHBGX2:1:11101:1446:1079 length=8
CACGCCTT
+SRR8712342.sra.1 NB500934:132:HNVKHBGX2:1:11101:1446:1079 length=8
AAAAAEAA
@SRR8712342.sra.2 NB500934:132:HNVKHBGX2:1:11101:7122:1079 length=8
CACGCCTT
+SRR8712342.sra.2 NB500934:132:HNVKHBGX2:1:11101:7122:1079 length=8
AAAAAEEE
@SRR8712342.sra.3 NB500934:132:HNVKHBGX2:1:11101:5641:1080 length=8
CACGCCTT
+SRR8712342.sra.3 NB500934:132:HNVKHBGX2:1:11101:5641:1080 length=8
AAAAAEEE
@SRR8712342.sra.4 NB500934:132:HNVKHBGX2:1:11101:17715:1080 length=8
CACGCCTT
+SRR8712342.sra.4 NB500934:132:HNVKHBGX2:1:11101:17715:1080 length=8
A//AA///
@SRR8712342.sra.5 NB500934:132:HNVKHBGX2:1:11101:22159:1080 length=8
CACGCCTT
+SRR8712342.sra.5 NB500934:132:HNVKHBGX2:1:11101:22159:1080 length=8
AAAAAEEE

I know the R1 reads are in there (see https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8712342, ticking both technical and biological reads). How can I tell fastq-dump to retrieve the correct files?

fastq-dump • 3.0k views

ADD COMMENT • link updated 2.2 years ago by Ram 45k • written 4.5 years ago by C_sinensis ▴ 30

score 0 · Answer 1 · 2020-11-03

4.5 years ago

GenoMax 151k

I am able to get three files with fastq-dump from sra-toolkit v2.10.5:

fastq-dump -F --split-files SRR8712342.sra

@NB500934:132:HNVKHBGX2:1:11101:1446:1079
CGAGCNCGTAAGGATTTTTCAGAATG
+NB500934:132:HNVKHBGX2:1:11101:1446:1079
AAAAA#AEEEEEEEEEEEEEEEEEEE

@NB500934:132:HNVKHBGX2:1:11101:1446:1079
NTTTTGGGCCCCTACTCTATTCCTTTTATGCAAACCTCACAGAATTTTAACCAGAAAGGCCAGGCAGGATGGCTCACGCCTGTAATCACAGCGCTTTG
+NB500934:132:HNVKHBGX2:1:11101:1446:1079
#AAAAEE<</<EEEEEEEEEEE/A<EEAEEA<///EEE/E/E/AEEEE<///E////A<E/<E<E/EE/EEAEE</A/A/E/<<//E/A/AE//////

@NB500934:132:HNVKHBGX2:1:11101:1446:1079
CACGCCTT
+NB500934:132:HNVKHBGX2:1:11101:1446:1079
AAAAAEAA
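To check which of the split files is which, it may help to look at the read length of the first record in each file. Below is a minimal sketch, not an official recipe. The helper names (`first_read_len`, `label_for_len`, `describe_fastqs`) are made up for illustration, and the length-to-label mapping is an assumption based on the reads shown in this thread (8 bp index, 26 bp barcode + UMI, 98 bp cDNA read); other 10x chemistries use different lengths.

```shell
# Print the length of the first read in a FASTQ file (plain or gzipped).
first_read_len() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'NR == 2 { print length($0); exit }'
}

# Map a read length to a 10x-style label.
# ASSUMPTION: 8 bp = sample index (I1), 26 bp = barcode + UMI (R1),
# anything else = cDNA read (R2), matching the reads shown in this thread.
label_for_len() {
    case "$1" in
        8)  echo I1 ;;
        26) echo R1 ;;
        *)  echo R2 ;;
    esac
}

# Print "<file> <first read length> <label>" for every file given.
describe_fastqs() {
    for f in "$@"; do
        len=$(first_read_len "$f")
        printf '%s\t%s\t%s\n' "$f" "$len" "$(label_for_len "$len")"
    done
}
```

With the three files from the command above, this could be called as `describe_fastqs SRR8712342_1.fastq SRR8712342_2.fastq SRR8712342_3.fastq`. The first record is only a quick indicator, so it is worth spot-checking a few more reads before renaming anything.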

ADD COMMENT • link 4.5 years ago by GenoMax 151k

Oh, interesting, thanks a lot! Do you happen to know which of my fastq-dump options is the culprit for getting only 2 files?
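One way to narrow this down empirically is to rerun fastq-dump on a small number of spots, adding one option at a time, and count how many files come out. The following is only a sketch under assumptions: `count_outputs`, `try_options`, and the `FASTQ_DUMP` override variable are hypothetical helpers, fastq-dump is assumed to be on the PATH, and `-X 1000` (maximum spot id) is used so that each trial only dumps the first 1000 spots. It does not say which option is responsible; it only makes the comparison quick.

```shell
# Count the FASTQ files fastq-dump writes for a given set of extra options.
# Usage: count_outputs <sra file> [extra fastq-dump options...]
# FASTQ_DUMP may be set to point at a specific binary (hypothetical override).
count_outputs() {
    sra="$1"; shift
    tmp=$(mktemp -d) || return 1
    # -X 1000 limits the dump to the first 1000 spots so each trial is fast.
    "${FASTQ_DUMP:-fastq-dump}" -X 1000 --split-files --outdir "$tmp" "$@" "$sra" >/dev/null 2>&1
    n=$(find "$tmp" -type f -name '*.fastq*' | wc -l | tr -d ' ')
    rm -rf "$tmp"
    echo "$n"
}

# Try each option from the original command on its own and print
# "<option> <number of output files>" per line, after a baseline run.
try_options() {
    sra_file="$1"
    printf 'baseline\t%s\n' "$(count_outputs "$sra_file")"
    for opt in "--readids" "--read-filter pass" "--dumpbase" "--clip"; do
        # $opt is left unquoted on purpose so "--read-filter pass"
        # is split into two separate arguments.
        printf '%s\t%s\n' "$opt" "$(count_outputs "$sra_file" $opt)"
    done
}
```

For this run it could be invoked as `try_options SRR8712342/SRR8712342.sra`. Any option whose line shows fewer files than the baseline would be the one to look at more closely; combinations of options may also need testing.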

ADD REPLY • link 4.5 years ago by C_sinensis ▴ 30

I got curious and looked up SRR8712342 on the ENA archive. It looks like there is only one FASTQ file there, even though the run is paired-end. All reads are 98 bp long, and the read names all end with "/2", which (sometimes) means read 2. Odd... I don't know if it has anything to do with scRNA-seq, which I'm not familiar with...

ADD REPLY • link 4.5 years ago by dariober 15k

10x data is starting to look like a dumpster fire on both NCBI and ENA. If the original BAMs are available, then using those is the safest way to go.

Note: ENA seems to be using the old default /2 designation, likely to indicate that this is R2, which is the actual read in 10x.

ADD REPLY • link 4.5 years ago by GenoMax 151k