SRR download using fasterq-dump
13 months ago
Biomed-jeh ▴ 70

Hello,

I have downloaded the sra-toolkit from Anaconda (https://anaconda.org/bioconda/sra-tools) and downloaded an .sra file using the command: prefetch SRR20073591. The .sra file is located here: /faststorage/project/Biof/testdir/SRR20073591/SRR20073591.sra. When I navigate to the directory and use this command: fasterq-dump SRR20073591.sra, I get an output file called SRR20073591.fastq. However, I was expecting to get separate R1 and R2 sequencing files as well.

I would have expected R1, R2 and a _3 index file when running fasterq-dump SRR20073591, but I still only get the single SRR20073591.fastq file.

Would anyone be kind enough to assist me with this issue?

GEO SRR • 3.5k views

Use --split-files option to get the three files.

I was able to get the three files using fastq-dump --split-files SRR20073591.


You are using the fastq-dump command, whereas I am looking to use fasterq-dump. But if we stay with fastq-dump for a moment and use --split-files, I do indeed get 3 files. However, those 3 files do not contain what I expected.

If I examine SRR20073591_1.fastq with the head command, I see this:

@SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=8
TAACAAGG
+SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=8
FFFFFFFF
@SRR20073591.2 A00794:315:HLF5JDSXY:1:1101:1551:1016 length=8
TAACAAGG
+SRR20073591.2 A00794:315:HLF5JDSXY:1:1101:1551:1016 length=8
FFFFFFFF

A read length of only 8 bp is totally unexpected; I would expect an average of 126 (https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA857436&o=acc_s%3Aa).

Back to fasterq-dump: I also tried fasterq-dump --split-files, but this only creates SRR20073591_3.fastq. Where are R1 and R2?


This is because the first file (_1) is the Illumina index. The second file (_2) is cell barcodes + UMI, and the final file (_3) is the actual RNA. This is single-cell RNA-seq data.

$ more SRR20073591_*
::::::::::::::
SRR20073591_1.fastq
::::::::::::::
@SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=8
TAACAAGG
+SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=8
FFFFFFFF
::::::::::::::
SRR20073591_2.fastq
::::::::::::::
@SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=28
GNAATCGTCCCGTCAAGGTGATTGATAA
+SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=28
F#FFFFFFFFFFFFFFFFFFFFFFFFFF
::::::::::::::
SRR20073591_3.fastq
::::::::::::::
@SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=90
GTTCAATTTTTAGCACCAACTACCAACTTCTGGCAGTTCACATGCACCTGCACTTCCATGTCCAGGGGATTTGGCATCCTCTCATGGTTC
+SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=90
FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Use -F if you want to get the original Illumina format read headers minus the SRR*.
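
To check which file is which without paging through them, one option is to compare the first read length in each file. Below is a minimal Python sketch, assuming uncompressed FASTQ files named <accession>_N.fastq in one directory. The function names and the length cut-offs (10 and 30) are my own illustration, guessed from the 8 / 28 / 90 bp reads shown above, not a general rule.

```python
from pathlib import Path


def first_read_length(fastq_path):
    """Return the sequence length of the first record in an uncompressed FASTQ file."""
    with open(fastq_path) as handle:
        header = handle.readline()
        if not header.startswith("@"):
            raise ValueError(f"{fastq_path} does not look like FASTQ")
        return len(handle.readline().strip())


def guess_role(length):
    """Hypothetical heuristic based only on the lengths seen in this thread."""
    if length <= 10:
        return "sample index"
    if length <= 30:
        return "cell barcode + UMI"
    return "RNA (cDNA) read"


def describe(directory, accession):
    """Map each <accession>_N.fastq file name to (first read length, guessed role)."""
    report = {}
    for path in sorted(Path(directory).glob(f"{accession}_*.fastq")):
        length = first_read_length(path)
        report[path.name] = (length, guess_role(length))
    return report


if __name__ == "__main__":
    for name, (length, role) in describe(".", "SRR20073591").items():
        print(f"{name}: first read is {length} bp, probably {role}")
```

For other library preparations the barcode and index lengths differ, so treat the guessed role as a hint and confirm against the protocol the submitters used.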


I was not aware that I had to use the _2 and _3 files, because other guides say otherwise, for example: https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump

The guide says that we should expect 3 files, but we only get 1 (even if we use fasterq-dump --split-files).

Is fastq-dump --split-files the only option? And if so, should I use _2 for R1 and _3 for R2?


Quoting the question: "Is fastq-dump --split-files the only option? And if so, should I use _2 for R1 and _3 for R2?"

I will say "yes" to the second part of that question.

You could directly download the original data files submitted (3 fastq files) from AWS, if you are able. You can see the links for s3 bucket under the Data Access tab here: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR20073591&display=data-access (scroll down to "Original Format" section).

You could also try to prefetch the SRA files and then dump them with fasterq-dump.


Alright, fastq-dump --split-files is slowly downloading the 3 files.

However, when I run more SRR20073591.fastq I get:

@SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=90
GTTCAATTTTTAGCACCAACTACCAACTTCTGGCAGTTCACATGCACCTGCACTTCCATGTCCAGGGGATTTGGCATCCTCTCATGGTTC
+SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=90
FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR20073591.2 A00794:315:HLF5JDSXY:1:1101:1551:1016 length=90
AAGGAAGTGAACAAAACCATCCAGAATGTAAAAATGAAAATAGAAACAATAAAGAAATCACAAACGGAGACAACCCTGGGCGATAGAAAA
+SRR20073591.2 A00794:315:HLF5JDSXY:1:1101:1551:1016 length=90
FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF

I assume this file is the RNA-seq file. However, I would have expected it to be named something like SRR20073591_R1.fastq.gz and SRR20073591_R2.fastq.gz. Is it possible that I have misunderstood everything, and that this SRR20073591.fastq file has to be converted into the R1 and R2 files? If yes, how?


Correct, the _3 file is the RNA read. You have to rename the files accordingly. The index read is generally not needed, but these submitters appear to have included it as a separate file, displacing the normal R1/R2 files into the _2/_3 slots. You will need to use the file with cell barcodes + UMIs for any analysis you are planning to do.
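
The renaming could be scripted roughly as follows. This is a sketch, not a definitive recipe: the _1/_2/_3 mapping comes from this thread, while the target pattern (<sample>_S1_L001_<role>_001.fastq, modelled on bcl2fastq-style names), the sample name and the function names are assumptions for illustration. Check what your downstream pipeline actually expects before renaming anything.

```python
from pathlib import Path

# Assumed mapping for this run, taken from the replies in this thread:
# _1 = sample index, _2 = cell barcode + UMI, _3 = RNA read.
ROLE_BY_SUFFIX = {"_1": "I1", "_2": "R1", "_3": "R2"}


def rename_plan(directory, accession, sample="sample", lane=1):
    """Return (old_path, new_path) pairs; nothing is renamed here.

    The target naming pattern is an assumption (bcl2fastq-style);
    adjust it to whatever your pipeline expects.
    """
    directory = Path(directory)
    plan = []
    for suffix, role in ROLE_BY_SUFFIX.items():
        old = directory / f"{accession}{suffix}.fastq"
        if old.exists():
            new = directory / f"{sample}_S1_L{lane:03d}_{role}_001.fastq"
            plan.append((old, new))
    return plan


def apply_plan(plan):
    """Rename the files, refusing to overwrite anything that already exists."""
    for old, new in plan:
        if new.exists():
            raise FileExistsError(new)
        old.rename(new)
```

Printing the result of rename_plan(".", "SRR20073591") before calling apply_plan makes it easy to review the new names first.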

fasterq-dump --include-technical
fastq-dump --split-files

These commands appear to be identical in terms of the files they download. However, it was significantly faster to download fastq files using the fasterq-dump command compared to fastq-dump.

Thank you for the help and clarification of the different files, it was a tremendous help :)
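
For anyone wanting to script the fasterq-dump call, here is a minimal sketch that builds the command and runs it via subprocess. The wrapper functions are hypothetical. Passing --split-files together with --include-technical is my assumption (the post above only reports --include-technical); --threads and --outdir are standard fasterq-dump options, and the thread count is an arbitrary default.

```python
import shutil
import subprocess


def fasterq_dump_command(accession, outdir=".", threads=4):
    """Build a fasterq-dump call that also writes the technical reads."""
    return [
        "fasterq-dump",
        "--split-files",        # one output file per read
        "--include-technical",  # keep index / barcode reads as well
        "--threads", str(threads),
        "--outdir", outdir,
        accession,
    ]


def run_fasterq_dump(accession, outdir="."):
    """Run the command if fasterq-dump is on PATH; raise otherwise."""
    if shutil.which("fasterq-dump") is None:
        raise RuntimeError("fasterq-dump not found on PATH (is sra-tools installed?)")
    subprocess.run(fasterq_dump_command(accession, outdir), check=True)
```

Calling run_fasterq_dump("SRR20073591") from the directory that holds the prefetched run should then produce the _1, _2 and _3 files discussed above.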


Biomed-jeh You were literally having a three-hour conversation with an expert who is taking the time to write several detailed messages, and all of that without any acknowledgment in writing or through upvotes on your part. Neither GenoMax nor most of us are helping others for a pat on the back, but what about basic manners? Is everyone's help these days taken for granted?


Mensur Dlakic I was working on this project late last night, went to sleep, and woke up 2 hours ago to continue testing the suggestions. I am sorry that I did not have time to spend my entire night working on these suggestions and give my acknowledgement and gratitude instantly. Regarding the solutions, Kenneth Durbrow from the SRA-toolkit team mentioned a solution which I am also testing right now. I would rather close this post with a definitive answer, along with the acknowledgments, once I know for sure what the solution is.


Never mind the validity of the solution - how about an acknowledgement for the time spent helping you? Notice that you are still talking about yourself here and composing an argument to me rather than thanking the person who has been helping you. It is not that difficult, and it takes less time to do the right thing.

13 months ago
Harsha • 0

Check this from the Edwards lab: https://edwards.flinders.edu.au/fastq-dump/


I am looking to use the fasterq-dump command; I added a reply to GenoMax's comment above. Thanks for the link, though. It gives a fine description, but unfortunately it does not solve the current issue.


Check this gist; seqkit might also give some insights.
