Hello,
I have downloaded the sra-toolkit from Anaconda (https://anaconda.org/bioconda/sra-tools) and downloaded an .sra file using the command: prefetch SRR20073591.
The .sra file is located here: /faststorage/project/Biof/testdir/SRR20073591/SRR20073591.sra.
When I navigate to the directory and run fasterq-dump SRR20073591.sra, I get a single output file called SRR20073591.fastq. However, I was expecting to get separate R1 and R2 sequencing files as well.
I would have expected R1, R2 and a _3 index file from fasterq-dump SRR20073591, but I still only get the one SRR20073591.fastq file.
Would anyone be kind enough to assist me with this issue?
Use the --split-files option to get the three files. I was able to get the three files using fastq-dump --split-files SRR20073591.
You are using the fastq-dump command; I am looking to use fasterq-dump. But if we stay with fastq-dump for a moment and use --split-files, I do indeed get 3 files. However, those 3 files do not contain what I expected.
If I examine SRR20073591_1.fastq with the head command, I see this:
@SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=8
TAACAAGG
+SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=8
FFFFFFFF
@SRR20073591.2 A00794:315:HLF5JDSXY:1:1101:1551:1016 length=8
TAACAAGG
+SRR20073591.2 A00794:315:HLF5JDSXY:1:1101:1551:1016 length=8
FFFFFFFF
It is totally unexpected that the read length is only 8 bp, because I would expect an average of 126 (https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA857436&o=acc_s%3Aa).
Back to fasterq-dump: I also tried fasterq-dump --split-files, but this only creates SRR20073591_3.fastq. Where are R1 and R2?
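The thread does not state why fasterq-dump --split-files writes only the _3 file. One likely explanation, offered here as an assumption rather than something confirmed above, is that fasterq-dump skips technical reads by default, while fastq-dump includes them unless --skip-technical is given. fasterq-dump has an --include-technical option for this case. A minimal sketch follows; the function name dump_all_reads and the DRY_RUN variable are hypothetical helpers for illustration, not part of sra-tools.

```shell
# Dump every read of a run, technical reads included, one file per read.
# Assumption: the missing _1/_2 files are technical reads that fasterq-dump
# skips unless --include-technical is given.
# DRY_RUN=1 (the default in this sketch) only prints the command.
dump_all_reads() {
    acc="$1"
    set -- fasterq-dump --split-files --include-technical "$acc"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

dump_all_reads SRR20073591
```

Setting DRY_RUN=0 before the call runs the command for real, which requires sra-tools on the PATH and enough disk space for the output.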
This is because the first file (_1) is the Illumina index. The second file (_2) is cell barcodes + UMI, and the final file (_3) is the actual RNA. This is single-cell RNA-seq data.
Use -F if you want to get the original Illumina format read headers minus the SRR*.
I was not aware that I had to use the _2 and _3 files, because other guides say otherwise, for example: https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump
The guide says that we should expect 3 files, but we only get 1 (even if we use fasterq-dump --split-files).
Is fastq-dump --split-files the only option? And if so, should I use _2 for R1 and _3 for R2?
I will say "yes" to the second part of that question.
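For illustration, the renaming implied by that mapping (_2 as R1, _3 as R2) might look like the sketch below. The function name rename_split_files and the _R1/_R2 output names are assumptions; use whatever naming your downstream pipeline expects. The _1 index file is left untouched.

```shell
# Rename split files to R1/R2-style names:
#   _2 (cell barcodes + UMI) -> R1
#   _3 (RNA read)            -> R2
# The output naming pattern is an assumption, not something sra-tools requires.
rename_split_files() {
    acc="$1"
    if [ -f "${acc}_2.fastq" ]; then
        mv -- "${acc}_2.fastq" "${acc}_R1.fastq"
    fi
    if [ -f "${acc}_3.fastq" ]; then
        mv -- "${acc}_3.fastq" "${acc}_R2.fastq"
    fi
    return 0
}

# Usage: run in the directory that holds the split files.
rename_split_files SRR20073591
```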
You could directly download the original data files submitted (3 fastq files) from AWS, if you are able. You can see the links for the S3 bucket under the Data Access tab here: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR20073591&display=data-access (scroll down to the "Original Format" section).
You could try to prefetch the SRA files and then dump with fasterq-dump.
Alright, fastq-dump --split-files is slowly downloading the 3 files.
However, when I use $more SRR20073591.fastq I get:
I assume this file is the RNA-seq file. However, I would have expected it to be named something like SRR20073591_R1.fastq.gz and SRR20073591_R2.fastq.gz. Is it possible that I have misunderstood everything, and that this SRR20073591.fastq file has to be converted into the R1 and R2 files? If yes, how?
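A quick way to see which read a given FASTQ file holds is to look at its read lengths: in this thread the index read was 8 bp, while the RNA read is expected to average about 126 bp. A minimal sketch, assuming uncompressed four-line FASTQ records; the function name read_length_counts is hypothetical, and the demo records are the two shown earlier in the thread.

```shell
# Tabulate read lengths in a FASTQ file: prints "<length> <count>" per line,
# sorted by length. Assumes plain 4-line records (sequence on line 2 of each).
read_length_counts() {
    awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' "$1" | sort -n
}

# Tiny demo file built from the two records posted in the thread.
demo=$(mktemp)
cat > "$demo" <<'EOF'
@SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=8
TAACAAGG
+SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=8
FFFFFFFF
@SRR20073591.2 A00794:315:HLF5JDSXY:1:1101:1551:1016 length=8
TAACAAGG
+SRR20073591.2 A00794:315:HLF5JDSXY:1:1101:1551:1016 length=8
FFFFFFFF
EOF

read_length_counts "$demo"
rm -f "$demo"
```

On a real file, a single short length (such as 8) points to an index read, while lengths near the expected insert read length point to the RNA read.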
Correct, the _3 file is the RNA read. You have to rename the files accordingly. The index read is generally not needed, but these submitters appear to have included it as a separate file, displacing the normal R1/R2 files into the _2/_3 spots. You will need to consider the file with cell barcodes + UMIs for any analysis you are planning to do.
These commands appear to be identical in terms of the files they download. However, it was significantly faster to download fastq files using the fasterq-dump command compared to fastq-dump.
Thank you for the help and clarification of the different files, it was a tremendous help :)
Biomed-jeh You were literally having a three-hour conversation with an expert who is taking the time to write several detailed messages, and all of that without any acknowledgment in writing or through upvotes on your part. Neither GenoMax nor most of us are helping others for a pat on the back, but what about basic manners? Is everyone's help these days taken for granted?
Mensur Dlakic I was working on this project late yesterday, went to sleep and woke up 2 hours ago to continue testing the suggestions. I am sorry that I did not have time to spend my entire night working on these suggestions and give my acknowledgement and gratitude instantly. Regarding the solutions, Kenneth Durbrow from the SRA-toolkit team mentioned a solution which I am also testing right now. I would rather close this post with a definitive answer once I know for sure what the solution is, along with the acknowledgments.
Never mind the validity of the solution - how about an acknowledgement for the time spent helping you? Notice that you are still talking about yourself here and composing an argument to me rather than thanking the person who has been helping you. It is not that difficult, and it takes less time to do the right thing.