SRR download using fasterq-dump

19 months ago

Biomed-jeh ▴ 70

Hello,

I have downloaded the sra-toolkit from Anaconda (https://anaconda.org/bioconda/sra-tools) and downloaded an .sra file using the command: prefetch SRR20073591. The .sra file is located here: /faststorage/project/Biof/testdir/SRR20073591/SRR20073591.sra. When I navigate to the directory and use this command: fasterq-dump SRR20073591.sra, I get an output file called SRR20073591.fastq. However, I was expecting to get separate R1 and R2 sequencing files as well.

I would have expected R1, R2 and a _3 index file from fasterq-dump SRR20073591, but I still get only the single SRR20073591.fastq file.

Would anyone be kind enough to assist me with this issue?

GEO SRR • 5.8k views

ADD COMMENT • link 19 months ago by Biomed-jeh ▴ 70

Use --split-files option to get the three files.

I was able to get the three files using fastq-dump --split-files SRR20073591.

ADD REPLY • link 19 months ago by GenoMax 151k

You are using the fastq-dump command, whereas I am looking to use fasterq-dump. But if we stay on fastq-dump for a moment and use --split-files, I do indeed get 3 files. However, those 3 files do not contain what I expected.

If I examine SRR20073591_1.fastq with the head command, I see this:

@SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=8
TAACAAGG
+SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=8
FFFFFFFF
@SRR20073591.2 A00794:315:HLF5JDSXY:1:1101:1551:1016 length=8
TAACAAGG
+SRR20073591.2 A00794:315:HLF5JDSXY:1:1101:1551:1016 length=8
FFFFFFFF

A read length of only 8 bp is totally unexpected, because I would expect an average of 126 (https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA857436&o=acc_s%3Aa).

Back to fasterq-dump: I also tried fasterq-dump --split-files, but this creates only SRR20073591_3.fastq. Where are R1 and R2?

ADD REPLY • link 19 months ago by Biomed-jeh ▴ 70

This is because the first file (_1) is the Illumina index read. The second file (_2) is cell barcodes + UMI, and the final file (_3) is the actual RNA read. This is single-cell RNA-seq data.

$ more SRR20073591_*
::::::::::::::
SRR20073591_1.fastq
::::::::::::::
@SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=8
TAACAAGG
+SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=8
FFFFFFFF
::::::::::::::
SRR20073591_2.fastq
::::::::::::::
@SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=28
GNAATCGTCCCGTCAAGGTGATTGATAA
+SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=28
F#FFFFFFFFFFFFFFFFFFFFFFFFFF
::::::::::::::
SRR20073591_3.fastq
::::::::::::::
@SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=90
GTTCAATTTTTAGCACCAACTACCAACTTCTGGCAGTTCACATGCACCTGCACTTCCATGTCCAGGGGATTTGGCATCCTCTCATGGTTC
+SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=90
FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Use -F if you want to get the original Illumina-format read headers minus the SRR*.
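One quick way to check which split file holds which read is to print the read lengths at the top of each file. This is a minimal sketch, assuming the three uncompressed files from fastq-dump --split-files are in the current directory; the helper name first_read_lengths is made up for illustration and is not part of sra-tools.

```shell
#!/bin/bash
# Hypothetical helper (not part of sra-tools): print the lengths of the
# first N reads in an uncompressed FASTQ file, one length per line.
first_read_lengths() {
    local fastq="$1" n="${2:-3}"
    # In a 4-line FASTQ record the sequence is line 2, 6, 10, ...
    awk -v n="$n" 'NR % 4 == 2 { print length($0); if (++seen == n) exit }' "$fastq"
}

# Summarise each split file that is present in the current directory.
for f in SRR20073591_1.fastq SRR20073591_2.fastq SRR20073591_3.fastq; do
    [ -f "$f" ] || continue
    echo "$f: $(first_read_lengths "$f" 3 | sort -nu | paste -sd, -)"
done
```

For this run, the listing above shows lengths of 8 (index), 28 (cell barcodes + UMI) and 90 (RNA read).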

ADD REPLY • link 19 months ago by GenoMax 151k

I was not aware that I had to use the _2 and _3 files, because other guides say otherwise, for example: https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump

The guide says that we should expect 3 files, but we only get 1 (even if we use fasterq-dump --split-files).

Is fastq-dump --split-files the only option? And if so, should I use _2 for R1 and _3 for R2?

ADD REPLY • link 19 months ago by Biomed-jeh ▴ 70

Is fastq-dump --split-files the only option? And if so, should I use _2 for R1 and _3 for R2?

I will say "yes" to the second part of that question.

If you are able, you could directly download the original data files as submitted (3 FASTQ files) from AWS. The links to the S3 bucket are under the Data Access tab here: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR20073591&display=data-access (scroll down to the "Original Format" section).

You could also try to prefetch the SRA file and then dump it with fasterq-dump.

ADD REPLY • link 19 months ago by GenoMax 151k

Alright, fastq-dump --split-files is slowly downloading the 3 files.

However, when I run more SRR20073591.fastq I get:

@SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=90
GTTCAATTTTTAGCACCAACTACCAACTTCTGGCAGTTCACATGCACCTGCACTTCCATGTCCAGGGGATTTGGCATCCTCTCATGGTTC
+SRR20073591.1 A00794:315:HLF5JDSXY:1:1101:1371:1016 length=90
FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR20073591.2 A00794:315:HLF5JDSXY:1:1101:1551:1016 length=90
AAGGAAGTGAACAAAACCATCCAGAATGTAAAAATGAAAATAGAAACAATAAAGAAATCACAAACGGAGACAACCCTGGGCGATAGAAAA
+SRR20073591.2 A00794:315:HLF5JDSXY:1:1101:1551:1016 length=90
FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF

I assume this file is the RNA-seq file. However, I would have expected it to be named something like SRR20073591_R1.fastq.gz and SRR20073591_R2.fastq.gz. Is it possible that I have misunderstood everything, and that this SRR20073591.fastq file has to be converted into the R1 and R2 files? If yes, how?

ADD REPLY • link 19 months ago by Biomed-jeh ▴ 70

Correct, the _3 file is the RNA read. You have to rename the files accordingly. The index read is generally not needed, but these submitters appear to have included it as a separate file, displacing the normal R1/R2 files into the _2/_3 spots. You will need to consider the file with cell barcodes + UMIs for any analysis you are planning to do.
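The renaming could be sketched as follows. This assumes the three files from fastq-dump --split-files sit in the current directory. The I1/R1/R2 target names are only an illustrative convention (use whatever your downstream tool expects), and the function name rename_split_reads is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: rename the _1/_2/_3 files of a run so the names
# reflect their content (index, cell barcodes + UMI, RNA read).
# The mapping is specific to runs laid out like SRR20073591.
rename_split_reads() {
    local acc="$1"
    local pair src dst
    for pair in "1:I1" "2:R1" "3:R2"; do
        src="${acc}_${pair%%:*}.fastq"
        dst="${acc}_${pair##*:}.fastq"
        if [ -f "$src" ]; then
            mv -n -- "$src" "$dst"   # -n: never overwrite an existing file
        fi
    done
}

rename_split_reads SRR20073591
```

Files that are missing are skipped, so the function is safe to call when the dump has not finished or was only partly successful.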

ADD REPLY • link 19 months ago by GenoMax 151k

fasterq-dump --include-technical
fastq-dump --split-files

These commands appear to be identical in terms of the files they download. However, it was significantly faster to download fastq files using the fasterq-dump command compared to fastq-dump.
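Put together, the faster route can be sketched as a small wrapper. prefetch, fasterq-dump and the --split-files, --include-technical and --outdir options come from sra-tools; the function name, the output directory name and the DRY_RUN switch are made up here so the commands can be previewed without downloading anything. As far as I know, fasterq-dump skips technical reads unless told otherwise, which would be consistent with the single file seen earlier in this thread.

```shell
#!/bin/bash
# Hypothetical wrapper: fetch a run and dump all reads, including the
# technical ones (index, cell barcodes + UMI), into separate files.
dump_with_technical() {
    local acc="$1" outdir="${2:-.}"
    local run=""
    # DRY_RUN=1 only prints the commands instead of executing them.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    $run prefetch "$acc" &&
        $run fasterq-dump --split-files --include-technical --outdir "$outdir" "$acc"
}

# Preview the commands for the run discussed in this thread.
DRY_RUN=1 dump_with_technical SRR20073591 fastq_out
```

Dropping DRY_RUN=1 runs the real commands, which requires sra-tools on the PATH.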

Thank you for the help and clarification of the different files, it was a tremendous help :)

ADD REPLY • link 19 months ago by Biomed-jeh ▴ 70

Biomed-jeh You were literally having a three-hour conversation with an expert who is taking the time to write several detailed messages, and all of that without any acknowledgment in writing or through upvotes on your part. Neither GenoMax nor most of us are helping others for a pat on the back, but what about basic manners? Is everyone's help these days taken for granted?

ADD REPLY • link 19 months ago by Mensur Dlakic ★ 29k

Mensur Dlakic I was working on this project late yesterday, went to sleep, and woke up 2 hours ago to continue testing the suggestions. I am sorry that I did not have time to spend my entire night working on these suggestions and to give my acknowledgement and gratitude instantly. As for the solutions, Kenneth Durbrow from the SRA Toolkit team mentioned a solution which I am also testing right now. I would rather close this post with a definitive answer, once I know for sure what the solution is, along with the acknowledgements.

ADD REPLY • link 19 months ago by Biomed-jeh ▴ 70

Never mind the validity of the solution - how about an acknowledgement for the time spent helping you? Notice that you are still talking about yourself here and composing an argument to me rather than thanking the person who has been helping you. It is not that difficult, and it takes less time to do the right thing.

ADD REPLY • link 19 months ago by Mensur Dlakic ★ 29k

19 months ago

Harsha • 0

Check this from the Edwards lab: https://edwards.flinders.edu.au/fastq-dump/

ADD COMMENT • link 19 months ago by Harsha • 0

I am looking to use the fasterq-dump command; I added a reply to GenoMax's comment above. Thanks for the link, though: it gives a fine description, but unfortunately it does not solve the current issue.

ADD REPLY • link 19 months ago by Biomed-jeh ▴ 70

Check this gist:

	#!/bin/bash
	# Usage: deinterleave_fastq.sh < interleaved.fastq f.fastq r.fastq [compress]
	#
	# Deinterleaves a FASTQ file of paired reads into two FASTQ
	# files specified on the command line. Optionally GZip compresses the output
	# FASTQ files using pigz if the 3rd command line argument is the word "compress"
	#
	# Can deinterleave 100 million paired reads (200 million total
	# reads; a 43Gbyte file), in memory (/dev/shm), in 4m15s (255s)
	#
	# Latest code: https://gist.github.com/3521724
	# Also see my interleaving script: https://gist.github.com/4544979
	#
	# Inspired by Torsten Seemann's blog post:
	# http://thegenomefactory.blogspot.com.au/2012/05/cool-use-of-unix-paste-with-ngs.html

	# Set up some defaults
	GZIP_OUTPUT=0
	PIGZ_COMPRESSION_THREADS=10

	# If the third argument is the word "compress" then we'll compress the output using pigz
	if [[ $3 == "compress" ]]; then
		GZIP_OUTPUT=1
	fi

	# paste joins each group of 8 lines (one read pair) into one tab-separated row;
	# fields 1-4 go to the first output file and fields 5-8 to the second.
	if [[ ${GZIP_OUTPUT} == 0 ]]; then
		paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > "$1") | cut -f 5-8 | tr "\t" "\n" > "$2"
	else
		paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > "$1") | cut -f 5-8 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > "$2"
	fi
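To see what the paste step in the gist does, here is a toy run on a made-up interleaved file with two read pairs. It is only a sketch: the file names and reads are invented, and it writes an intermediate table instead of using tee with process substitution as the gist does.

```shell
#!/bin/bash
# Toy demonstration of the paste-based deinterleave used in the gist.
# All file names and reads below are made up for this example.
workdir=$(mktemp -d)
printf '%s\n' \
    '@pair1/1' 'ACGT' '+' 'FFFF' \
    '@pair1/2' 'TTGA' '+' 'FFF:' \
    '@pair2/1' 'GGCA' '+' 'FFFF' \
    '@pair2/2' 'CCAT' '+' 'F:FF' > "$workdir/interleaved.fastq"

# paste joins every 8 lines (one read pair) into a single tab-separated row;
# columns 1-4 are the forward read, columns 5-8 the reverse read.
paste - - - - - - - - < "$workdir/interleaved.fastq" > "$workdir/rows.tsv"
cut -f 1-4 "$workdir/rows.tsv" | tr '\t' '\n' > "$workdir/forward.fastq"
cut -f 5-8 "$workdir/rows.tsv" | tr '\t' '\n' > "$workdir/reverse.fastq"

# Show the first header of each output file.
head -n 1 "$workdir/forward.fastq" "$workdir/reverse.fastq"
```

Note that this only applies to an interleaved file of paired reads; the files produced by --split-files in this thread are already separate.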

Alternatively, seqkit might give some insights.

ADD REPLY • link 19 months ago by Harsha • 0
