Hi
I am trying to get paired-end fastqs from a number of dbGaP-restricted SRA files and am unsure whether my output files are correct. The process I've followed is to use SRA Toolkit (version 10.8.3 running on Ubuntu) to prefetch the files, validate the download with vdb-validate, and then convert the .sra file into fastq. I have used both fasterq-dump and fastq-dump for the conversion, and the output fastq files from the two tools are of different sizes.
The steps I'm taking are as follows using SRR1293521 as an example:
Download SRA file:
./prefetch --ngc prj_26006.ngc SRR1293521
This succeeds with no errors.
Validate SRA download:
./vdb-validate --ngc prj_26006.ngc SRR1293521/SRR1293521_dbGaP-26006.sra
All validation tests are passed.
Convert to fastq with fasterq-dump: I first make a copy of SRR1293521_dbGaP-26006.sra and name the copy SRR1293521, because fasterq-dump fails with the default file name.
./fasterq-dump --ngc prj_26006.ngc SRR1293521/SRR1293521
Output:
spots read : 99,531,818
reads read : 199,063,636
reads written : 120,934,449
Resulting in 3 files:
SRR1293521_1.fastq 5.3GB
SRR1293521_2.fastq 5.3GB
SRR1293521.fastq 19.4GB
Convert to fastq with fastq-dump: I use --split-e here instead of --split-3 because of a typo in the current codebase, and I use --skip-technical because, according to this page, that should make this command functionally identical to the fasterq-dump command above.
./fastq-dump --split-e --skip-technical --ngc prj_26006.ngc SRR1293521/SRR1293521_dbGaP-26006.sra
Output:
Rejected 78129187 READS because READLEN < 1
Read 99531818 spots for SRR1293521/SRR1293521_dbGaP-26006.sra
Written 99531818 spots for SRR1293521/SRR1293521_dbGaP-26006.sra
Resulting in 3 files:
SRR1293521_dbGaP-26006_1.fastq 5.8GB
SRR1293521_dbGaP-26006_2.fastq 5.8GB
SRR1293521_dbGaP-26006.fastq 21.3GB
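As a quick sanity check on the two logs, the read counts can be reconciled with plain shell arithmetic. This is only a sketch using the numbers printed above; the variable names are mine, not anything from SRA Toolkit.

```shell
# Counts copied from the two tool logs above (variable names are illustrative)
reads_read=199063636      # fasterq-dump "reads read"
reads_written=120934449   # fasterq-dump "reads written"
rejected=78129187         # fastq-dump "Rejected ... READS because READLEN < 1"

# Reads that fasterq-dump read but did not write out
dropped=$(( reads_read - reads_written ))
echo "reads not written by fasterq-dump: $dropped"

if [ "$dropped" -eq "$rejected" ]; then
    echo "matches the fastq-dump rejected count"
else
    echo "does not match the fastq-dump rejected count"
fi
```

The difference comes out to 78129187, the same number of reads that fastq-dump reports as rejected, which suggests both tools are writing the same number of reads.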
Is it expected to get different output from what I assumed were functionally equivalent commands? If so, how do I know which fastq is the correct one? Usually I would download the raw fastq from EBI to cross-check, but because this is a protected file that option isn't available. Also, would --split-files (resulting in 2 fastqs) be better suited than --split-e for this file?
Any suggestions would be much appreciated!
I've checked what you asked using R1 as an example: both files are 85610524 lines long, so they contain the same number of reads. The header information differs because it is based on the name of the input file. I've run head on both files, and with fasterq-dump we have an example first read of:
Whereas fastq-dump, with its longer input filename, has an example first read of:
So you were spot on: the only difference between the two is the length of the header (due to the different input file names), and that accounts for the difference in file sizes. Thanks for your insight on this.
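For anyone who wants to see whether the header length alone can explain the size gap, here is a rough back-of-the-envelope estimate. It assumes that the only header difference is the extra "_dbGaP-26006" in the run name, and that the name is written on both the '@' line and the '+' line of every record; I have not verified the second assumption for every option combination, so treat it as a sketch.

```shell
# Line count of each R1 file, as reported above
lines=85610524
reads=$(( lines / 4 ))             # a FASTQ record is 4 lines

# Extra text in the fastq-dump read names (assumed to be the only difference)
suffix="_dbGaP-26006"
extra_per_name=${#suffix}          # length of the suffix in characters

# Assumption: the name appears twice per record ('@' line and '+' line)
extra_bytes=$(( reads * extra_per_name * 2 ))

echo "reads in R1: $reads"
echo "extra bytes from longer headers: $extra_bytes"
echo "approx. MB: $(( extra_bytes / 1000000 ))"
```

This gives 21402631 reads and roughly 513 MB of extra header text per file, which is in line with the 5.3GB versus 5.8GB sizes reported for the _1 and _2 files.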