Hello everyone,
I'm currently trying to figure out how the SRA Toolkit works, but I'm confused by the files I'm getting.
I tried the following commands. I also tried --split-3, which didn't change the results:
./prefetch SRR17607594 -p --max-size 300g -O /data/XXXX/sratoolkit_Download
./fasterq-dump /data/XXXX/sratoolkit_Download/SRR17607594 --split-files -p --threads 80 -O /data/XXXX/scRNA/Lee_2024_PRJNA796513_Test_1 --include-technical --temp /data/XXXX/sratoolkit_Temp
--> Output: SRR17607594_1.fastq (81,2 GiB)
SRR17607594_2.fastq (227,8 GiB)
SRR17607594_3.fastq (227,8 GiB)
./fasterq-dump /data/XXXX/sratoolkit_Download/SRR17607594 --split-files -p --threads 80 -O /data/XXXX/scRNA/Lee_2024_PRJNA796513_Test_2 --temp /data/XXXX/sratoolkit_Temp
--> Output: SRR17607594_2.fastq (227,8 GiB)
SRR17607594_3.fastq (227,8 GiB)
As you can see I get 2-3 files, but I can't find the suffixes "_1/_2/_3" in any of the official documentation and I don't know how to handle that.
My issue is similar to this post: Three fastq files
There, user ATpoint could deduce the different files by looking at the reads:
_1 is only few bp so it must be the index => ignore it
_2 is 28bp so it must be R1 with CB/UMI
_3 is 91 so that is R2 with the gene expression
But in my case the metadata lists read lengths of 8, 151 and 151 for all the SRRs (SRR17607594-99), and that confuses me. So which is which?
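For reference, the read lengths can also be checked directly in the FASTQ files, as was done in the linked post. This is only a minimal sketch: it assumes standard four-line FASTQ records and that the files sit in the current directory, and `read_lengths` is a hypothetical helper name, not part of the SRA Toolkit. It only looks at the first 10,000 reads, so it stays fast even on very large files.

```shell
# Hypothetical helper: print "length count" pairs for the first 10,000 reads
# of a FASTQ file (assumes 4 lines per record, sequence on line 2 of each).
read_lengths() {
    head -n 40000 "$1" \
        | awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' \
        | sort -n
}

# Report the length distribution for each output file of the run.
for f in SRR17607594_*.fastq; do
    [ -e "$f" ] || continue   # skip if the pattern matched no files
    echo "== $f"
    read_lengths "$f"
done
```

A file whose reads are all 8 bp long would correspond to the "8" in the metadata, and the two files with 151 bp reads to the "151, 151" entries. Which of the two 151 bp files holds the barcode/UMI and which holds the cDNA cannot be told from length alone.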
I also don't understand why it says that R1 and R2 have different file sizes in "Original format" when my resulting files (which seem rather large) are exactly the same size:
https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR17607594&display=data-access
I hope someone can help me / explain to me what's going on!
Thanks a lot in advance!
Neat