Hello everyone,
I'm currently trying to figure out how the SRA Toolkit works, but I'm confused by the files I'm getting.
I ran the following commands (I also tried --split-3, which did not change the results):
./prefetch SRR17607594 -p --max-size 300g -O /data/XXXX/sratoolkit_Download
./fasterq-dump /data/XXXX/sratoolkit_Download/SRR17607594 --split-files -p --threads 80 -O /data/XXXX/scRNA/Lee_2024_PRJNA796513_Test_1 --include-technical --temp /data/XXXX/sratoolkit_Temp
--> Output: SRR17607594_1.fastq (81,2 GiB)
SRR17607594_2.fastq (227,8 GiB)
SRR17607594_3.fastq (227,8 GiB)
./fasterq-dump /data/XXXX/sratoolkit_Download/SRR17607594 --split-files -p --threads 80 -O /data/XXXX/scRNA/Lee_2024_PRJNA796513_Test_2 --temp /data/XXXX/sratoolkit_Temp
--> Output: SRR17607594_2.fastq (227,8 GiB)
SRR17607594_3.fastq (227,8 GiB)
As you can see, I get two or three files, but I can't find the suffixes "_1/_2/_3" explained in any of the official documentation, and I don't know how to handle them.
My issue is similar to this post: Three fastq files
There, user ATpoint was able to identify the different files by looking at the reads:
_1 is only a few bp so it must be the index => ignore it
_2 is 28bp so it must be R1 with CB/UMI
_3 is 91 so that is R2 with the gene expression
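The length-based reasoning above can be reproduced with a small script. This is only a rough sketch, assuming plain four-line FASTQ records (which is what fasterq-dump normally writes); the function names `open_fastq` and `read_length_counts` are made up for illustration and are not part of the SRA Toolkit.

```python
import gzip
from collections import Counter
from itertools import islice


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_length_counts(path, max_reads=10000):
    """Count sequence lengths over the first `max_reads` records.

    Assumes each record is exactly 4 lines (header, sequence, '+', quality).
    """
    counts = Counter()
    with open_fastq(path) as handle:
        # Only look at the first 4 * max_reads lines so huge files stay cheap.
        for i, line in enumerate(islice(handle, 4 * max_reads)):
            if i % 4 == 1:  # the sequence is the 2nd line of each record
                counts[len(line.strip())] += 1
    return counts
```

Calling `read_length_counts("SRR17607594_2.fastq")` on each output file returns a mapping from read length to the number of sampled reads with that length, which is enough to tell a short index read from the longer biological reads.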
But in my case the metadata lists read lengths of 8, 151, 151 for all the SRRs (SRR17607594-99), and that confuses me. So which is which?
I also don't understand why the run browser lists different file sizes for R1 and R2 under "Original format" when my resulting files (which seem rather large) are exactly the same size:
https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR17607594&display=data-access
I hope someone can help me / explain to me what's going on!
Thanks a lot in advance!
Neat
This comment was very helpful for me. After reading up on the topic, I now understand much more about what's going on here. What I still can't figure out is how you were able to tell that the _2 file is read 1 and the _3 file is read 2. Both have the same read length of 151 (which I also checked with seqkit stats -a XXXX.fastq), so I can't deduce from the length which one is R1 and which is R2. How do you do that? Or does cellranger not care if they are accidentally switched?
Thanks again!
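When two files have identical read lengths, one possible sanity check is to see which file's reads start with known cell barcodes. The sketch below is hedged: it assumes a 10x-style layout where the cell barcode is the first 16 bases of R1, and it assumes you have a barcode whitelist as a plain text file with one barcode per line. The names `load_whitelist` and `barcode_hit_fraction` are hypothetical helpers, not cellranger functions.

```python
from itertools import islice


def load_whitelist(path):
    """Read one barcode per line into a set (file format is an assumption)."""
    with open(path, "rt") as handle:
        return {line.strip() for line in handle if line.strip()}


def barcode_hit_fraction(fastq_path, whitelist, barcode_len=16, max_reads=100000):
    """Fraction of sampled reads whose first `barcode_len` bases are in `whitelist`.

    Assumes plain 4-line FASTQ records and that the barcode, if present,
    sits at the very start of the read (an assumption about the chemistry).
    """
    hits = 0
    total = 0
    with open(fastq_path, "rt") as handle:
        for i, line in enumerate(islice(handle, 4 * max_reads)):
            if i % 4 == 1:  # sequence line of each record
                total += 1
                if line.strip()[:barcode_len] in whitelist:
                    hits += 1
    return hits / total if total else 0.0
```

Under these assumptions, the file holding R1 should give a high fraction and the file holding the cDNA read a fraction near zero. If both are near zero, the whitelist or the assumed barcode position probably does not match the library.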
Since you point cellranger to a location that contains the fastq files, it should be able to figure out what it needs. If the files were submitted switched (which would be cruel and unusual), then cellranger will throw an error, since it will not be able to find any valid barcodes in the switched file. You will see something like the following.