Question

SRA Toolkit - three fastq files - which is which?

3 months ago

Neat • 0

Hello everyone,

I'm currently trying to figure out how the SRA Toolkit works, but I'm confused by the files I'm getting.

I tried the following commands. I also tried --split-3 instead of --split-files, which didn't change the results:

./prefetch SRR17607594 -p --max-size 300g -O /data/XXXX/sratoolkit_Download

./fasterq-dump /data/XXXX/sratoolkit_Download/SRR17607594 --split-files -p --threads 80 -O /data/XXXX/scRNA/Lee_2024_PRJNA796513_Test_1 --include-technical --temp /data/XXXX/sratoolkit_Temp
--> Output: SRR17607594_1.fastq (81,2 GiB)
            SRR17607594_2.fastq (227,8 GiB)
            SRR17607594_3.fastq (227,8 GiB)

./fasterq-dump /data/XXXX/sratoolkit_Download/SRR17607594 --split-files -p --threads 80 -O /data/XXXX/scRNA/Lee_2024_PRJNA796513_Test_2 --temp /data/XXXX/sratoolkit_Temp
--> Output: SRR17607594_2.fastq (227,8 GiB)
            SRR17607594_3.fastq (227,8 GiB)

As you can see I get 2-3 files, but I can't find the suffixes "_1/_2/_3" in any of the official documentation and I don't know how to handle that.

My issue is similar to this post: Three fastq files

There, user ATpoint deduced which file was which by looking at the read lengths:

_1 is only a few bp, so it must be the index => ignore it
_2 is 28 bp, so it must be R1 with CB/UMI
_3 is 91 bp, so that is R2 with the gene expression
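The read-length check described above can be sketched in a few lines of Python. This is only an illustration: the function name `read_length_counts` and the `n_reads` parameter are made up for this example and are not part of the SRA Toolkit.

```python
import gzip
from collections import Counter
from itertools import islice


def read_length_counts(path, n_reads=10000):
    """Count read lengths in the first n_reads records of a FASTQ file.

    Hypothetical helper for illustration; accepts plain or gzipped FASTQ.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        # A FASTQ record is 4 lines; the sequence is the 2nd line of each record.
        for i, line in enumerate(islice(handle, 4 * n_reads)):
            if i % 4 == 1:
                counts[len(line.rstrip("\n"))] += 1
    return counts
```

Calling `read_length_counts("SRR17607594_2.fastq")` for each output file shows the dominant read length per file. When two files have the same length, as in this question, the lengths alone cannot tell R1 from R2.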

But in my case the metadata lists read lengths of 8, 151, 151 for all the SRRs (SRR17607594-99), and that confuses me. So which is which?

I also don't understand why the run browser lists different file sizes for R1 and R2 under "Original format" when my resulting files (which feel kind of large) are exactly the same size:

https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR17607594&display=data-access

I hope someone can help me / explain to me what's going on!

Thanks a lot in advance!

Neat

scrna sratoolkit • 943 views


score 3 · Answer 1 · 2025-04-09

You are making a valid point about a problem that generally affects reproducibility.

Fastq-dump and SRA tools, in general, are an utterly misdesigned, nonsensical suite of programs that are the scourge of bioinformatics.

The simple act of downloading a file should not be this confusing and cryptic, nor should it require random binary programs where you can't even tell beforehand what is in each file.

In your case, after inspecting the output, it looks like the two large files are the paired reads, and the different sizes listed under "Original format" reflect how well each file compresses. It seems the second read of the pair contains data that compresses differently.

For what it's worth, I get a different file size when querying ENA with bio:

bio search SRR17607594

prints:

[
    {
        "run_accession": "SRR17607594",
        "sample_accession": "SAMN24891916",
        "sample_alias": "CN25-T",
        "sample_description": "Human sample from Homo sapiens",
        "first_public": "2022-08-22",
        "country": "",
        "scientific_name": "Homo sapiens",
        "fastq_bytes": "47696043951;33618545069",
        "base_count": "163951599482",
        "read_count": "542886091",
        "library_name": "Single nuclei RNA-CN25-Tumor",
        "library_strategy": "OTHER",
        "library_source": "TRANSCRIPTOMIC SINGLE CELL",
        "library_layout": "PAIRED",
        "instrument_platform": "ILLUMINA",
        "instrument_model": "Illumina HiSeq 4000",
        "study_title": "Radial glial cell signatures with FGFR3 hypomethylation and overexpression characterize central neurocytoma",
        "fastq_url": [
            "https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR176/094/SRR17607594/SRR17607594_1.fastq.gz",
            "https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR176/094/SRR17607594/SRR17607594_2.fastq.gz"
        ],
        "info": "48 GB, 34 GB files; 542.9 million reads; 163951.6 million sequenced bases"
    }
]

score 1 · Answer 2 · 2025-04-09

3 months ago

GenoMax 152k

8, 151, 151

Even though 10x recommends sequencing R1 at 26 or 28 bp (and R2 at ~98 bp), sequencing companies often include single-cell samples in a pool that also contains regular sequencing samples. As a result, R1 and R2 end up with the same maximum length as everything else running in that pool.

In your case, the first read is indeed the Illumina index, which is not useful for downstream analysis. If you are planning to use cellranger, it will use only the parts of read 1 (in your case the _2 file) and read 2 (the _3 file) that it needs.

why it says that R1 and R2 have different files-sizes

Never depend on file sizes as a QC metric. File sizes should only be used to check that data is present (the file is not empty). Depending on the sequence content, files will compress to different sizes, so there will be size differences even if the files have exactly the same number of reads/bases.
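A small, self-contained illustration of that point is below. It uses synthetic sequences rather than real reads, and the helper name `gzipped_size` is made up for this sketch. Two strings with exactly the same number of bases are gzip-compressed and their sizes compared.

```python
import gzip
import random


def gzipped_size(text):
    """Return the size in bytes of the gzip-compressed text."""
    return len(gzip.compress(text.encode("ascii")))


random.seed(0)  # fixed seed so the synthetic data is reproducible
n_bases = 100_000

# Same number of bases, very different sequence complexity.
random_seq = "".join(random.choice("ACGT") for _ in range(n_bases))
repetitive_seq = "ACGT" * (n_bases // 4)

print("random:    ", len(random_seq), "bases ->", gzipped_size(random_seq), "bytes gzipped")
print("repetitive:", len(repetitive_seq), "bases ->", gzipped_size(repetitive_seq), "bytes gzipped")
```

The uncompressed lengths are identical, which matches the equal sizes of the uncompressed fastq files from fasterq-dump, while the gzipped sizes differ with sequence content.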

This comment was very helpful for me. After reading up on the topic, I now understand a lot more about what's going on here. What I still can't figure out, though, is how you were able to tell that the _2 file is read 1 and the _3 file is read 2. They both have the same read length of 151 (which I also checked with seqkit stats -a XXXX.fastq), so I can't deduce from that which one is R1 and which is R2. How do you do that? Or does cellranger not care if they are accidentally switched?

Thanks again!

12 weeks ago by Neat • 0

Since you point cellranger to a location that contains the fastq files, it should be able to figure out what it needs. If the files were submitted switched (which would be cruel and unusual), cellranger will throw an error, since it will not be able to find any valid barcodes in the switched file. You will see something like the following.

Log message:
An extremely low rate of correct barcodes was observed for all the candidate chemistry choices for the input: Sample SRR17607594 in "/xxx/test". Please check your input data.
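For a rough manual version of the same idea, one could measure how often the start of each read matches a known barcode list. The sketch below is not how cellranger itself works (it also detects the chemistry and corrects barcodes). The function name, the 16 bp barcode length (typical for 10x 3' chemistries) and the whitelist contents are assumptions for illustration only.

```python
from itertools import islice


def barcode_match_rate(fastq_lines, whitelist, barcode_len=16, n_reads=100000):
    """Fraction of the first n_reads reads whose leading bases are in the whitelist.

    fastq_lines: any iterable of FASTQ lines (e.g. an open file handle).
    whitelist:   a set of valid barcode strings (assumed to be supplied by the user).
    """
    total = 0
    hits = 0
    # A FASTQ record is 4 lines; the sequence is the 2nd line of each record.
    for i, line in enumerate(islice(fastq_lines, 4 * n_reads)):
        if i % 4 == 1:
            total += 1
            if line.strip()[:barcode_len] in whitelist:
                hits += 1
    return hits / total if total else 0.0
```

Run on both 151 bp files with the same whitelist, the file with a clearly higher match rate is the likely R1 (barcode/UMI) read. A near-zero rate in both files would point to a wrong whitelist or a different chemistry.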

12 weeks ago by GenoMax 152k