SRA Toolkit - three fastq files - which is which?
8 days ago
Neat • 0

Hello everyone,

I'm currently trying to figure out how the SRA Toolkit works, but I'm confused by the files I'm getting.

I tried the following commands. I also tried --split-3, which didn't change the results:

./prefetch SRR17607594 -p --max-size 300g -O /data/XXXX/sratoolkit_Download

./fasterq-dump /data/XXXX/sratoolkit_Download/SRR17607594 --split-files -p --threads 80 -O /data/XXXX/scRNA/Lee_2024_PRJNA796513_Test_1 --include-technical --temp /data/XXXX/sratoolkit_Temp
    --> Output:     SRR17607594_1.fastq (81,2 GiB)
                    SRR17607594_2.fastq (227,8 GiB)
                    SRR17607594_3.fastq (227,8 GiB)

./fasterq-dump /data/XXXX/sratoolkit_Download/SRR17607594 --split-files -p --threads 80 -O /data/XXXX/scRNA/Lee_2024_PRJNA796513_Test_2 --temp /data/XXXX/sratoolkit_Temp
    --> Output:     SRR17607594_2.fastq (227,8 GiB)
                    SRR17607594_3.fastq (227,8 GiB)

As you can see I get 2-3 files, but I can't find the suffixes "_1/_2/_3" in any of the official documentation and I don't know how to handle that.

My issue is similar to this post: Three fastq files

There, user ATpoint could deduce the different files by looking at the reads:

_1 is only few bp so it must be the index => ignore it
_2 is 28bp so it must be R1 with CB/UMI
_3 is 91 so that is R2 with the gene expression

But in my case the metadata says 8, 151, 151 for all the SRRs (SRR17607594-99), and that confuses me. So which is which?

I also don't understand why it says that R1 and R2 have different file sizes in "Original format" when my resulting files (which feel kind of large) are exactly the same size:

https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR17607594&display=data-access

I hope someone can help me / explain to me what's going on!

Thanks a lot in advance!

Neat

scrna sratoolkit • 358 views
8 days ago

You are making a valid point about a problem that generally affects reproducibility.

Fastq-dump and SRA tools, in general, are an utterly misdesigned, nonsensical suite of programs that are the scourge of bioinformatics.

The simple act of downloading a file should not be this confusing and cryptic, nor should it require random binary programs where you can't even tell beforehand what is in each file.

In your case, after inspecting the output, it looks like the two large files are the paired reads, and their different sizes in the original format reflect their compression rates. It seems the second read of the pair has data that compresses differently.

For what it's worth, I get a different file size when querying ENA with bio:

bio search SRR17607594

prints:

[
    {
        "run_accession": "SRR17607594",
        "sample_accession": "SAMN24891916",
        "sample_alias": "CN25-T",
        "sample_description": "Human sample from Homo sapiens",
        "first_public": "2022-08-22",
        "country": "",
        "scientific_name": "Homo sapiens",
        "fastq_bytes": "47696043951;33618545069",
        "base_count": "163951599482",
        "read_count": "542886091",
        "library_name": "Single nuclei RNA-CN25-Tumor",
        "library_strategy": "OTHER",
        "library_source": "TRANSCRIPTOMIC SINGLE CELL",
        "library_layout": "PAIRED",
        "instrument_platform": "ILLUMINA",
        "instrument_model": "Illumina HiSeq 4000",
        "study_title": "Radial glial cell signatures with FGFR3 hypomethylation and overexpression characterize central neurocytoma",
        "fastq_url": [
            "https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR176/094/SRR17607594/SRR17607594_1.fastq.gz",
            "https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR176/094/SRR17607594/SRR17607594_2.fastq.gz"
        ],
        "info": "48 GB, 34 GB files; 542.9 million reads; 163951.6 million sequenced bases"
    }
]
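
As a sanity check on these numbers, here is a minimal sketch in shell with awk that uses only the values printed above (the helper names bytes_to_gb and bases_per_spot are made up for illustration). It converts fastq_bytes to GB and divides base_count by read_count. The result is 302 bases per spot, i.e. 151 + 151, which suggests the two ENA files hold the two 151 bp reads and not the 8 bp index.

```shell
#!/usr/bin/env bash
# Values copied from the `bio search SRR17607594` output above.
fastq_bytes="47696043951;33618545069"
base_count=163951599482
read_count=542886091

# Hypothetical helper: convert a ';'-separated list of byte counts to GB (10^9 bytes).
bytes_to_gb() {
    echo "$1" | tr ';' '\n' | awk '{ printf "%.1f GB\n", $1 / 1e9 }'
}

# Hypothetical helper: average number of bases per spot.
bases_per_spot() {
    awk -v b="$1" -v r="$2" 'BEGIN { printf "%.1f\n", b / r }'
}

bytes_to_gb "$fastq_bytes"                    # prints 47.7 GB and 33.6 GB
bases_per_spot "$base_count" "$read_count"    # prints 302.0
```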
8 days ago
GenoMax 150k

8, 151, 151

Even though the recommendation from 10x is to sequence R1 as 26 or 28 bp (and R2 as ~98 bp), sequencing companies will often include single-cell samples in a pool that also contains regular sequencing samples. As a result, R1 and R2 end up with the same maximum length as whatever else is running in that pool.

In your case, the first read is indeed the Illumina index, which is not useful for downstream analysis. If you are planning to use cellranger, it will use only the parts of read 1 (in your case the _2 file) and read 2 (the _3 file) that it needs.
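
To confirm this on your own files rather than relying on the metadata, one option is to tabulate read lengths for the first few thousand reads in each file. Below is a minimal sketch (the function name read_lengths is made up for illustration, and it assumes plain four-line FASTQ records, which is what fasterq-dump normally writes). If the metadata is right, the _1 file should show 8 and the _2 and _3 files should show 151.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tabulate read lengths among the first N reads of a FASTQ file.
# Assumes an uncompressed FASTQ with exactly 4 lines per record.
# Prints one "length count" pair per line, sorted by length.
read_lengths() {
    local fastq="$1" n="${2:-10000}"
    head -n "$((n * 4))" "$fastq" \
        | awk 'NR % 4 == 2 { count[length($0)]++ }
               END { for (len in count) print len, count[len] }' \
        | sort -n
}

# Usage (file names taken from the question):
# for f in SRR17607594_1.fastq SRR17607594_2.fastq SRR17607594_3.fastq; do
#     echo "== $f"; read_lengths "$f" 10000
# done
```

Only the head of each file is read, so this stays fast even on files of several hundred GiB.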

why it says that R1 and R2 have different files-sizes

Never depend on file sizes as a QC metric. File sizes should only be used to check whether data is present (file not empty). Depending on the sequence content, files will compress to different sizes, and there will be size differences even if the files have exactly the same number of reads/bases.
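
A more meaningful check than file size is the number of records in each file. Here is a minimal sketch, again assuming uncompressed four-line FASTQ records; the helper names count_reads and same_read_count are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count records in an uncompressed FASTQ file,
# assuming exactly 4 lines per record.
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo "$((lines / 4))"
}

# Hypothetical helper: succeed only if two FASTQ files hold the same number of records.
same_read_count() {
    [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}

# Usage (file names taken from the question):
# same_read_count SRR17607594_2.fastq SRR17607594_3.fastq && echo "read counts match"
```

Counting lines in files of this size takes a while, but paired files should give identical counts regardless of how differently they compress.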
