Question

Accessing SRA file with multiple fastqs under one accession

8 months ago

tony_88888 • 0

Hi,

I am trying to download some fastq files from SRA to align with cellranger. The accession number I am using is GSE246613. The issue I am having is that many of the run accessions, for example SRR26540978, contain reads from multiple lanes. I have attempted to download these accessions with sratoolkit using the following commands:

fasterq-dump SRR26540978 --split-files # (output is 2 fastq files)
fasterq-dump SRR26540978 # (output is 2 fastq files)
fasterq-dump SRR26540978 --concatenate-reads # (output is a single fastq file)
fastq-dump SRR26540978 --split-files # (output is 2 fastq files)

If I rename the files to SRR26540978_S1_L003_R1_001 and SRR26540978_S1_L003_R2_001 to be compatible with cellranger, I get the following error:

    Log message:
    Unable to distinguish between [SC5P-PE, SC3Pv2] chemistries based on the R2 read mapping for Sample SRR26540978 in "/data1/fastq_files".
    Total Reads          = 100000
    Mapped reads         = 78070
    Sense reads          = 21725
    Antisense reads      = 20316

    In order to distinguish between the 3' vs 5' assay configuration the following conditions need to be satisfied:
    - A minimum of 1000 confidently mapped reads
    - A minimum of 5.0% of the total reads considered needs to be confidently mapped
    - The number of sense reads need to be at least 2x compared to the antisense reads or vice versa

    Please validate the inputs and/or specify the chemistry via the --chemistry argument.
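
The three conditions in the log can be re-checked by hand with the numbers it reports. The sketch below is only an illustration: the function name is made up, and it assumes the "Mapped reads" figure stands in for the confidently mapped count. It is not how cellranger itself is implemented.

```shell
# Re-check the three conditions from the log with integer arithmetic.
# Assumption: "Mapped reads" is treated as the confidently mapped count.
# check_chemistry_conditions is a hypothetical helper, not part of cellranger.
check_chemistry_conditions() (
    total=$1; mapped=$2; sense=$3; antisense=$4
    status=ok
    # 1. at least 1000 confidently mapped reads
    [ "$mapped" -ge 1000 ] || status=ambiguous
    # 2. at least 5.0% of the total reads mapped (mapped/total >= 5/100)
    [ $((mapped * 100)) -ge $((total * 5)) ] || status=ambiguous
    # 3. sense >= 2x antisense, or antisense >= 2x sense
    if [ "$sense" -lt $((antisense * 2)) ] && [ "$antisense" -lt $((sense * 2)) ]; then
        status=ambiguous
    fi
    echo "$status"
)

# Values copied from the log above
check_chemistry_conditions 100000 78070 21725 20316   # prints "ambiguous"
```

With these numbers the first two conditions pass (78070 mapped reads, about 78% of the total), but 21725 sense vs 20316 antisense is a ratio of roughly 1.07, far below 2x, so only the third condition fails.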

I have attempted to download the fastq files directly from ENA; however, the data was only recently released and is not yet available there.

Does anyone have any idea how to download the following fastq files using SRA toolkit? There are many accessions with a similar format that I require as well:

fastq   1   2023-10-27  7.5GB   AWS s3://sra-pub-src-14/SRR26540978/FT-SA11620_S13_L003_R1_001.fastq.gz.1   -   Use Cloud Data Delivery
fastq   1   2023-10-27  7.4GB   AWS s3://sra-pub-src-14/SRR26540978/FT-SA11620_S13_L003_R2_001.fastq.gz.1   -   Use Cloud Data Delivery
fastq   1   2023-10-27  16.5GB  AWS s3://sra-pub-src-14/SRR26540978/FT-SA13903_S3_L004_R1_001.fastq.gz.1    -   Use Cloud Data Delivery
fastq   1   2023-10-27  15.9GB  AWS s3://sra-pub-src-14/SRR26540978/FT-SA13903_S3_L004_R2_001.fastq.gz.1    -   Use Cloud Data Delivery

Once I am able to download all of the required fastqs I understand how to concatenate the sequencing lanes into one file and then align that file with cellranger.

Any assistance would be much appreciated.

sratoolkit fastq sra • 1.3k views

ADD COMMENT • link updated 8 months ago by GenoMax 146k • written 8 months ago by tony_88888 • 0

Does anyone have any idea how to download the following fastq files using SRA toolkit?

try https://nf-co.re/fetchngs/1.11.0 ?

ADD REPLY • link 8 months ago by Pierre Lindenbaum 164k

My recommendation is to use download links from sra-explorer.info for the Bioproject Number PRJNA1032700 and get links for every fastq file.

CellRanger does not need the files to be concatenated. It does that internally if the fastqs for a sample are in the same folder and follow the CellRanger naming convention (see its manual).
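
As a rough illustration of the naming convention, here is a small check for whether a file name looks like `[Sample]_S[n]_L[lane]_[R1|R2|I1|I2]_001.fastq.gz`. The helper name is hypothetical, and accepting an uncompressed `.fastq` suffix as well is an assumption; the Cell Ranger manual remains the authority on what is accepted.

```shell
# is_cellranger_name is a hypothetical helper for illustration only.
# Pattern: [Sample]_S[n]_L[3-digit lane]_[R1|R2|I1|I2]_001.fastq.gz
# Assumption: a plain .fastq suffix is tolerated as well as .fastq.gz.
is_cellranger_name() {
    printf '%s\n' "$1" |
        grep -Eq '^.+_S[0-9]+_L[0-9]{3}_[RI][12]_001\.fastq(\.gz)?$'
}

# File names taken from the listing in the question
for f in FT-SA11620_S13_L003_R1_001.fastq.gz \
         FT-SA11620_S13_L003_R1_001.fastq.gz.1; do
    if is_cellranger_name "$f"; then
        echo "$f: ok"
    else
        echo "$f: needs renaming"
    fi
done
```

The second name fails only because of the trailing `.1` that the SRA listing appends, so the original submitter files would fit the convention once that suffix is dropped.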

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE246613 also states that the samples with the "_P" suffix are 5' v1 chemistry, so you can tell CellRanger what it is.

ADD REPLY • link 8 months ago by ATpoint 84k

download links from sra-explorer.info for the Bioproject Number PRJNA1032700 and get links for every fastq file.

There are 266 samples in this project with 2.5+ TB of data, so that is something to keep in mind. The data also seems to have been made public this month, so ENA has likely not processed it yet, which would explain why sra-explorer is not generating fastq links (at least at the time I looked).

Data also appears to have been sequenced as non-standard 151 x 151 bp. No original cellranger BAMs are available. So overall par for the course for 10x data in SRA. Messy.

fastq-dump may be the best option to get the data for now.
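
With this many runs, it may help to generate the download commands from a list of accessions and review them before running anything. A minimal sketch, assuming one run accession per line on standard input; the function name and the `fastq_files` output directory are placeholders, and the commands are only printed, not executed.

```shell
# print_dump_commands is a hypothetical helper: it prints one fasterq-dump
# command per accession (dry run), writing each run to its own directory.
# Blank lines and lines starting with '#' in the input are skipped.
print_dump_commands() {
    outdir=$1
    while IFS= read -r acc; do
        case $acc in
            ''|'#'*) continue ;;
        esac
        printf 'fasterq-dump --split-files --outdir %s/%s %s\n' \
            "$outdir" "$acc" "$acc"
    done
}

printf 'SRR26540978\n' | print_dump_commands fastq_files
# prints: fasterq-dump --split-files --outdir fastq_files/SRR26540978 SRR26540978
```

Once the printed commands look right, the output can be piped to `sh`. Given the 2.5+ TB total mentioned above, checking free disk space first is worthwhile.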

ADD REPLY • link 8 months ago by GenoMax 146k

Thanks for the suggestion. I've tried going to https://sra-explorer.info/ --> PRJNA1032700 --> any accession and Raw Fastq Download URLs. It just loads for a very long time (>1hr) and still does not output any URLs. It does state 'To download FastQ files directly, sra-explorer queries the ENA for each SRA run accession number.' I'm wondering if this is because the files are not actually available on ENA yet?

I've run cellranger again with --chemistry=fiveprime and it seems to be running so far. Hopefully that works, but I am assuming I will lose a number of reads unless the 2 files that sratoolkit fasterq-dump generated somehow combined the reads from the different lanes.

Could I also please confirm how you knew "_P" suffix samples were 5' chemistry?

Thanks

ADD REPLY • link 8 months ago by tony_88888 • 0

I will lose a number of reads unless the 2 files that sratoolkit fasterq-dump generated somehow combined the reads from the different lanes.

You won't lose reads as long as you run cellranger with all the files in the same directory and the correct options (file naming convention: https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/inputs/cr-specifying-fastqs). The L003/L004 in the file names shows that the reads came from two lanes of a NovaSeq 6000 S4 flowcell.

Could I also please confirm how you knew "_P" suffix samples were 5' chemistry?

If you go to one of the samples, e.g. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM7872698, you will see the extraction protocol description:

Single-cell gene expression and paired TCR (CD45-positive only) libraries were prepared according to the 10X Genomics Single Cell 5’ v1 protocol.

ADD REPLY • link 8 months ago by GenoMax 146k

Thanks for that.

As fasterq-dump only generated 2 fastq files (SRR26540978_R1 and SRR26540978_R2), I ran cellranger on those 2 fastqs only, setting the chemistry to fiveprime.
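
For reference, the renaming step can be scripted rather than done by hand. The sketch below assumes the default fasterq-dump output names (`SRR26540978_1.fastq`, `SRR26540978_2.fastq`), that mate 1 corresponds to R1 and mate 2 to R2, and a fixed `S1` sample index; the function name and the lane argument are illustrative, since the lane cannot be known from the file name alone.

```shell
# cellranger_name is a hypothetical helper: it prints the Cell Ranger style
# name for a fasterq-dump output file. It only prints; it does not rename.
#   $1 = file name such as SRR26540978_1.fastq
#   $2 = lane number to put in the name (assumed; defaults to 1)
cellranger_name() {
    file=$1
    lane=${2:-1}
    stem=${file%%.*}        # e.g. SRR26540978_1
    ext=${file#"$stem"}     # e.g. .fastq or .fastq.gz
    acc=${stem%_*}          # e.g. SRR26540978
    mate=${stem##*_}        # 1 or 2
    printf '%s_S1_L%03d_R%s_001%s\n' "$acc" "$lane" "$mate" "$ext"
}

cellranger_name SRR26540978_1.fastq 3   # prints SRR26540978_S1_L003_R1_001.fastq
cellranger_name SRR26540978_2.fastq 3   # prints SRR26540978_S1_L003_R2_001.fastq
```

A rename would then look like `mv "$f" "$(cellranger_name "$f" 3)"` for each file.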

I wasn’t sure if there was a way to tell whether they were the lane 003 or lane 004 reads. Some of the other SRRs have even more lanes and reads, so that’s why I think I may be missing reads. Is that correct?

ADD REPLY • link 8 months ago by tony_88888 • 0

score 0 · Answer 1 · 2024-01-26

8 months ago

GenoMax 146k

Some of the other SRRs have even more lanes and reads

Depending on the amount of sequence needed, some samples may have been sequenced on more than one lane to get more reads.

I don't recall whether these dumps recover the original Illumina read headers (you will need to use the -F option when you run fastq-dump to get those). If the read headers are in the original Illumina format, then you should be able to tell which lane the data is from.
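
If the headers do come back in the original Illumina format (`@instrument:run:flowcell:lane:tile:x:y`), the lane is the fourth colon-separated field, so the reads per lane can be tallied with awk. A minimal sketch under that assumption, for 4-line FASTQ records; the function name and the example reads are made up.

```shell
# count_reads_per_lane is a hypothetical helper: it tallies reads per lane,
# assuming 4-line FASTQ records and original Illumina headers where the
# lane is the 4th colon-separated field of the read name.
count_reads_per_lane() {
    awk 'NR % 4 == 1 {
             split($1, f, ":")
             n[f[4]]++
         }
         END { for (lane in n) printf "lane %s: %d reads\n", lane, n[lane] }' "$@" |
        sort
}

# Made-up example records: two reads from lane 3 and one from lane 4
count_reads_per_lane <<'EOF'
@A00000:1:HFLOWCELL:3:1101:1000:1000
ACGT
+
FFFF
@A00000:1:HFLOWCELL:3:1101:1000:1001
ACGT
+
FFFF
@A00000:1:HFLOWCELL:4:1101:1000:1002
ACGT
+
FFFF
EOF
```

On a real download this would be run as `count_reads_per_lane SRR26540978_1.fastq`. If the headers were rewritten to the `SRR...` form instead, the fourth field will be empty and the tally is meaningless.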

ADD COMMENT • link 8 months ago by GenoMax 146k

Thanks for your help. I've managed to proceed and download a lot of the other accessions in the dataset, and they seem to be aligning fine with cellranger without having to specify the chemistry.

Using -F with fastq-dump, it appears that each accession contains just the reads from one lane, usually lane 1.