Hi,
I am trying to download some fastq files from SRA to align with cellranger. The GEO series I am using is GSE246613. The issue I am having is that many of the run accessions, for example SRR26540978, contain reads from multiple lanes. I have attempted to download these accessions with sratoolkit using the following commands:
fasterq-dump SRR26540978 --split-files # (output is 2 fastq files)
fasterq-dump SRR26540978 #(output is 2 fastq files)
fasterq-dump SRR26540978 --concatenate-reads # (output is a single fastq file)
fastq-dump SRR26540978 --split-files # (output is 2 fastq files)
If I rename the files to SRR26540978_S1_L003_R1_001 and SRR26540978_S1_L003_R2_001 to be compatible with cellranger, I get the following error:
Log message:
Unable to distinguish between [SC5P-PE, SC3Pv2] chemistries based on the R2 read mapping for Sample SRR26540978 in "/data1/fastq_files".
Total Reads = 100000
Mapped reads = 78070
Sense reads = 21725
Antisense reads = 20316
In order to distinguish between the 3' vs 5' assay configuration the following conditions need to be satisfied:
- A minimum of 1000 confidently mapped reads
- A minimum of 5.0% of the total reads considered needs to be confidently mapped
- The number of sense reads need to be at least 2x compared to the antisense reads or vice versa
Please validate the inputs and/or specify the chemistry via the --chemistry argument.
I have attempted to download the fastq files directly from ENA; however, the data was only recently released and has not yet been updated there.
Does anyone have any idea how to download the following fastq files using SRA toolkit? There are many accessions with a similar format that I require as well:
fastq 1 2023-10-27 7.5GB AWS s3://sra-pub-src-14/SRR26540978/FT-SA11620_S13_L003_R1_001.fastq.gz.1 - Use Cloud Data Delivery
fastq 1 2023-10-27 7.4GB AWS s3://sra-pub-src-14/SRR26540978/FT-SA11620_S13_L003_R2_001.fastq.gz.1 - Use Cloud Data Delivery
fastq 1 2023-10-27 16.5GB AWS s3://sra-pub-src-14/SRR26540978/FT-SA13903_S3_L004_R1_001.fastq.gz.1 - Use Cloud Data Delivery
fastq 1 2023-10-27 15.9GB AWS s3://sra-pub-src-14/SRR26540978/FT-SA13903_S3_L004_R2_001.fastq.gz.1 - Use Cloud Data Delivery
Once I am able to download all of the required fastqs I understand how to concatenate the sequencing lanes into one file and then align that file with cellranger.
Any assistance would be much appreciated.
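One possible route to the original per-lane files listed above is to rewrite the s3:// paths as ordinary HTTPS URLs, or to fetch them with the AWS CLI. This is only a sketch: whether the sra-pub-src-14 bucket allows anonymous access to these objects is an assumption not confirmed in this thread (the "Use Cloud Data Delivery" note may mean it does not), and the helper name s3_to_https is hypothetical.

```shell
# Convert an s3://bucket/key URI into a virtual-hosted-style HTTPS URL.
# Assumption: the bucket allows anonymous reads; not confirmed in this thread.
s3_to_https() {
    uri=${1#s3://}        # drop the scheme
    bucket=${uri%%/*}     # text before the first slash
    key=${uri#*/}         # everything after the first slash
    printf 'https://%s.s3.amazonaws.com/%s\n' "$bucket" "$key"
}

# Hypothetical usage (not run here; both need network access):
#   aws s3 cp --no-sign-request \
#       s3://sra-pub-src-14/SRR26540978/FT-SA11620_S13_L003_R1_001.fastq.gz.1 .
#   curl -O "$(s3_to_https s3://sra-pub-src-14/SRR26540978/FT-SA11620_S13_L003_R1_001.fastq.gz.1)"
```

If the request is denied, the objects are probably not openly readable and the Cloud Data Delivery service named in the listing would be the route to try.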
try https://nf-co.re/fetchngs/1.11.0 ?
My recommendation is to use download links from sra-explorer.info for the Bioproject Number PRJNA1032700 and get links for every fastq file.
CellRanger does not need files to be concatenated. It does it internally, if the fastqs per sample are in the same folder and follow the CellRanger naming convention, see its manual.
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE246613 also states that the samples with the "_P" suffix are 5' v1 chemistry, so you can tell CellRanger what it is.
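If the original per-lane files can be obtained, a rename along these lines should put them into the Cell Ranger naming pattern while keeping the lane and read fields, so the lanes stay separate files under one sample name. This is a rough sketch: the function name cr_name is made up for illustration, and it assumes the originals look like the FT-SA11620_S13_L003_R1_001.fastq.gz.1 names listed in the question.

```shell
# Map an original SRA source file name onto Cell Ranger's
# <sample>_S1_L00<lane>_<read>_001.fastq.gz pattern, keeping lane and read type.
cr_name() {
    sample=$1
    original=$2
    # Pull out the lane field (e.g. L003) and the read field (e.g. R1).
    lane=$(printf '%s\n' "$original" | sed -n 's/.*_\(L[0-9][0-9][0-9]\)_.*/\1/p')
    read_type=$(printf '%s\n' "$original" | sed -n 's/.*_\([RI][12]\)_001\.fastq\.gz.*/\1/p')
    # Fail rather than guess if either field is missing.
    [ -n "$lane" ] && [ -n "$read_type" ] || return 1
    printf '%s_S1_%s_%s_001.fastq.gz\n' "$sample" "$lane" "$read_type"
}

# Hypothetical usage inside a directory of downloaded files:
#   for f in *.fastq.gz.1; do mv "$f" "$(cr_name SRR26540978 "$f")"; done
```

Because the lane number is carried over, the L003 and L004 files get distinct names and can sit in the same folder for one cellranger run.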
There are 266 samples in this project with 2.5+ TB of data, so that is something to keep in mind. This data also seems to have been made public this month, so ENA has likely not processed it yet and sra-explorer is not generating fastq links (at least at the time I looked at it). The data also appears to have been sequenced as non-standard 151 x 151 bp. No original cellranger BAMs are available. So overall par for the course for 10x data in SRA. Messy. fastq-dump may be the best option to get the data for now.
Thanks for the suggestion. I've tried going to https://sra-explorer.info/ --> PRJNA1032700 --> any accession and Raw Fastq Download URLs. It seems to just load for a very long time (>1hr) and still does not output any URLs. It does state 'To download FastQ files directly, sra-explorer queries the ENA for each SRA run accession number.' I'm wondering if this is because the files are not actually available on ENA?
I've run cellranger again with the chemistry set to fiveprime and it seems to be running so far. Hopefully that works, but I am assuming I will lose a number of reads unless the 2 files that sratoolkit fasterq-dump generated somehow combine the reads from the different lanes.
Could I also please confirm how you knew "_P" suffix samples were 5' chemistry?
Thanks
You won't lose reads as long as you are running cellranger with all files in the same directory and the correct options (file naming convention: https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/inputs/cr-specifying-fastqs ). L003/L004 in the file name shows that the reads came from two lanes of a NovaSeq 6000 S4 flowcell.
If you go to one of the samples, e.g. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM7872698 , then you will see the extraction protocol description.
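For reference, the kind of cellranger call being discussed might look like the sketch below. The helper only prints the command so it can be checked before running; print_count_cmd is a made-up name and the reference path is a placeholder, not something taken from this thread.

```shell
# Print (do not run) a cellranger count command for one sample.
# All values are placeholders supplied by the caller.
print_count_cmd() {
    sample=$1
    fastq_dir=$2
    ref_dir=$3
    chemistry=$4
    printf 'cellranger count --id=%s --transcriptome=%s --fastqs=%s --sample=%s --chemistry=%s\n' \
        "$sample" "$ref_dir" "$fastq_dir" "$sample" "$chemistry"
}

# Hypothetical usage; /path/to/refdata stands in for a real reference directory:
#   print_count_cmd SRR26540978 /data1/fastq_files /path/to/refdata fiveprime
# Recent Cell Ranger releases may also require an explicit --create-bam option;
# check the documentation for the installed version.
```

With --fastqs pointing at the folder and --sample set to the shared file name prefix, cellranger should pick up every lane file for that sample on its own.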
Thanks for that.
As fasterq-dump only generated 2 fastq files (SRR26540978_R1 and SRR26540978_R2), I ran cellranger on those 2 fastqs only, setting the chemistry to fiveprime.
I wasn't sure if there was a way to tell whether they were the lane 003 or lane 004 reads. Some of the other SRRs have even more lanes and reads, so that's why I think I may be missing reads. Is that correct?
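One way to check this, as far as I understand it: fasterq-dump writes out every spot stored in the run, so reads from all lanes should already be in the same pair of files, just without the lane labels. Counting the reads in each file and comparing that number with the spot count shown on the run's SRA page would confirm whether anything is missing. A minimal sketch, with count_reads as a made-up helper name and assuming standard 4-line FASTQ records:

```shell
# Count reads in a FASTQ file (plain or gzip-compressed).
# Assumes standard 4-line records with no wrapped sequence lines.
count_reads() {
    case $1 in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { print NR / 4 }'
}

# Hypothetical usage; file names depend on how the dump was run and renamed:
#   count_reads SRR26540978_1.fastq
#   count_reads SRR26540978_2.fastq
```

If both files give the same count and it matches the run's spot count, nothing was dropped.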