Hi,
I've been using sratoolkit for a while now but still get confused by the output at times. For example, I am trying to download the accession SRR12386358. This is paired-end data, and looking at the 'data access' tab it looks like they have deposited the data correctly, with fastqs for read 1 and 2.
link to accession: https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&page_size=10&acc=SRR12386358&display=data-access
When I use fastq-dump --split-files --readids I get 3 files output. Please see headers in the picture attached. Could someone please explain to me what each one of these files is? Is there a way to get these files in a format that I can use for cellranger? I have tried --split-3 but the output is a single file, and I get similar output using fasterq-dump.
Thank you very much for the response.
Do you have any idea why I got that output and not the expected read 1 and read 2? Can these files still be used as they are in cellranger or do I need to try and download them again with sratoolkit?
You can use files 2 and 3 with cellranger.
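One way to check which file is which is to look at the read lengths in each file. The snippet below is only a rough sketch, not an official tool: the function names are made up for illustration, it assumes uncompressed FASTQ with 4-line records, and the cutoffs (about 8 bases for a sample index, 26 or 28 bp for cell barcode + UMI) are the rule of thumb discussed in this thread. The "10 bases or fewer" cutoff for an index read is my own assumption.

```python
from collections import Counter
from itertools import islice


def read_lengths(fastq_path, max_reads=1000):
    """Count sequence lengths in the first max_reads records of a FASTQ file."""
    lengths = Counter()
    with open(fastq_path) as handle:
        # The sequence is the 2nd line of every 4-line FASTQ record.
        for i, line in enumerate(islice(handle, max_reads * 4)):
            if i % 4 == 1:
                lengths[len(line.strip())] += 1
    return lengths


def guess_read_type(fastq_path, max_reads=1000):
    """Guess the read type from the most common read length (heuristic only)."""
    lengths = read_lengths(fastq_path, max_reads)
    if not lengths:
        return "empty"
    most_common_length = lengths.most_common(1)[0][0]
    if most_common_length <= 10:  # assumed cutoff for a short index read
        return "sample index"
    if most_common_length in (26, 28):
        return "cell barcode + UMI"
    # Some submissions store the barcode read at full length, so a long
    # read cannot be told apart from the RNA read by length alone.
    return "RNA read (or untrimmed barcode read)"
```

Calling guess_read_type("SRR12386358_2.fastq") on each of the three files should give a quick hint, but it is worth confirming by eye with the first few records of each file.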
Thanks GenoMax, I imagine this problem may occur again with future accessions.
I have some understanding of reading fastq headers, but I've never come across how to tell the difference between the Illumina barcode for the sample and the Cellbarcode + UMI. Could you please explain how to tell the difference between these two?
When entering these into cellranger should I put _3 (rna read) as read 1 and _2 (cellbarcode +UMI) as read 2 or does it not really matter?
Thank you very much
Read 1 is standard Illumina index. It will be short like the 8 bases here. Illumina indexes are not used by cellranger. They are only used for demultiplexing. Read 2 is the Cellbarcode + UMI since depending on type of kit it is read as 26 or 28 bp. It could also be the same length as RNA read in some submissions. cellranger will use the right number of bases required (26 or 28).
When I try and run this on cellranger I get the following error message:
[error] pipestance failed: Error log at: SRR12386358/SC_RNA_COUNTER_CS/SC_MULTI_CORE/MULTI_CHEMISTRY_DETECTOR/DETECT_COUNT_CHEMISTRY/fork0/chnk0-u86d2a17bf/_errors
Log message: FASTQ header mismatch detected at line 4 of input files "SRR12386358_S1_L001_R1_001.fastq" and "SRR12386358_S1_L001_R2_001.fastq", line: 4
This is using _2 and _3
Any idea why that may be occurring, GenoMax?
Thanks
Did you do something to the files, e.g. scan/trim them independently? If not, it is possible that your files are out of sync and/or corrupt. You can try repair.sh from BBMap suite to bring them back in sync. Or redownload.
Edit: Tested a small sample of reads from your accession and things worked without issues with cellranger.
I did not do anything to the files. I corrected this problem by using -F with fastq-dump. When I took a look at line 4 of the input files the SRR IDs were slightly different. All good now but thanks for getting back to me.
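For anyone hitting the same header mismatch, the read names in the two files can be compared directly before running cellranger. The snippet below is a minimal sketch with made-up function names, assuming uncompressed FASTQ with 4-line records. It also assumes the mismatch comes from --readids appending a read number to the spot ID (names like SRR12386358.1.2 in one file and SRR12386358.1.3 in the other), which seems consistent with the slightly different SRR IDs described above; the strip_read_id option drops that last field before comparing.

```python
from itertools import zip_longest


def read_name(header_line, strip_read_id=False):
    """Return the first whitespace-delimited token of a header, without '@'."""
    tokens = header_line.split()
    name = tokens[0].lstrip("@") if tokens else ""
    if strip_read_id:
        # Assumption: names look like 'accession.spot.readid'; drop '.readid'
        # so that mates written to different files compare equal.
        parts = name.split(".")
        if len(parts) == 3:
            name = ".".join(parts[:2])
    return name


def first_header_mismatch(path_a, path_b, strip_read_id=False):
    """Return (record_number, name_a, name_b) for the first mismatch, else None."""
    with open(path_a) as fa, open(path_b) as fb:
        for line_no, (a, b) in enumerate(zip_longest(fa, fb), start=1):
            if line_no % 4 != 1:  # header lines are lines 1, 5, 9, ...
                continue
            record = (line_no + 3) // 4
            if a is None or b is None:
                return (record, a, b)  # one file has fewer records
            name_a = read_name(a, strip_read_id)
            name_b = read_name(b, strip_read_id)
            if name_a != name_b:
                return (record, name_a, name_b)
    return None
```

If first_header_mismatch reports a mismatch with the default settings but returns None with strip_read_id=True, the reads are most likely in sync and only the names differ. In that case re-dumping with different header options (as with -F above) is probably enough, whereas a mismatch that persists points to files that are genuinely out of sync.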