Hi, I am trying to use the 10x Genomics Cloud Analysis tool (https://cloud.10xgenomics.com) to analyze a published single-nucleus RNA-seq dataset. I am getting an error when uploading the files that seems to be related to the formatting of the FASTQ files themselves, but I am not sure how to solve it. Here's what I am doing:
Pull data for a single sample from GEO:
$ fastq-dump --split-files --gzip SRR12623869
This produces two files, SRR12623869_1.fastq.gz and SRR12623869_2.fastq.gz. I renamed them to follow the Illumina file naming convention: SRR12623869_S1_L001_R1_001.fastq.gz and SRR12623869_S1_L001_R2_001.fastq.gz.
Then, I use txg to upload them to 10x cloud analysis, for example:
$ ./txg fastqs upload --project-id <myProjectId> ~/pathToFolderContainingFastqFiles/
This produces the error message:
target "SRR12623869_S1_L001_R1_001.fastq.gz" is not a valid FASTQ file (could not parse flowcell ID: not a valid fastq)
target "SRR12623869_S1_L001_R2_001.fastq.gz" is not a valid FASTQ file (could not parse flowcell ID: not a valid fastq)
The top of the unzipped files is formatted like this:
$ head -n 20 SRR12623869_S1_L001_R1_001.fastq
@SRR12623869.1 1 length=26
GGTTGTAGTTGCCAATCCATTGCGTA
+SRR12623869.1 1 length=26
FFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR12623869.2 2 length=26
GAGGGTATCACTCACCTCCTTCTTAG
+SRR12623869.2 2 length=26
FFFFFFFFFFFFFFFFFFF:FFFFFF
@SRR12623869.3 3 length=26
ATCCATTGTATTTCGGGATCACATGC
+SRR12623869.3 3 length=26
FFFFFFFFFFFFF:FFFFFFFFFFFF
@SRR12623869.4 4 length=26
CTCCATGTCGTCCTTGTTAGTTGTCA
+SRR12623869.4 4 length=26
FFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR12623869.5 5 length=26
ATTACCTTCGAGTACTATAACTTCCC
+SRR12623869.5 5 length=26
FFFFFFFFFFFFFFFFFFFFFFFFFF
Anyone have a suggestion on how to address this? Thank you!
Thank you for the suggestion. I am still getting the same error after downloading with the --origfmt option. Maybe this particular dataset was uploaded without flowcell IDs?
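One way to check is to test the read names against the standard Illumina (Casava 1.8+) layout, `@instrument:run:flowcell:lane:tile:x:y`. Below is a minimal Python sketch of that check. The regular expression and the function name `flowcell_id` are my own, and the Illumina-style header in the last line is made up for illustration. This is not the validation that txg itself performs, which I have not seen.

```python
import re

# Casava 1.8+ style read name (assumed layout):
# @<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> <read>:<filtered>:<control>:<index>
ILLUMINA_HEADER = re.compile(
    r"^@(?P<instrument>[^:\s]+):(?P<run>\d+):(?P<flowcell>[^:\s]+):"
    r"(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
)


def flowcell_id(header_line):
    """Return the flowcell ID from an Illumina-style header, or None if absent."""
    match = ILLUMINA_HEADER.match(header_line.strip())
    return match.group("flowcell") if match else None


# Header taken from the file above: no colon-separated fields, so no flowcell ID.
print(flowcell_id("@SRR12623869.1 1 length=26"))
# Made-up Illumina-style header, for comparison only.
print(flowcell_id("@A00123:45:HXXXXDSXX:1:1101:1000:2000 1:N:0:ACGT"))
```

The first call returns None for the SRA-style header, which would be consistent with the "could not parse flowcell ID" message.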
I very recently encountered the exact same problem while attempting to upload to the 10x cloud, using a totally different dataset. I could not find a way around it either. However, when I ran cellranger count locally on the FASTQs (without the 10x cloud), the program completed just fine and the output matrices and web summary look reasonable to me. I am guessing the problem is specific to the file validation step during the 10x cloud upload. Not a specific solution for you, but I hope this is somewhat helpful.
The error message is quite specific: the uploader wants a flowcell ID to be present in the read headers. You could try creating a fake one, although it may actually require the full Illumina header. Unfortunately, it appears that the submitters did not submit the original Illumina headers for this data.
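If you want to try the fake-header route, here is a rough Python sketch that replaces every read name with a placeholder Illumina-style one. Everything in it is an assumption on my part: the function names, the values "FAKE", "FAKEFLOWCELL", tile 1101 and the `N:0:1` suffix are all invented, and I do not know whether the uploader will accept headers like these. The read's position in the file is used as the x coordinate so that R1 and R2 get matching names, which relies on both files listing reads in the same order.

```python
import gzip


def fake_illumina_header(read_index, read_number, instrument="FAKE",
                         run=1, flowcell="FAKEFLOWCELL", lane=1):
    """Build a placeholder Illumina-style header; every field value is made up."""
    tile = 1101       # arbitrary constant tile number
    x = read_index    # read position in the file keeps names unique and paired
    y = 1
    return f"@{instrument}:{run}:{flowcell}:{lane}:{tile}:{x}:{y} {read_number}:N:0:1"


def rewrite_headers(in_path, out_path, read_number):
    """Copy a gzipped FASTQ, replacing each read name with a placeholder header."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line_no, line in enumerate(src):
            field = line_no % 4  # 0 = name, 1 = sequence, 2 = separator, 3 = quality
            if field == 0:
                dst.write(fake_illumina_header(line_no // 4 + 1, read_number) + "\n")
            elif field == 2:
                dst.write("+\n")  # drop the repeated read name after '+'
            else:
                dst.write(line)   # sequence and quality lines are left unchanged
```

You would call it once per file, for example `rewrite_headers("SRR12623869_1.fastq.gz", "SRR12623869_S1_L001_R1_001.fastq.gz", 1)` and the same with `_2`, `R2` and `2` for the second read. Sequences and qualities are untouched, so results from a local run should not change.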