cellranger count expects a specific naming convention for the fastq files. Please see the last section here, "My FASTQs are not named like any of the above examples".
Basically, this is how your file names should look: [Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz.
For the Read Type, you can take a look at your fastq files with head to see what is what. The link above explains different read types.
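To make the naming pattern concrete, here is a minimal Python sketch that builds and checks file names of that shape. The function names (`cellranger_fastq_name`, `parse_fastq_name`) are made up for illustration and are not part of cellranger. The sketch assumes the sample number is always `S1`, as in the pattern quoted above.

```python
import re

# Pattern for the scheme quoted above:
# [Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz
# The optional ".gz" is an assumption so that plain .fastq names also parse.
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read_type>[RI][12])_001\.fastq(?:\.gz)?$"
)


def cellranger_fastq_name(sample, lane, read_type, gzipped=True):
    """Build a file name following the convention described in the thread."""
    if read_type not in {"R1", "R2", "I1", "I2"}:
        raise ValueError(f"unexpected read type: {read_type}")
    suffix = ".fastq.gz" if gzipped else ".fastq"
    return f"{sample}_S1_L{lane:03d}_{read_type}_001{suffix}"


def parse_fastq_name(filename):
    """Return the name components as a dict, or None if the name does not fit."""
    match = FASTQ_NAME.match(filename)
    return match.groupdict() if match else None


print(cellranger_fastq_name("SRR8526547", 1, "R1"))
print(parse_fastq_name("SRR8526547_1.fastq"))  # None: does not fit the pattern
```

Running `parse_fastq_name` over a directory listing is a quick way to spot files that still need renaming before calling cellranger count.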
Thanks for your reply.
Upon closer inspection, I think the fastq files I downloaded have been modified, i.e. they do not look like normal fastq format.
@Haci is only referring to the file name format. As posted, this is normal fastq format. But since the reads were probably dumped from SRA without the -F option, the fastq header has been modified to contain the SRR number. Your best option is to re-extract the data from the SRA file with the -F option or try to get the fastq files from ENA.
I don't think the -F option would be an issue, as it only affects the sequence/read name. --split-files, on the other hand, is critical: a typical 10x run has 2 or 3 fastq outputs, all of which are expected by cellranger count with the "right" filename (not read name) conventions.
Do you have a space after the = in --sample= SRR8526547? If so, remove it. That directive is not needed if you have one sample, so you could also try omitting it.
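To see why the stray space matters, here is a small Python sketch that tokenizes a command line the way a POSIX shell would. With the space, the sample name becomes a separate argument and `--sample=` is left with an empty value. The run id `run1` and the helper `find_dangling_options` are hypothetical, and `--transcriptome` is left out for brevity.

```python
import shlex


def find_dangling_options(command):
    """Return long options that end in '=' with no value attached."""
    return [
        token
        for token in shlex.split(command)
        if token.startswith("--") and token.endswith("=")
    ]


bad = "cellranger count --id=run1 --fastqs=. --sample= SRR8526547"
good = "cellranger count --id=run1 --fastqs=. --sample=SRR8526547"

print(shlex.split(bad)[-2:])        # ['--sample=', 'SRR8526547']
print(find_dangling_options(bad))   # ['--sample=']
print(find_dangling_options(good))  # []
```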
As far as I can tell, the pipeline did not start either. One thing you can check is the extension: cellranger count would expect fastq.gz, just like your original files. If that were the problem, though, the software would have complained with an error!
I used fastq-dump --split-files to download the SRR, and it gave me three files (1.2G, 174Mb, and 390Mb). How do I know which file is which lane, or left or right, so that I can rename the files to run cellranger?
If I already downloaded some files without --split-files, can I still use them, or do I have to redownload them?
If this is 10x data, then one of the smaller files (should be renamed R1) will contain the cell barcodes + UMI. The other small file should have the Illumina indexes (should be renamed I1). The final file should have the actual read data (largest, should be renamed R2).
If you post the SRR# I can take a look. Sometimes these files are included in original/additional downloads without a need to figure out what is what.
@genomax, technically, R2 does not need to be the "largest" file. If a longer read length is specified for R1 during the sequencing run, so that it extends past the cell and transcript barcodes and into the transcript, R1 can be equal to or larger than R2.
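Since file size alone can mislead, a slightly more direct check is to look at the read length in each file. Below is a minimal Python sketch that reads the first record of each fastq and guesses the read types from shortest to longest. The function names are hypothetical. The ordering (index read shortest, then barcode+UMI, then the transcript read) is an assumption that breaks in exactly the case described above, where R1 was sequenced as long as or longer than R2, so treat the result as a hint and confirm with head.

```python
import gzip


def first_read_length(path):
    """Length of the first sequence in a fastq file (plain or gzipped)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        handle.readline()                      # header line, starts with '@'
        return len(handle.readline().strip())  # sequence line


def guess_read_types(paths):
    """Map read types to files by first-read length, shortest first.

    Assumption: I1 < R1 < R2 in read length. Not guaranteed for every run.
    """
    by_length = sorted(paths, key=first_read_length)
    if len(by_length) == 3:
        read_types = ["I1", "R1", "R2"]
    elif len(by_length) == 2:
        read_types = ["R1", "R2"]
    else:
        raise ValueError("expected 2 or 3 fastq files for one run")
    return dict(zip(read_types, by_length))
```

For example, `guess_read_types(["a.fastq", "b.fastq", "c.fastq"])` (hypothetical file names) returns a dict such as `{"I1": ..., "R1": ..., "R2": ...}` that can then drive the renaming.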
error: No input FASTQs were found for the requested parameters.
for several hours now. In my case, the file names, the file path, and the command were all fine. What finally solved the issue for me was moving the fastq.gz files into a separate folder that contained only fastq.gz files. The original folder had some other files in it (md5, fastqc output, etc.). I'm not sure why this was a problem for the pipeline, but give this a try if you run into similar trouble.
@Max.Ka. Many thanks for posting this. I had the same issue trying to run cellranger-atac count and was completely stumped. In a folder containing >200 fastq.gz files, there was a rogue .txt file that prevented the program from running. My error message was this:
Completely uninformative, as it only refers to the path and the fastq sample ID; the name of the text file was something completely different! That said, after reading your post, I ended up finding this note on the 10X website, which mentions removing non-fastq files in point 1.
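If moving files around by hand is awkward, the same workaround can be sketched in Python: stage only the fastq files into a clean directory and point --fastqs at that directory instead. The function name `stage_fastqs` is hypothetical. The sketch uses symlinks to avoid copying large files; whether the pipeline follows symlinks in your setup is an assumption, so swap in a move or copy if it does not.

```python
from pathlib import Path

FASTQ_SUFFIXES = (".fastq.gz", ".fastq")


def stage_fastqs(source_dir, clean_dir):
    """Symlink only the fastq files from source_dir into clean_dir.

    Stray files (md5 sums, FastQC output, notes) are left behind.
    Returns the list of links now present for the staged fastq files.
    """
    source, clean = Path(source_dir), Path(clean_dir)
    clean.mkdir(parents=True, exist_ok=True)
    staged = []
    for path in sorted(source.iterdir()):
        if path.is_file() and path.name.endswith(FASTQ_SUFFIXES):
            link = clean / path.name
            if not link.exists():
                link.symlink_to(path.resolve())
            staged.append(link)
    return staged
```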
Thank you for your contributions.
Finally got it to work - the commands below work fine. It turns out it was the space after --sample=, as genomax astutely pointed out. Haci also made a great point about the naming of the samples, which must be strictly adhered to.
For those who might be wondering, fastq or fastq.gz will work just fine. If you are in the working directory, --fastqs=. would also work. (So far, the header of my fastq files has not produced any errors, but I'll keep this updated on the output.)
Dear Haci,
The head of R1 is
The head of R2 is this
Is there a workaround?
@genomax Indeed, I downloaded from ENA using
The header is the same. Do you recommend other ways of downloading so that the header is preserved?
@haci After changing the name to
SRR8526547_S1_L001_R1_001.fastq SRR8526547_S1_L001_R2_001.fastq
this command did not produce any error or output file, just this:
Do you think the issue is the header?
Fair point about R2 not always being the largest file. I based my comment on the file sizes posted by @alan, which seem to fit the normal pattern.