Hi,
Can someone please point me to a tutorial or workflow to analyze SMART-Seq data from stranded single-cell RNA?
As I am not sure how to map the data or do the downstream analysis.
thanks in advance
Assa
Smart-seq is a full-length but non-UMI method so there is nothing special when dealing with these data in terms of preprocessing compared to regular RNA-seq. Use tools of your choice, for example salmon or STAR+featureCounts to create your count matrix. The only "thing" that one has to know is that usually each single cell gets its own fastq file so unlike 10x Chromium there is no need for explicit demultiplexing of cells based on any cellular barcode sequences. Once you have the count matrix you can use the OSCA workflow from Bioconductor or Seurat vignettes, both in R, or ScanPy in Python to get started with analysis.
Three lines of code using kallisto | bustools:
pip install kb_python
kb ref -i index.idx -g t2g.txt -f1 cdna.fa genome_fasta.fa genome_annotations.gtf
kb count -x bulk -i index.idx -g t2g.txt -o output_dir/ --strand forward --parity paired --tcc batch.txt
Where batch.txt is a file that looks something like the following:
cell1 cell1_r1.fastq.gz cell1_r2.fastq.gz
cell2 cell2_r1.fastq.gz cell2_r2.fastq.gz
cell3 cell3_r1.fastq.gz cell3_r2.fastq.gz
And where you supply genome_fasta.fa and genome_annotations.gtf (which you can download online for your specific species of interest)
You'll get a set of matrix files that you can then import into a standard downstream workflow (e.g. scanpy).
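With hundreds of cells you probably don't want to type batch.txt by hand. Here is a minimal sketch for generating it, assuming the files follow the <cell>_r1.fastq.gz / <cell>_r2.fastq.gz naming from the example above (the pattern and the function name build_batch_lines are mine, adjust them to your own file names):

```python
import re

# Assumed naming convention, taken from the example above:
# <cell>_r1.fastq.gz and <cell>_r2.fastq.gz
PAIR_PATTERN = re.compile(r"^(?P<cell>.+)_r(?P<mate>[12])\.fastq\.gz$")


def build_batch_lines(filenames):
    """Return one 'cell r1 r2' line per cell that has both mates."""
    mates = {}
    for name in filenames:
        match = PAIR_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        mates.setdefault(match["cell"], {})[match["mate"]] = name

    lines = []
    for cell in sorted(mates):
        pair = mates[cell]
        if "1" in pair and "2" in pair:  # keep complete pairs only
            lines.append(f"{cell} {pair['1']} {pair['2']}")
    return lines


fastqs = [
    "cell2_r2.fastq.gz", "cell1_r1.fastq.gz", "cell1_r2.fastq.gz",
    "cell2_r1.fastq.gz", "cell3_r1.fastq.gz",  # cell3 lacks its second mate
]
batch_lines = build_batch_lines(fastqs)
print("\n".join(batch_lines))
```

In practice you would feed it os.listdir() of your FASTQ directory and write the lines to batch.txt. Cells with a missing mate are silently dropped here, so compare the number of lines with the number of cells you expect.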
Does one need to use the "standard" STAR tool, or should I use STARsolo?
I was also wondering, as I have only around 100 cells in the analysis, can I use DESeq for downstream analysis?
You can use STARsolo -- it has an option to supply a file (similar to kallisto's batch.txt file below) where you include your cell IDs and the FASTQ file paths.
Nothing wrong with using STAR either -- STARsolo is probably more convenient to use than STAR for hundreds of cells.
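For the STARsolo file, a rough sketch of how one could write it. As far as I recall from the STAR manual, the manifest passed via --readFilesManifest is tab-separated with read 1, read 2 (or "-" for single-end) and the cell ID as the third column; please check this against the manual for your STAR version. The helper name manifest_line and the file names are only for illustration:

```python
def manifest_line(cell_id, read1, read2=None):
    """Format one manifest row: read1 <TAB> read2 (or '-') <TAB> cell ID."""
    for field in (cell_id, read1, read2 or ""):
        if "\t" in field:
            raise ValueError(f"tab character not allowed in field: {field!r}")
    return "\t".join([read1, read2 if read2 else "-", cell_id])


# Hypothetical cells and file names, mirroring the kallisto example.
cells = [
    ("cell1", "cell1_r1.fastq.gz", "cell1_r2.fastq.gz"),
    ("cell2", "cell2_r1.fastq.gz", "cell2_r2.fastq.gz"),
]
manifest = "\n".join(manifest_line(*cell) for cell in cells) + "\n"
print(manifest, end="")

# After saving the text as manifest.tsv, it would be passed to STAR
# roughly as (all other options omitted):
#   STAR --soloType SmartSeq --readFilesManifest manifest.tsv ...
```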
You can use DESeq. For scRNA-seq, I generally see more classic statistical methods (like t-test, Mann-Whitney, etc.) used after people load their data into scanpy/Seurat. But DESeq works.
I would use normal STAR, as in my understanding solo is a CellRanger-like reimplementation, and you do not need its barcode and UMI correction features here. Just run standard STAR and then featureCounts. A little Nextflow wrapper can help with parallelizing, or GNU parallel or a Makefile, as you like. I prefer limma when analyzing single-cell data as it scales better with many cells. Pseudobulks are preferred if you have biological replicates, else either limma or the Wilcoxon test scored well in recent benchmarks (Soneson et al., Nature Something (forgot the exact journal)).
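In case "pseudobulk" is unclear: it just means summing the counts of all cells that belong to the same biological replicate (usually per cell type), and then running bulk tools such as limma or DESeq on the summed profiles. A toy sketch with made-up gene names, cell names and replicate labels:

```python
from collections import defaultdict


def pseudobulk(cell_counts, cell_to_sample):
    """Sum per-cell gene counts into one count profile per sample."""
    totals = defaultdict(lambda: defaultdict(int))
    for cell, counts in cell_counts.items():
        sample = cell_to_sample[cell]
        for gene, count in counts.items():
            totals[sample][gene] += count
    return {sample: dict(genes) for sample, genes in totals.items()}


# Hypothetical toy data: three cells from two replicates.
cell_counts = {
    "cell1": {"GeneA": 3, "GeneB": 0},
    "cell2": {"GeneA": 1, "GeneB": 4},
    "cell3": {"GeneA": 2, "GeneB": 5},
}
cell_to_sample = {"cell1": "rep1", "cell2": "rep1", "cell3": "rep2"}

bulk = pseudobulk(cell_counts, cell_to_sample)
print(bulk)
```

With a real count matrix you would do the same thing with a group-by on the sample column rather than nested dictionaries.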
The advantage of using STARsolo is that it can output multiple matrices (e.g. two matrices: spliced/unspliced) suitable for additional downstream scRNA-seq analysis (--soloFeatures Gene GeneFull SJ Velocyto). There are STARsolo options that can disable the barcode stuff, UMI stuff, etc., and there's a workflow designed specifically for Smart-seq data. Kallisto | bustools (w/ kb_python) can also currently do those additional workflows for Smart-seq data (but FYI the devel [non-stable] version, which is fully functional right now, greatly improves upon the current stable release in terms of speed/memory/accuracy). The advantage of kallisto | bustools over STAR+featureCounts is if you want isoform-level resolution (and proper handling of multi-mapping), as was done in one of the BICCN Nature papers (for this case, STAR+RSEM also works, as does salmon).
Good arguments towards solo, I did not know!
I guess you mean this one here
thanks I'll have a look and try both options for the sake of completeness
I have tried to run it with --soloType SmartSeq, but it keeps crashing with a segmentation fault message. I have increased the number of threads to 20 and the ulimit to 5000, but it still keeps crashing when it starts mapping. Any ideas why this happens?
thanks
How much memory do you have available?
Thanks, I was able to run it. It was a memory problem, but I don't really understand why, as the server supposedly has 1 TB of RAM.
It might be the number of open files the server can handle at one point, but I honestly can't imagine I have more than 5000 files open when STAR is running.
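To rule the open-files limit in or out, one can print the limit that actually applies to a process started from the same shell. A small sketch using Python's resource module (Unix only); the 5000 is just the number from this thread, and open_file_headroom is a made-up helper name:

```python
import resource  # Unix only


def open_file_headroom(files_needed):
    """Return the soft open-file limit minus the files the job expects to open."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return float("inf")
    return soft - files_needed


soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft_limit}, hard={hard_limit}")
print(f"headroom for 5000 files: {open_file_headroom(5000)}")
```

A negative headroom would mean the limit set with ulimit did not reach the process, or is lower than what the job needs.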
ok, I got STARsolo to run. Do I understand it correctly that the output is one big file (bam, counts, mtx,...) which contains all the cells, so all fastq files in one go? I don't have to loop over all the fastq files one after the other.
thanks
Correct -- just make sure all those files were included (e.g. if you have 1000 paired-end files, make sure you have that many dimensions in your mtx).
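A quick way to do that check without loading the whole matrix is to read the size line of the .mtx file. This is a sketch for an uncompressed Matrix Market file; which axis holds the cells depends on the tool, so it accepts a match on either one. The function names and the toy matrix are mine:

```python
import os
import tempfile


def read_mtx_shape(path):
    """Return (rows, columns, nonzero entries) from a Matrix Market file."""
    with open(path) as handle:
        header = handle.readline()
        if not header.startswith("%%MatrixMarket"):
            raise ValueError("not a Matrix Market file")
        for line in handle:
            if line.startswith("%") or not line.strip():
                continue  # skip comment and blank lines
            rows, cols, nnz = (int(x) for x in line.split())
            return rows, cols, nnz
    raise ValueError("size line not found")


def has_expected_cells(path, n_cells):
    """True if either matrix axis matches the expected number of cells."""
    rows, cols, _ = read_mtx_shape(path)
    return n_cells in (rows, cols)


# Toy matrix: 4 x 3 with 2 nonzero entries, written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "matrix.mtx")
    with open(demo, "w") as handle:
        handle.write("%%MatrixMarket matrix coordinate integer general\n%\n")
        handle.write("4 3 2\n1 1 5\n3 2 7\n")
    shape = read_mtx_shape(demo)
    ok = has_expected_cells(demo, 3)

print(shape, ok)
```

For paired-end data the expected number is the number of cells (one per FASTQ pair), not the number of FASTQ files.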
Another short question, if it is OK here. When running fastp to remove adapters and do QC, I also get some unpaired reads, which I saved in single-end fastq files for each of these pairs. I would also like to map these with STARsolo (as a separate run for single-end samples). Is it possible to add these counts afterwards to the counts from the paired-end run? Does it even make sense to consider these samples as well? Thanks
I'm sure there's a way to map those reads and add those counts into your existing counts, but that would be an unnecessarily complicated (and likely error-prone) workflow. Just discard those reads or just run STARsolo on the entire thing without fastp trimming.