How does one analyze data from the SMART-Seq stranded kit (v4)?
21 months ago
Assa Yeroslaviz ★ 1.9k

Hi,

Can someone please point me to a tutorial or workflow for analyzing stranded single-cell RNA-seq data generated with SMART-Seq?

I am not sure how to map the data or how to do the downstream analysis.

Thanks in advance,

Assa

single-cell smart-seq • 4.3k views
ADD COMMENT
21 months ago
ATpoint 85k

Smart-seq is a full-length but non-UMI method, so preprocessing is no different from regular RNA-seq. Use tools of your choice, for example salmon or STAR+featureCounts, to create your count matrix. The one thing to know is that each single cell usually gets its own FASTQ file, so unlike 10x Chromium there is no need to demultiplex cells explicitly based on cellular barcode sequences. Once you have the count matrix, you can use the OSCA workflow from Bioconductor or the Seurat vignettes, both in R, or Scanpy in Python to get started with the analysis.

ADD COMMENT

Does one need to use "standard" STAR, or should I use STARsolo?

I was also wondering: as I have only around 100 cells in the analysis, can I use DESeq for the downstream analysis?

ADD REPLY

You can use STARsolo -- it has an option to supply a file (similar to kallisto's batch.txt file below) where you include your cell IDs and the FASTQ file paths.

Nothing wrong with using STAR either -- STARsolo is probably more convenient to use than STAR for hundreds of cells.

You can use DESeq. For scRNA-seq, I generally see more classic statistical methods (like t-test, Mann-Whitney, etc.) used after people load their data into scanpy/Seurat. But DESeq works.
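
A minimal sketch of how the file of cell IDs and FASTQ paths mentioned above could be generated. It assumes a hypothetical naming scheme of `<cell>_R1.fastq.gz` / `<cell>_R2.fastq.gz`, and it assumes the three-column, tab-separated layout (read 1, read 2, cell ID) that the STARsolo documentation describes for `--readFilesManifest`; check the manual for your STAR version before relying on it.

```shell
# make_manifest DIR: print one tab-separated line per cell:
#   <read 1 path> <TAB> <read 2 path> <TAB> <cell ID>
# The _R1/_R2 file naming is an assumption for illustration only.
make_manifest() {
    dir=$1
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue                 # glob matched nothing
        r2=${r1%_R1.fastq.gz}_R2.fastq.gz        # expected mate file
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1" >&2
            return 1
        fi
        cell=$(basename "$r1" _R1.fastq.gz)      # cell ID taken from the file name
        printf '%s\t%s\t%s\n' "$r1" "$r2" "$cell"
    done
}
```

Something like `make_manifest fastq/ > manifest.tsv` would then be passed to STAR with `--readFilesManifest manifest.tsv` in place of `--readFilesIn`, so that all cells go through a single run.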

ADD REPLY

I would use normal STAR because, as I understand it, solo is a CellRanger-like reimplementation, and you do not need its barcode and UMI correction features here. Just run standard STAR and then featureCounts. A little Nextflow wrapper can help with parallelizing, or GNU parallel or a Makefile, as you like. I prefer limma when analyzing single-cell data as it scales better with many cells. Pseudobulks are preferred if you have biological replicates; otherwise either limma or the Wilcoxon test scored well in recent benchmarks (Soneson et al., Nature something; I forgot the exact journal).
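
As a rough illustration of the per-cell parallelization mentioned above, the sketch below only prints one STAR command per cell, so the job list can be inspected before anything runs. The `_R1`/`_R2` file naming, the thread count, and the directory arguments are assumptions for illustration; the STAR options shown are standard ones, but adjust them to your own setup.

```shell
# star_jobs FASTQ_DIR GENOME_DIR OUT_DIR: print one STAR command line per cell.
# Nothing is executed here; the output is meant to be piped to a job runner.
# Paths containing spaces are not handled in this sketch.
star_jobs() {
    fq_dir=$1
    genome=$2
    out=$3
    for r1 in "$fq_dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue                 # glob matched nothing
        r2=${r1%_R1.fastq.gz}_R2.fastq.gz
        cell=$(basename "$r1" _R1.fastq.gz)
        printf 'STAR --runThreadN 4 --genomeDir %s --readFilesCommand zcat --readFilesIn %s %s --outSAMtype BAM Unsorted --outFileNamePrefix %s/%s.\n' \
            "$genome" "$r1" "$r2" "$out" "$cell"
    done
}
```

The printed lines can be fed to GNU parallel, for example `star_jobs fastq/ star_index/ bam/ | parallel -j 5`, followed by a single featureCounts call over the resulting BAM files.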

ADD REPLY

The advantage of using STARsolo is that it can output multiple matrices (e.g. two matrices: spliced/unspliced) suitable for additional downstream scRNA-seq analysis (--soloFeatures Gene GeneFull SJ Velocyto). There are STARsolo options that can disable the barcode handling, UMI handling, etc., and there's a workflow designed specifically for Smart-seq data.

Kallisto | bustools (w/ kb_python) can also currently do those additional workflows for smart-seq data (but fyi the devel [non-stable] version, which is fully functional right now, greatly improves upon the current stable release in terms of speed/memory/accuracy). The advantage of kallisto | bustools over STAR+featureCounts is if you want isoform-level resolution (and proper handling of multi-mapping) as was done in one of the BICCN Nature papers (for this case, STAR+RSEM also works as does salmon).

ADD REPLY

Good arguments towards solo, I did not know!

ADD REPLY

I guess you mean this one here

Thanks, I'll have a look and try both options for the sake of completeness.

ADD REPLY

I have tried to run it with --soloType SmartSeq, but it keeps crashing with a segmentation fault.

I have increased the number of threads to 20 and the ulimit to 5000, but it still crashes as soon as mapping starts.

STAR --runThreadN 20 --genomeDir $starIndex --sjdbGTFfile $gtf --sjdbOverhang 100 \
         --readFilesCommand zcat --readFilesIn $rawdata/$base.trimmed.R1.fastq.gz $rawdata/$base.trimmed.R1.fastq.gz \
         --soloType SmartSeq\
         --soloUMIdedup Exact --soloStrand Forward \
         --soloFeatures Gene GeneFull SJ \
         --outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM \
         --outFileNamePrefix $bamFiles/$base. --soloOutFileNames $base \
         --limitBAMsortRAM 168632718037 --readMapNumber 1000 \
         --outSAMtype BAM SortedByCoordinate

Any ideas why this happens?

Mar 03 12:21:12 ..... started STAR run
Mar 03 12:21:13 ..... loading genome
Mar 03 12:25:04 ..... processing annotations GTF
Mar 03 12:25:26 ..... inserting junctions into the genome indices
Mar 03 12:27:25 ..... started mapping
Segmentation fault (core dumped)

thanks

ADD REPLY

How much memory do you have available?

ADD REPLY

Thanks, I was able to run it. It was a memory problem, but I don't really understand why, as the server supposedly has 1 TB of RAM:

$ grep MemTotal /proc/meminfo
MemTotal:       1056797276 kB

It might be the number of open files the server can handle at one point, but I honestly can't imagine I have more than 5000 files open when STAR is running.
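
One thing worth checking here: `MemTotal` is the installed memory, not what is free when the job starts, so on a shared server `MemAvailable` and the open-file limit of the current shell are the more relevant numbers. The helper names below are made up for illustration, and the sketch assumes a Linux system with `/proc/meminfo`; it does not claim to identify the actual cause of the crash.

```shell
# meminfo_gib KEY [FILE]: print a /proc/meminfo field (reported in kB) in GiB.
# FILE defaults to /proc/meminfo; it is a parameter only to make testing easy.
meminfo_gib() {
    awk -v key="$1:" '$1 == key { printf "%.1f\n", $2 / 1024 / 1024 }' "${2:-/proc/meminfo}"
}

# report_limits: compare installed vs. currently available memory and show
# the per-process open-file limit of the shell that will launch STAR.
report_limits() {
    echo "MemTotal (GiB):         $(meminfo_gib MemTotal)"
    echo "MemAvailable (GiB):     $(meminfo_gib MemAvailable)"
    echo "open files (ulimit -n): $(ulimit -n)"
}
```

Running `report_limits` in the same shell, or inside the same batch job, that launches STAR shows the limits the process will actually inherit.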

ADD REPLY

OK, I got STARsolo to run. Do I understand correctly that the output is one set of files (BAM, counts, mtx, ...) containing all the cells, i.e. all FASTQ files are processed in one go, and I don't have to loop over the FASTQ files one after the other?

Thanks

ADD REPLY

Correct -- just make sure all those files were included (e.g. if you have 1000 pairs of FASTQ files, i.e. 1000 cells, make sure the cell dimension of your mtx has 1000 entries).
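
A quick way to do that check from the command line, sketched under the assumption that the matrix is an uncompressed Matrix Market file (first non-comment line: rows, columns, non-zero entries) and that the cell list is a manifest with one cell per line. Which dimension holds the cells depends on the tool, so both are printed; the function names are made up for illustration.

```shell
# mtx_dims FILE: print "rows cols" from the size line of a Matrix Market file,
# i.e. the first line that does not start with '%'.
mtx_dims() {
    awk '!/^%/ { print $1, $2; exit }' "$1"
}

# count_cells MANIFEST: number of non-empty lines (one per cell) in the manifest.
count_cells() {
    awk 'NF > 0 { n++ } END { print n + 0 }' "$1"
}
```

If `count_cells manifest.tsv` matches neither of the two numbers from `mtx_dims matrix.mtx`, some cells were dropped or never read.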

ADD REPLY

Another short question, if it is OK to ask here. When running fastp to remove adapters and do QC, I also get some unpaired reads, which I saved in single-end FASTQ files for each of these pairs. I would like to map these with STARsolo as well (as a separate run for single-end samples). Is it possible to add these counts afterwards to the counts from the paired-end run? Does it even make sense to include these reads?

Thanks

ADD REPLY

I'm sure there's a way to map those reads and add those counts into your existing counts, but that would be an unnecessarily complicated (and likely error-prone) workflow. Just discard those reads or just run STARsolo on the entire thing without fastp trimming.

ADD REPLY
21 months ago
dsull ★ 6.9k

Three lines of code using kallisto | bustools:

pip install kb_python
kb ref -i index.idx -g t2g.txt -f1 cdna.fa genome_fasta.fa genome_annotations.gtf
kb count -x bulk -i index.idx -g t2g.txt -o output_dir/ --strand forward --parity paired --tcc batch.txt

Where batch.txt is a file that looks something like the following:

cell1 cell1_r1.fastq.gz cell1_r2.fastq.gz
cell2 cell2_r1.fastq.gz cell2_r2.fastq.gz
cell3 cell3_r1.fastq.gz cell3_r2.fastq.gz

You supply genome_fasta.fa and genome_annotations.gtf yourself (both can be downloaded for your species of interest).

You'll get a set of matrix files that you can then import into a standard downstream workflow (e.g. scanpy).
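
Since a typo in batch.txt can silently drop or merge cells, a small sanity check before running `kb count` may help. This is only a sketch with a made-up function name; it assumes the whitespace-separated three-column layout shown above (cell ID, read 1, read 2) and that the paths resolve from the current directory.

```shell
# check_batch FILE: verify that every non-empty line has 3 fields, that cell IDs
# are unique, and that all listed FASTQ files exist. Returns non-zero on problems.
check_batch() {
    batch=$1
    awk 'NF == 0 { next }
         NF != 3 { printf "line %d: expected 3 fields, got %d\n", NR, NF; bad = 1 }
         seen[$1]++ { printf "line %d: duplicate cell ID %s\n", NR, $1; bad = 1 }
         END { exit bad + 0 }' "$batch" >&2 || return 1
    status=0
    while read -r cell r1 r2; do
        [ -n "$cell" ] || continue               # skip empty lines
        for f in "$r1" "$r2"; do
            if [ ! -e "$f" ]; then
                echo "missing file: $f" >&2
                status=1
            fi
        done
    done < "$batch"
    return $status
}
```

For example, `check_batch batch.txt && kb count ...` only starts the quantification if the batch file passes.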

ADD COMMENT

Like kallisto, STAR and salmon will also work for Smart-seq data, depending on which tool you prefer (I don't know the exact commands off the top of my head).

ADD REPLY

Thanks. I'll look into kb_python.

ADD REPLY
