Differences between SMART-seq2, SMART-seq3, and 10x
1
2
3.2 years ago
hamarillo ▴ 80

Hi everyone,

I've recently started analyzing single-cell RNA-seq data (with FASTQ files as a starting point) and so far I have used 10x genomics data from their website.

Now, I'm interested in using data generated by other protocols, specifically SMART-seq, because it is the most used full-length protocol (the two main paradigms are tag-based, like 10x, and full-length). However, I'm having trouble understanding the raw data, and I figured it would be worth discussing the differences between FASTQ files from 10x and SMART-seq. Both methods are sequenced on Illumina sequencers, which, depending on the model, yield a different number of files, but for 10x it's always one set of files. What about SMART-seq? Is that the protocol where there's one set of files for each cell?

To further complicate matters, I understand that full-length protocols like SMART-seq2, unlike tag-based protocols, do not support UMIs, but SMART-seq3 does use UMIs. I also had the idea (I read it in some paper) that when you are sequencing full-length transcripts, having UMIs is really not a factor that changes anything. So how does the analysis change between SMART-seq2 and SMART-seq3 to account for this?

Thank you!

10x UMI smartseq single-cell • 10k views
ADD COMMENT
10
3.2 years ago
dsull ★ 6.9k

Smart-seq data, unlike 10X data, is oftentimes deposited in a demultiplexed format, meaning each cell gets one set of FASTQ files. In 10X data, yes, there's just one set of FASTQ files but somewhere within the FASTQ files is a barcode sequence that can help you resolve each individual cell.
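To make the "barcode somewhere within the FASTQ files" point concrete, here is a minimal sketch of pulling the cell barcode and UMI out of a 10x Read 1 sequence. It assumes the 10x 3' v3 layout (16 bp cell barcode followed by a 12 bp UMI); other chemistries use different lengths, so treat `BARCODE_LEN` and `UMI_LEN` as assumptions to check against your kit. The function names and the example reads are made up for illustration.

```python
from collections import Counter

# Assumed layout (10x 3' v3 chemistry): Read 1 = 16 bp cell barcode + 12 bp UMI.
# Other chemistries differ, so adjust these values for your data.
BARCODE_LEN = 16
UMI_LEN = 12


def split_read1(seq, barcode_len=BARCODE_LEN, umi_len=UMI_LEN):
    """Return (cell_barcode, umi) taken from the start of a Read 1 sequence."""
    if len(seq) < barcode_len + umi_len:
        raise ValueError("read too short to contain barcode + UMI")
    return seq[:barcode_len], seq[barcode_len:barcode_len + umi_len]


def reads_per_cell(read1_seqs):
    """Count how many reads carry each cell barcode."""
    return Counter(split_read1(seq)[0] for seq in read1_seqs)


# Hypothetical Read 1 sequences: two reads from one cell, one from another.
reads = [
    "AAACCCAAGAAACACT" + "ACGTACGTACGT",
    "AAACCCAAGAAACACT" + "TTTTGGGGCCCC",
    "TTTGTTGGTCCGAAAG" + "ACGTACGTACGT",
]

for barcode, n in reads_per_cell(reads).items():
    print(barcode, n)
```

With demultiplexed Smart-seq2 data this step is unnecessary, because the file name already tells you which cell a read belongs to.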

The advantage of Smart-seq is that you get better coverage across transcripts (for 10X, you're only sequencing the 3' end, which can make isoform resolution analysis difficult in many cases). Also, Smart-seq experiments typically sequence fewer cells, so each cell can get higher sequencing depth (i.e. more reads per cell). The advantage of 10X is, as you noted, the UMIs.

Smart-seq3 is a newer version of Smart-seq that indeed uses UMIs. Basically, you're going to have one set of FASTQ files (there's no demultiplexing) but you can use barcodes to resolve individual cells and some of the reads will contain UMIs and other reads will not contain UMIs. The non-UMI containing reads give you better coverage across transcripts (at the expense of not having UMIs).
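As a rough sketch of how UMI-containing reads could be told apart from the rest: to my knowledge, Smart-seq3 UMI reads begin with a fixed tag sequence followed by the UMI, while internal reads lack the tag. The tag `ATTGCGCAATG` and the 8 bp UMI length used below are assumptions that you should verify against the protocol or your pipeline's configuration; `classify_read` is a hypothetical helper, not part of any tool.

```python
# Assumed Smart-seq3 read layout: an 11 bp tag followed by an 8 bp UMI.
# Verify both values against the protocol before relying on them.
SS3_TAG = "ATTGCGCAATG"
UMI_LEN = 8


def classify_read(seq, tag=SS3_TAG, umi_len=UMI_LEN):
    """Return ("umi", umi, rest) for tagged reads, else ("internal", None, seq).

    `rest` is everything after the UMI; any linker bases that follow the UMI
    are left untouched in this sketch.
    """
    if seq.startswith(tag) and len(seq) >= len(tag) + umi_len:
        umi_start = len(tag)
        umi = seq[umi_start:umi_start + umi_len]
        rest = seq[umi_start + umi_len:]
        return "umi", umi, rest
    return "internal", None, seq


# Hypothetical reads: one carrying the tag, one without it.
print(classify_read("ATTGCGCAATG" + "ACGTACGT" + "GGGTTTTCCCC"))
print(classify_read("CCCCAAAATTTTGGGGCCCCAAAA"))
```

This only does an exact prefix match; real pipelines generally tolerate sequencing errors in the tag.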

"when you are sequencing full-length transcripts having UMIs is really not a factor that changes anything"

This is not really true. The purpose of UMIs is to account for amplification bias. In bulk RNA-seq, amplification bias is not really present, but in single-cell RNA-seq it is something to be concerned about because you have lower amounts of starting material (which requires many rounds of PCR, and that's where amplification bias comes in).

In any case, for Smart-seq3, you're going to have your UMI-containing reads (where you can collapse UMIs and analyze like you would 10X data) and you're going to have your non-UMI-containing reads (where you can't collapse UMIs and instead have to proceed with your analysis starting from raw read counts). Again, both types of reads give you different types of information (one has better length coverage and one accounts for amplification bias better).
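The difference between the two counting strategies can be sketched in a few lines. This assumes reads have already been assigned to a cell and a gene; the record layout `(cell, gene, umi)` and the function names are made up for illustration, and real tools also correct for sequencing errors in UMIs, which is skipped here.

```python
from collections import defaultdict


def count_reads(records):
    """Raw read counts per (cell, gene): every read counts, duplicates included."""
    counts = defaultdict(int)
    for cell, gene, _umi in records:
        counts[(cell, gene)] += 1
    return dict(counts)


def count_umis(records):
    """UMI counts per (cell, gene): reads sharing a UMI are collapsed to one."""
    seen = defaultdict(set)
    for cell, gene, umi in records:
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


# Hypothetical data: GeneA has one molecule amplified three times plus one
# other molecule; GeneB has a single read.
records = [
    ("cell1", "GeneA", "AAAA"),
    ("cell1", "GeneA", "AAAA"),
    ("cell1", "GeneA", "AAAA"),
    ("cell1", "GeneA", "CCCC"),
    ("cell1", "GeneB", "GGGG"),
]

print(count_reads(records))  # GeneA counted 4 times
print(count_umis(records))   # GeneA counted 2 times
```

For reads without UMIs only `count_reads` is available, which is why amplification bias stays in those counts.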

ADD COMMENT
0

Thank you, that was a great answer. So before Smart-seq3 the data was inflated, since without UMIs there was no way to correct for the PCR duplicates?

ADD REPLY
1

Correct, smart-seq2 doesn't have UMIs so there was no way to correct for PCR bias. This was why smart-seq3 was developed.

As for how big of a difference PCR bias makes, that's a whole other discussion entirely. All RNAseq library preps introduce many sources of technical biases (PCR, length, coverage, capture bias, sequence-specific biases, etc.) and how these various biases affect downstream analyses is an entire field of research on its own!

ADD REPLY
0

There is actually a way to correct for PCR duplicates: the same way it's done with WGS and any other method. Basically, each "stack" of reads aligned to the same position could be collapsed and counted as 1 read. However, this is probably over-correcting things, since strongly expressed genes would have their expression lowered artificially (there's a high chance that two identical reads actually came from different cDNA fragments).
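A toy example of position-based deduplication and of the over-correction it can cause. The read tuples carry a made-up molecule ID only so the true answer is known; in real data that information is not available, which is the whole problem.

```python
def count_after_position_dedup(reads):
    """Collapse reads that share (chrom, start, strand) and count what is left."""
    return len({(chrom, start, strand) for _mol, chrom, start, strand in reads})


def count_true_molecules(reads):
    """Ground truth for the toy data: number of distinct source molecules."""
    return len({mol for mol, _chrom, _start, _strand in reads})


# Hypothetical alignments as (molecule_id, chrom, start, strand).
reads = [
    ("mol1", "chr1", 1000, "+"),
    ("mol1", "chr1", 1000, "+"),  # PCR duplicate of mol1
    ("mol2", "chr1", 1000, "+"),  # different molecule, same start by chance
    ("mol3", "chr1", 1042, "+"),
    ("mol4", "chr1", 1100, "+"),
]

print("raw reads:", len(reads))
print("after position dedup:", count_after_position_dedup(reads))
print("true molecules:", count_true_molecules(reads))
```

Here the raw count (5) overestimates the 4 molecules, while position-based collapsing (3) underestimates them because `mol2` is merged with `mol1`. The more highly expressed the gene, the more often such coincidences happen.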

ADD REPLY
0

Many methods do use that approach, but they're ineffective: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4933-1

There are a few considerations, i.e. do the reads that originate from different positions actually come from different molecules? This won't be the case when the Tn5 step happens post-PCR (and you generally need more PCR cycles in single-cell methods).

ADD REPLY
0

Well, I wouldn't say they are "ineffective"; it's just overkill. The link you've shared is for bulk/miRNA; many earlier papers showed that PCR deduplication for bulk RNA-seq reduces the power to discover differentially expressed genes and should not be used. In general, I think the consensus in the field was that you don't really need UMIs for bulk, unless it's a really low-input protocol. For single cell (and especially for single-end reads, like most 10x) UMIs become a lot more crucial. And if you can't collapse reads based on UMIs, simple deduplication should still be better than nothing.

"There are a few considerations, i.e. do the reads that originate from different positions actually come from different molecules? This won't be the case when the Tn5 step happens post-PCR (and you generally need more PCR cycles in single-cell methods)."

I might be wrong, but I think current methods (e.g. Cell Ranger, STARsolo, etc) would not collapse reads that map to different positions while having the same UMI.

Also, are there methods that do PCR first and then transposase?

ADD REPLY
0

I've been working with others on a new single-cell experimental protocol in the lab; in the first iteration, we didn't ligate on UMIs. Indeed, our results looked much better when I collapsed reads (and looked more consistent with the subsequent refinement when we did a UMI ligation reaction). So yes, I agree it's better, especially for low-complexity libraries.

For current read mapping software, they WILL collapse reads that have the same UMI but with different positions. This is because current methods only really care about uniquely-mapped reads (so if two identical UMIs map to different positions within the same gene, they will be collapsed since there's a very low probability that they represent two distinct molecules). (Note: For many tools, these are settings that can be adjusted by the user). Again, I'm referring to different positions "within the same gene" (if it's different genes, well, of course you wouldn't collapse).
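The two collapsing behaviours discussed here can be contrasted with a small sketch. This is not how Cell Ranger or STARsolo are implemented; `collapse` and its `use_position` switch are hypothetical and only illustrate the difference between keying on (cell, gene, UMI) and additionally keying on mapping position.

```python
def collapse(reads, use_position):
    """Count molecules after collapsing reads that share a key.

    Key is (cell, gene, umi), plus the mapping position if use_position is True.
    """
    keys = set()
    for cell, gene, umi, pos in reads:
        if use_position:
            keys.add((cell, gene, umi, pos))
        else:
            keys.add((cell, gene, umi))
    return len(keys)


# Hypothetical reads as (cell, gene, umi, position).
reads = [
    ("cell1", "GeneA", "ACGTACGT", 1000),
    ("cell1", "GeneA", "ACGTACGT", 1350),  # same UMI, other position in GeneA
    ("cell1", "GeneA", "TTTTCCCC", 1000),
    ("cell1", "GeneB", "ACGTACGT", 5000),  # same UMI, different gene
]

print("gene-level collapse:", collapse(reads, use_position=False))
print("position-aware collapse:", collapse(reads, use_position=True))
```

Gene-level collapsing treats the two `ACGTACGT` reads in GeneA as one molecule (3 in total), while the position-aware variant keeps them apart (4 in total). In both cases the GeneB read stays separate, since the same UMI in a different gene is never merged.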

The smart-seq methods do PCR then fragment; I assume you mean the opposite: fragment (immediately after RT) then PCR? Yes, many protocols (especially for bulk) do that [however, for tagmenting, the advantage of Tn5 post-PCR is being able to add adapter at the same step]. If you're using UMIs, just make sure you ligate on the UMIs before PCR (in some protocols like 10x, the UMIs will identify individual RNA molecules; in other protocols, the UMIs will tag individual cDNA fragments).

ADD REPLY
0

I can assure you there is also length bias with PCR on bulk RNA samples. RNA lengths span a very broad range, and the small RNAs will always outcompete the large ones, even with as few cycles as you can get away with. I have seen this with as few as 5 cycles, and that was with a pretty high cDNA input.

ADD REPLY
