Hi everyone,
I've recently started analyzing single-cell RNA-seq data (with FASTQ files as a starting point) and so far I have used 10x Genomics data from their website.
Now I'm interested in using data generated by other protocols, specifically SMART-seq, because it is the most widely used full-length protocol (the two main paradigms being tag-based, like 10x, and full-length). However, I'm having trouble understanding the raw data, so I figured it would be worth discussing the differences between FASTQ files from 10x and SMART-seq. Both are sequenced on Illumina sequencers, which, depending on the model, yield a different number of files, but with 10x it's always one set of files. What about SMART-seq? Is that the protocol where there's one set of files for each cell?
To further complicate matters, I understand that full-length protocols (e.g. SMART-seq2), unlike tag-based protocols, do not support UMIs, but SMART-seq3 does use UMIs. I also had the idea (I read it in some paper) that when you are sequencing full-length transcripts, having UMIs doesn't really change anything. So how does the analysis change between SMART-seq2 and SMART-seq3 to account for this?
Thank you!
Thank you, that was a great answer. So before smart-seq3 the data was inflated, since without UMIs there was no way to correct for PCR duplicates?
Correct, smart-seq2 doesn't have UMIs so there was no way to correct for PCR bias. This was why smart-seq3 was developed.
As for how big of a difference PCR bias makes, that's a whole other discussion entirely. All RNAseq library preps introduce many sources of technical biases (PCR, length, coverage, capture bias, sequence-specific biases, etc.) and how these various biases affect downstream analyses is an entire field of research on its own!
There is actually a way to correct for PCR duplicates without UMIs, the same way it's done with WGS and any other method. Basically, each "stack" of reads aligned to the same position can be collapsed and counted as 1 read. However, this is probably over-correcting, since strongly expressed genes would have their expression lowered artificially (there's a high chance that two identical reads actually came from different cDNA fragments).
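To make the "stack" idea concrete, here is a minimal sketch of position-based collapsing. It assumes the alignments have already been reduced to simple (gene, chromosome, start, strand) tuples; the function name `count_reads` and the toy records are made up for illustration and are not taken from any particular tool.

```python
from collections import Counter

def count_reads(alignments, collapse=True):
    """Count reads per gene from simplified alignment records.

    alignments: iterable of (gene, chrom, start, strand) tuples
                (a hypothetical, simplified stand-in for real BAM records).
    collapse:   if True, reads sharing the same gene, chromosome, start and
                strand are treated as one "stack" and counted once.
    """
    if collapse:
        # A set keeps one representative per identical record.
        alignments = set(alignments)
    return Counter(gene for gene, *_ in alignments)

# Toy data: three reads stacked at the same position in GeneA,
# one GeneA read elsewhere, and one GeneB read.
reads = [
    ("GeneA", "chr1", 100, "+"),
    ("GeneA", "chr1", 100, "+"),
    ("GeneA", "chr1", 100, "+"),
    ("GeneA", "chr1", 250, "+"),
    ("GeneB", "chr2", 500, "-"),
]

raw = count_reads(reads, collapse=False)
collapsed = count_reads(reads, collapse=True)
print(dict(raw))        # GeneA counted 4 times, GeneB once
print(dict(collapsed))  # GeneA counted 2 times, GeneB once
```

Note how the stack of three at position 100 becomes a single count whether it came from one amplified fragment or three independent ones, which is exactly the over-correction risk for highly expressed genes.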
Many methods do use that approach, but they're ineffective: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4933-1
There are a few considerations, e.g. do the reads that originate from different positions actually come from different molecules? This won't be the case when the Tn5 step happens post-PCR (and you generally need more PCR cycles in single-cell methods).
Well, I wouldn't say they are "ineffective" - it's just overkill. The link you've shared is for bulk/miRNA; there were many earlier papers showing that PCR deduplication for bulk RNA-seq reduces the power to discover differentially expressed genes and should not be used. In general, I think the consensus in the field was that you don't really need UMIs for bulk, unless it's a really low-input protocol. For single cell (and especially for single-end reads, like most 10x) UMIs become a lot more crucial. And if you can't collapse reads based on UMIs, simple deduplication should still be better than nothing.
I might be wrong, but I think current methods (e.g. Cell Ranger, STARsolo, etc) would not collapse reads that map to different positions while having the same UMI.
Also, are there methods that do PCR first and then transposase?
I've been working with others on a new single-cell experimental protocol in lab; in the first iteration, we didn't ligate on UMIs. Indeed, our results looked much better when I collapsed reads (and looked more consistent with the subsequent refinement when we did a UMI ligation reaction). So yes, I agree it's better, especially for low complexity libraries.
Current read mapping software WILL collapse reads that have the same UMI but different positions. This is because current methods only really care about uniquely mapped reads (so if two identical UMIs map to different positions within the same gene, they will be collapsed, since there's a very low probability that they represent two distinct molecules). (Note: for many tools, these are settings that can be adjusted by the user.) Again, I'm referring to different positions "within the same gene" (if it's different genes, well, of course you wouldn't collapse).
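As a rough illustration of the difference, here is a small sketch of UMI counting per (cell, gene), with an optional switch that also keys on position. This is not how Cell Ranger or STARsolo are actually implemented (real tools also do things like UMI error correction); the function `umi_counts`, the `position_aware` flag, and the toy records are all hypothetical.

```python
from collections import defaultdict

def umi_counts(records, position_aware=False):
    """Count distinct molecules per (cell, gene).

    records: iterable of (cell_barcode, gene, umi, position) tuples
             (hypothetical, simplified uniquely mapped reads).
    position_aware: if False, reads with the same UMI in the same cell and
             gene collapse to one molecule regardless of position; if True,
             the mapping position is part of the key as well.
    """
    seen = defaultdict(set)
    for cell, gene, umi, pos in records:
        key = (umi, pos) if position_aware else umi
        seen[(cell, gene)].add(key)
    return {cell_gene: len(keys) for cell_gene, keys in seen.items()}

records = [
    ("CELL1", "GeneA", "AACGT", 100),
    ("CELL1", "GeneA", "AACGT", 340),  # same UMI, different position, same gene
    ("CELL1", "GeneA", "GGTCA", 100),  # different UMI, same position
    ("CELL1", "GeneB", "AACGT", 900),  # same UMI, different gene: never merged
]

by_umi = umi_counts(records)
by_umi_and_pos = umi_counts(records, position_aware=True)
print(by_umi)          # GeneA: 2 molecules, GeneB: 1
print(by_umi_and_pos)  # GeneA: 3 molecules, GeneB: 1
```

With the default setting the two AACGT reads in GeneA count as one molecule, which is the behaviour described above; the same UMI in GeneB is always counted separately.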
The smart-seq methods do PCR then fragment; I assume you mean the opposite: fragment (immediately after RT) then PCR? Yes, many protocols (especially for bulk) do that [however, for tagmenting, the advantage of Tn5 post-PCR is being able to add adapter at the same step]. If you're using UMIs, just make sure you ligate on the UMIs before PCR (in some protocols like 10x, the UMIs will identify individual RNA molecules; in other protocols, the UMIs will tag individual cDNA fragments).
I can assure you there is also length bias with PCR on bulk RNA samples. RNA lengths span a very broad range, and the small RNAs will always outcompete the large ones, even with as few cycles as you can get away with. I've seen this with as few as 5 cycles, and that was with a pretty high cDNA input.