Hi,
I have single-cell RNA sequencing data from similar tissue, one dataset collected with SmartSeq2 (full transcript length, no UMIs), and another dataset collected from 10X (3' end, with UMIs).
I am doing a standard log(x/n + 1) normalization for the 10X data. However, for the SmartSeq2 data, I am unsure how to normalize. Should I correct for gene-length bias? When I apply log(x/n + 1) to the SmartSeq2 data, I get significant differences in gene expression between 10X and SmartSeq2.
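To make the question concrete, here is a minimal sketch of the two normalizations being compared. It assumes that n is each cell's total count, that a scale factor is applied before the log (10,000 for UMI counts and 1,000,000 for a TPM-like value are common conventions, not something stated above), and that a gene-length table is available. The function names, toy counts, and gene lengths are all made up for illustration. Real data would normally go through an array library or a single-cell toolkit rather than plain lists.

```python
import math

def log_normalize(counts, scale=1e4):
    """Return log(x / n * scale + 1), where n is each cell's total.

    counts: list of gene rows, each a list of per-cell values
            (rows are genes, columns are cells).
    Assumes every cell has a non-zero total.
    """
    n_cells = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_cells)]
    return [[math.log1p(row[j] / totals[j] * scale) for j in range(n_cells)]
            for row in counts]

def length_normalize(counts, gene_lengths_bp):
    """Divide each gene's read count by its length in kilobases."""
    return [[x / (length / 1000.0) for x in row]
            for row, length in zip(counts, gene_lengths_bp)]

# Toy 10X-style UMI counts (invented): 3 genes x 2 cells.
tenx_counts = [[1, 0], [2, 3], [7, 7]]
tenx_norm = log_normalize(tenx_counts)

# Toy full-length read counts (invented) with hypothetical gene lengths.
smartseq_counts = [[100, 50], [200, 100], [700, 850]]
gene_lengths_bp = [1000, 2000, 7000]

# Length-correct first, then depth-normalize: a TPM-like value before the log.
smartseq_norm = log_normalize(
    length_normalize(smartseq_counts, gene_lengths_bp), scale=1e6
)
```

In the first toy SmartSeq2 cell, the raw counts differ sevenfold between genes only because the genes differ in length, and the length correction makes them equal. UMI counts from 3' protocols do not scale with gene length in the same way, which is why the same correction is not applied to the 10X matrix here.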
My goal is to integrate the 10X and Smart-Seq datasets and perform clustering. I'd like the two datasets to match as closely as possible before integration. I have a count matrix for each (rows are genes, columns are cells).
Basically, what is the recommended way to normalize SmartSeq2 expression data?
Thanks!
I don't believe it would be advisable to compare low-throughput and high-throughput data. The gene coverage is going to be completely different, and you're going to lose information, at best.
But if you'd like to do it anyway, I suppose you would reduce your gene coverage to match what you get for your 10x data and then re-scale it, though I believe that will create abnormal patterns. I also don't believe that will make up for the difference in how amplification bias is handled, which 10x does pretty well (it has UMIs), unlike Smart-seq2.
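One possible reading of "reduce your gene coverage and re-scale" is sketched below: keep only the genes present in both datasets, then recompute each cell's values as fractions of its total over the kept genes. This is an interpretation, not an established procedure, and the helper names, gene names, and counts are invented for illustration.

```python
def subset_to_shared_genes(genes_a, counts_a, genes_b, counts_b):
    """Keep only genes present in both datasets, in dataset A's order.

    genes_*: list of gene names, one per row of the matching counts matrix.
    counts_*: list of gene rows, each a list of per-cell counts.
    """
    index_a = {g: i for i, g in enumerate(genes_a)}
    index_b = {g: i for i, g in enumerate(genes_b)}
    shared = [g for g in genes_a if g in index_b]
    sub_a = [counts_a[index_a[g]] for g in shared]
    sub_b = [counts_b[index_b[g]] for g in shared]
    return shared, sub_a, sub_b

def rescale_to_fractions(counts):
    """Rescale each cell (column) so its values sum to 1 over the kept genes.

    Cells with a zero total are left as all zeros.
    """
    n_cells = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_cells)]
    return [[row[j] / totals[j] if totals[j] else 0.0 for j in range(n_cells)]
            for row in counts]

# Toy data (invented): Smart-seq2 detects more genes than 10x.
genes_ss = ["A", "B", "C", "D"]
counts_ss = [[5, 1], [10, 0], [3, 3], [30, 4]]
genes_10x = ["B", "D", "E"]
counts_10x = [[2, 1], [6, 3], [1, 0]]

shared, ss_sub, tenx_sub = subset_to_shared_genes(
    genes_ss, counts_ss, genes_10x, counts_10x
)
ss_frac = rescale_to_fractions(ss_sub)
tenx_frac = rescale_to_fractions(tenx_sub)
```

Note that dropping genes changes every cell's total, so the re-scaled values are only comparable within the shared gene set. That is part of why this step can introduce the odd patterns mentioned above.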
To go even further, I think most of the difference you're seeing comes from amplification bias. (And you likely won't be able to evaluate the source of the difference beyond that.)
I agree with yhdist. I've looked into this a lot, and my advice would simply be not to combine them.
There are many differences (including different technical biases) between 10x and smartseq -- and, honestly speaking, I still don't know where a lot of the technical biases present in each technology arise from (I don't think anyone really knows). You could use a batch integration tool (like Harmony) but I'd recommend against it -- you won't really gain any new info and will probably just end up butchering your biological signal. I'm sure you'd get a pretty good methods paper published if you really discover an optimal way to use both technologies with one another.
What I'd recommend: Just look at them separately! You'll probably see the same cell types in both if they're from the same tissue. And, maybe with smart-seq, you can use transcript-level information to further resolve your gene-level cell types.