4.0 years ago
A. Domingues
Hi all,
I am trying to find cell markers to distinguish populations A and B using publicly available single-cell RNA-seq data. The snag is that these populations were identified in different studies, and the data are available as raw counts (10x) for one study and as TPMs (Smart-seq*) for the other.
Any suggestions on how to integrate these datasets so I can perform DE downstream?
I was considering using Seurat and SCTransform. Any objections?
*I think. It is not clear from the paper's methods but they sequenced the library with 75PE reads.
Yes. sctransform fits its model on raw UMI counts, not on TPMs, so you probably cannot do what you plan to do. If the data are published, can't you download the raw data and process it yourself? That is cumbersome, but forcing TPMs and raw counts into one analysis is, imho, not only inappropriate but also a waste of time: even if you technically get results out of it, they will not be reliable. Alternatively, email the authors and ask for a raw count matrix. With that you could integrate the datasets, but integration requires that at least some populations are shared between the studies so that anchors (or whatever your method uses) can be found. Integrating two completely different populations from different studies is probably not going to be reliable.
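To illustrate why TPMs cannot stand in for raw counts, here is a minimal Python sketch with toy numbers. The function names (`counts_to_tpm`, `looks_like_raw_counts`, `looks_like_tpm`) and the gene lengths are made up for this example and are not part of Seurat or sctransform. It uses the standard TPM definition (counts divided by gene length in kb, then scaled so each cell sums to one million) and assumes every cell has at least one count.

```python
import numpy as np

def counts_to_tpm(counts, gene_lengths_bp):
    """Convert a genes x cells count matrix to TPM (standard definition)."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float)[:, None] / 1000.0
    rate = counts / lengths_kb  # counts per kilobase of gene length
    # Scale each cell (column) so that it sums to one million.
    return rate / rate.sum(axis=0, keepdims=True) * 1e6

def looks_like_raw_counts(matrix):
    """Heuristic only: raw counts are non-negative integers."""
    m = np.asarray(matrix, dtype=float)
    return bool(np.all(m >= 0) and np.all(m == np.round(m)))

def looks_like_tpm(matrix, tol=1e-3):
    """Heuristic only: every cell (column) of a TPM matrix sums to ~1e6."""
    m = np.asarray(matrix, dtype=float)
    return bool(np.allclose(m.sum(axis=0), 1e6, rtol=tol))

# Toy data: 3 genes x 2 cells; the second cell is sequenced 10x deeper.
counts = np.array([[10, 100],
                   [20, 200],
                   [30, 300]])
gene_lengths_bp = np.array([1000, 2000, 3000])  # hypothetical lengths

tpm = counts_to_tpm(counts, gene_lengths_bp)

# Both cells end up with identical TPM profiles, so the depth information
# that count-based models rely on is no longer present in the TPM matrix.
same_profile = bool(np.allclose(tpm[:, 0], tpm[:, 1]))
```

Running the two heuristics on a downloaded matrix is a quick sanity check of what a repository actually provides before any modelling is attempted.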
Cheers @ATpoint. This is what I feared. I will have to go back to the drawing board.
Just to add another note after doing some more research: Seurat doesn't recommend using SCTransform values for differential expression, so sctransform is not even necessary for this.

Yes, that is true. The reason you run SCTransform in the integration context is to select features; it was not clear to me whether you want to integrate or not. DE is typically done on the raw counts, which are then run through an appropriate framework such as edgeR, but the problem with the batch effect stands.
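The batch problem can be made concrete with a small sketch. It assumes the situation described in the question, where each population comes from exactly one study; the cell labels and the `design_matrix` helper are hypothetical and only serve the illustration. When population and study coincide, a design with both terms is rank-deficient, so no model (edgeR or otherwise) can separate the biological difference from the study effect.

```python
import numpy as np

def design_matrix(population, study):
    """Build an intercept + population + study design (toy helper)."""
    intercept = np.ones(len(population))
    is_b = (population == "B").astype(float)   # 1 for population B
    is_s2 = (study == "s2").astype(float)      # 1 for the second study
    return np.column_stack([intercept, is_b, is_s2])

study = np.array(["s1", "s1", "s1", "s2", "s2", "s2"])

# Case 1: population A only in study 1, population B only in study 2.
population_confounded = np.array(["A", "A", "A", "B", "B", "B"])
confounded = design_matrix(population_confounded, study)
rank_confounded = int(np.linalg.matrix_rank(confounded))

# Case 2: both populations are present in both studies.
population_shared = np.array(["A", "B", "A", "A", "B", "B"])
shared = design_matrix(population_shared, study)
rank_shared = int(np.linalg.matrix_rank(shared))

# Three coefficients need rank 3 to be estimable.
print("confounded design rank:", rank_confounded)
print("shared design rank:", rank_shared)
```

In the confounded case the population and study columns are identical, which is the linear-model view of why shared populations between the studies are needed.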