Hi,
I have raw counts from multiple single-cell RNA-seq experiments from different sources (different sequencing technologies, etc.). I need to generate a matrix of normalized counts for every experiment such that they are relatively similar to one another for a downstream ML exercise.
What are the recommended ways/tools to normalize data like this? One could argue that the datasets, and even individual cells, are not directly compared to one another since they will just be classified, but I still need a reasonable level of normalization across datasets generated with different technologies!
Thank you!
Do the different datasets contain roughly the same cell types, or is this rather a mishmash of experiments?
It's a bit of both: some experiments will contain similar cell types, and others are certainly a mish-mash...
I think you will have a hard time trying to tweak the data for your purposes. Single-cell assays can suffer from severe batch effects between experiments. It is possible to integrate assays with tools such as fastMNN, but this usually results in corrected values in PCA space, which can then be used to generate a unified clustering landscape. Both the obtained PCA-space values and the corrected "counts", if you will, are not recommended for anything but visualization, as the integration procedure creates dependencies between the data and can even change the direction of expression differences or produce negative values. I was asking about the data composition because one might try to regress out the different experiments to get corrected counts, but this probably only makes sense if the data between experiments are actually the same and only the "study" factor is the confounding event.

Sure, you can run any of the standard normalization techniques on your datasets, be it TPM or more elaborate, single-cell-specific ones such as the deconvolution method from scran, or model-based normalizations such as the sctransform variance stabilizing transformation, but the strong confounding will remain. There is a good chance that your results will suffer from the batch effects. It is often not possible to just collect unrelated experiments and pretend they can be pressed into a meaningful analysis as if confounding were not present. Maybe running whatever analysis you want on the individual datasets and then combining the results later, would that be an option?
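For concreteness, a minimal sketch of what the scran deconvolution normalization looks like in practice, assuming `counts` stands in for one experiment's gene-by-cell raw count matrix (the object name is a placeholder):

    library(SingleCellExperiment)
    library(scran)
    library(scuttle)

    sce <- SingleCellExperiment(assays = list(counts = counts))
    clusters <- quickCluster(sce)                        # pre-cluster broadly similar cells
    sce <- computeSumFactors(sce, clusters = clusters)   # deconvolution size factors
    sce <- logNormCounts(sce)                            # apply size factors, log-transform
    logcounts(sce)[1:5, 1:5]                             # normalized values live in "logcounts"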
Thank you for the detailed response! All of this makes perfect sense.

Just to answer your last question: no, it needs to be combined, as it will be used for training a classifier which requires all the data. In the absence of any established way of doing this, and as the comment below suggests something similar, I will opt for TPM normalisation (or VST normalisation, as this is something I am currently using for another project, although there all batches contain relatively similar samples). I will likely compare both methods and see how they affect downstream results.

I am wondering how confounding batch may be if proportionally higher transcripts are commonly high across different cells. The data will ultimately be log-transformed and then scaled to a smaller range, which is why I am more inclined to use a VST: it avoids outlier values that could mess with the squashing of values later on!
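To illustrate the log-and-squash step I mean (a sketch only; `norm_counts` is a placeholder for a gene-by-cell matrix of normalized counts, and min-max scaling per gene is just one way to get a smaller range):

    log_mat <- log1p(norm_counts)                       # log-transform
    gene_min <- apply(log_mat, 1, min)
    gene_range <- apply(log_mat, 1, max) - gene_min
    scaled <- sweep(log_mat, 1, gene_min, "-")          # shift each gene to start at 0
    scaled <- sweep(scaled, 1, pmax(gene_range, 1e-8), "/")  # rescale to [0, 1], guarding flat genes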
The VST is for UMI data though, so you would need to check whether all platforms provide that. I would use the method in scran though, tbh, as it corrects for compositional changes, which you almost certainly will have: different single-cell platforms have different ways of capturing and processing the transcripts (end-tagged vs. full-length), so in full-length data, longer transcripts will have inherently higher counts than shorter ones at equal expression levels, and plate-based technologies generally have higher depth per cell but fewer cells overall compared to droplet-based technologies.

I doubt that something as simple as TPM will do a good job; to my knowledge, no benchmark (either for bulk or single-cell) has ever explicitly recommended a simple per-million technique that corrects only for depth rather than composition. But as said, best to try and compare.
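If all platforms do turn out to provide UMI counts, a minimal sketch of the sctransform VST (assuming `umi_counts` is a placeholder for a gene-by-cell matrix of UMI counts):

    library(sctransform)

    vst_out <- vst(umi_counts)   # fits the regularized NB model per gene
    dim(vst_out$y)               # $y holds the Pearson residuals, i.e. the
                                 # variance-stabilized values for genes that pass filtering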
Thank you, I will do this. If all are UMI, then VST is viable and the one I would probably prefer (but I will still compare it to other approaches).

I am not familiar with the scran approach, so I will read the documentation now! May I ask what the specific method in scran is (as in the function, so I can read up on this fully and see how it works)? Presumably I will get back a matrix of normalised counts?

Thank you for the comments re: TPM. I have read similar comments elsewhere, which is what prompted me to ask this question and see if there were better/established ways.

Thank you for all your time, this has been extremely helpful!
Check http://bioconductor.org/books/release/OSCA/normalization.html#normalization-by-deconvolution
It is an awesome read for basically everything related to scRNA-seq within the Bioconductor universe.
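To answer the function question directly: in that OSCA workflow the deconvolution size factors come from scran::computeSumFactors() (typically preceded by scran::quickCluster()), and scuttle::logNormCounts() then applies them and returns an object whose "logcounts" assay is the matrix of normalized counts. A rough sketch of running it per experiment (`count_list` is a placeholder for a list of raw gene-by-cell count matrices, one per dataset):

    library(SingleCellExperiment)
    library(scran)
    library(scuttle)

    normalized <- lapply(count_list, function(mat) {
      sce <- SingleCellExperiment(assays = list(counts = mat))
      sce <- computeSumFactors(sce, clusters = quickCluster(sce))
      logcounts(logNormCounts(sce))   # gene-by-cell matrix of log-normalized counts
    })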