I have an scRNA-seq dataset, and I want to look at the proportional variance between "samples" or even different datasets, batches, and so on.
Some people do batch-effect correction and then show a bar plot of "percent explained variance by batch".
I want to do something similar, that is, find the percent explained variance between different conditions and comparisons.
My protocol so far is similar to finding R² in linear regression:
- Step 1: SStotal = sum of squares across all samples
- Step 2: SSwithin = sum of squares within samples
- Step 3: % variance explained = (SStotal - SSwithin) / SStotal, or something similar.
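For a single gene, the three steps above can be sketched roughly as follows. This is only a minimal illustration with NumPy, assuming `x` holds one gene's expression per observation and `groups` holds the sample/batch label of each observation. The function name `explained_variance_one_gene` and the toy values are made up for the example. It relies on the usual decomposition (total sum of squares = between-group + within-group), so the fraction explained is 1 - SSwithin / SStotal.

```python
import numpy as np

def explained_variance_one_gene(x, groups):
    """Fraction of one gene's variance explained by the grouping (R^2)."""
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)
    # Step 1: total sum of squares around the overall mean
    ss_total = np.sum((x - x.mean()) ** 2)
    # Step 2: within-group sum of squares around each group's own mean
    ss_within = 0.0
    for g in np.unique(groups):
        xg = x[groups == g]
        ss_within += np.sum((xg - xg.mean()) ** 2)
    # Step 3: fraction explained (undefined for a constant gene)
    if ss_total == 0:
        return float("nan")
    return 1.0 - ss_within / ss_total

# Toy data (hypothetical): two groups with clearly different means
x = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
groups = ["A", "A", "A", "B", "B", "B"]
r2 = explained_variance_one_gene(x, groups)
print(round(r2, 3))
```

With these toy numbers the total sum of squares is 58 and the within-group sum of squares is 4, so most of the variance is attributed to the grouping.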
The problem is that for each sample, there are 20,000 genes, each with its own variance. So how do I estimate the total variance of all genes for a group of samples?
I know how to do it for one gene: it is simply sum( (mean - xi)² ), where xi is the expression of the gene in sample i. But since there are many genes, each has its own variance. How do I calculate the total sample variance for all genes?
The simplest approach would be to sum them, but then a few outlier samples with high expression / variance would skew the total. What is the standard way to estimate group variance in batch correction or similar situations?
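To make the trade-off concrete, here is a small sketch comparing two ways of combining genes: pooling the sums of squares over all genes (the "just sum them" option) and averaging the per-gene fractions, which is equivalent to scaling every non-constant gene to the same total variance first. It assumes a matrix `X` with observations in rows and genes in columns. The function names and the toy data are hypothetical, and neither summary is claimed here to be the standard used in batch-correction papers.

```python
import numpy as np

def sums_of_squares(X, groups):
    """Per-gene total and within-group sums of squares.

    X: array of shape (n_observations, n_genes); groups: one label per row.
    """
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    ss_total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    ss_within = np.zeros(X.shape[1])
    for g in np.unique(groups):
        Xg = X[groups == g]
        ss_within += ((Xg - Xg.mean(axis=0)) ** 2).sum(axis=0)
    return ss_total, ss_within

def explained_variance_summaries(X, groups):
    """Return (pooled fraction, mean of per-gene fractions)."""
    ss_total, ss_within = sums_of_squares(X, groups)
    ss_between = ss_total - ss_within
    # Pooled: sum over genes first, so high-variance genes dominate
    pooled = ss_between.sum() / ss_total.sum()
    # Per gene: every non-constant gene gets equal weight
    keep = ss_total > 0
    mean_per_gene = (ss_between[keep] / ss_total[keep]).mean()
    return pooled, mean_per_gene

# Toy data (hypothetical): gene 1 differs between groups,
# gene 2 has a large variance that is unrelated to the groups.
X_toy = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 200.0],
    [7.0, 100.0],
    [8.0, 300.0],
    [9.0, 200.0],
])
groups_toy = np.array(["A", "A", "A", "B", "B", "B"])
pooled, mean_per_gene = explained_variance_summaries(X_toy, groups_toy)
print(pooled, mean_per_gene)
```

In this toy case the pooled fraction is close to zero because the high-variance gene swamps the total, while the per-gene average is just under one half. That is the skew described above, seen at the gene level.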
Can you link a reference for this?
I don't have any references / protocols. I'm just taking inspiration from how % explained variance is calculated in PCA and linear regression.
I was referring to that sentence. What you probably mean is the % variance explained by each principal component, no?
No. See e.g. here:
https://www.biorxiv.org/content/biorxiv/early/2020/10/28/2020.10.27.358283/F3.large.jpg
https://www.biorxiv.org/content/10.1101/2020.10.27.358283v1.full
% explained variance by batch is generally mentioned in papers on batch correction.
" We first considered the proportion of variance explained by treatment and batch effects before and after batch correction across all variables using pRDA. Efficient batch correction methods should generate data with a smaller proportion of batch associated variance and larger proportion of treatment variance compared to the original data. "
It seems they are using limma's removeBatchEffect and ComBat. ComBat returns % explained variance by batch, but I don't understand how they calculate the total variance, because they first calculate the variance per gene.
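For what it is worth, pRDA (partial redundancy analysis), as mentioned in the quoted passage, works on all variables at once: the total variance is the sum of the per-variable variances, and the explained part is what a multivariate linear regression on the factor of interest captures after the other factor has been regressed out. Below is a rough NumPy sketch of that idea. It is not the code from the preprint and not any package's implementation. The helper names (`one_hot`, `residualize`, `partial_explained_fraction`) and the toy data are hypothetical.

```python
import numpy as np

def one_hot(labels):
    """Dummy-code a label vector, dropping the first level."""
    labels = np.asarray(labels)
    levels = np.unique(labels)
    return (labels[:, None] == levels[None, 1:]).astype(float)

def residualize(A, Z):
    """Residuals of A after least-squares regression on Z plus an intercept."""
    Z1 = np.column_stack([np.ones(len(A)), Z])
    coef, *_ = np.linalg.lstsq(Z1, A, rcond=None)
    return A - Z1 @ coef

def partial_explained_fraction(Y, effect, covariate):
    """Fraction of the total variance of Y (summed over all columns)
    explained by `effect` after removing `covariate`."""
    Y = np.asarray(Y, dtype=float)
    E = one_hot(effect)
    C = one_hot(covariate)
    Yc = Y - Y.mean(axis=0)
    total = (Yc ** 2).sum()          # total sum of squares over all genes
    Y_res = residualize(Yc, C)       # remove the covariate from the data
    E_res = residualize(E, C)        # ...and from the effect of interest
    coef, *_ = np.linalg.lstsq(E_res, Y_res, rcond=None)
    fitted = E_res @ coef
    return (fitted ** 2).sum() / total

# Toy data (hypothetical): balanced design, gene 1 carries only a batch
# effect and gene 2 carries only a treatment effect.
batch = ["b1"] * 4 + ["b2"] * 4
treatment = ["ctrl", "ctrl", "trt", "trt"] * 2
Y_toy = np.array([
    [0.0, 0.0], [0.0, 0.0], [0.0, 4.0], [0.0, 4.0],
    [2.0, 0.0], [2.0, 0.0], [2.0, 4.0], [2.0, 4.0],
])
batch_frac = partial_explained_fraction(Y_toy, batch, treatment)
treat_frac = partial_explained_fraction(Y_toy, treatment, batch)
print(batch_frac, treat_frac)
```

Because the total here is a plain sum over genes, high-variance genes carry more weight, so whether the genes are scaled beforehand changes the answer and is worth stating alongside any reported percentage.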