How to estimate total variance within and between RNA-seq datasets?
3.6 years ago
Gabriel ▴ 170

I have an scRNA-seq dataset, and I want to look at the proportional variance between "samples" or even different datasets, batches, and so on.

Some people do batch-effect correction and then show a bar plot of "percent explained variance by batch".

I want to do something similar, that is, find the percent explained variance for different conditions and comparisons.

My approach so far is similar to computing R² in linear regression:

  • Step 1: SSbetween = sum of squares across all samples
  • Step 2: SSwithin = sum of squares within samples
  • Step 3: % variance explained = SSbetween - SSwithin / SSbetween, or something similar.
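For reference, the standard one-way ANOVA decomposition for a single gene is SStotal = SSbetween + SSwithin, and the fraction of variance explained by the grouping is SSbetween / SStotal (equivalently 1 - SSwithin / SStotal). Below is a minimal sketch of that decomposition with NumPy. The function name and the toy values are made up for illustration and are not taken from any package.

```python
import numpy as np

def variance_explained_one_gene(values, groups):
    """One-way ANOVA decomposition for a single gene.

    Returns (fraction_explained, ss_between, ss_within, ss_total).
    """
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    grand_mean = values.mean()
    ss_total = np.sum((values - grand_mean) ** 2)
    ss_between = 0.0
    ss_within = 0.0
    for g in np.unique(groups):
        x = values[groups == g]
        # Between: group size times squared distance of group mean from grand mean.
        ss_between += x.size * (x.mean() - grand_mean) ** 2
        # Within: squared distances of each value from its own group mean.
        ss_within += np.sum((x - x.mean()) ** 2)
    return ss_between / ss_total, ss_between, ss_within, ss_total

# Hypothetical toy data: one gene measured in six cells from two batches.
expr = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
batch = ["a", "a", "a", "b", "b", "b"]
frac, ss_b, ss_w, ss_t = variance_explained_one_gene(expr, batch)
print(f"SSbetween={ss_b:.1f} SSwithin={ss_w:.1f} SStotal={ss_t:.1f} explained={frac:.3f}")
```

With these toy values SSbetween is 24 and SSwithin is 4, so the batch label explains 24/28 of the variance for this gene.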

The problem is that each sample has 20,000 genes, each with its own variance. So how do I estimate the total variance across all genes for a group of samples?

I know how to do it for one gene: it is simply sum((mean - xi)²), where xi is the expression of the gene in sample i. But there are many genes, each with its own variance. How do I calculate the total sample variance over all genes?

The simplest would be to sum them, but this would skew the variance for a few outlier samples with high expression / variance. What is the standard way to estimate group variance in batch correction or similar situations?
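One common convention (it is what redundancy analysis calls "total inertia") is exactly that sum: total variance is the sum of the per-gene variances, and the explained fraction is the summed SSbetween divided by the summed SStotal. If high-variance genes should not dominate, each gene can be scaled to unit variance first, in which case the pooled fraction becomes the average of the per-gene R² values. The sketch below shows both options. The function name, the `scale_genes` flag, and the toy matrix are assumptions for illustration only.

```python
import numpy as np

def pooled_variance_explained(X, groups, scale_genes=False):
    """Fraction of total variance (summed over genes) explained by `groups`.

    X is a cells x genes matrix. With scale_genes=True every gene is
    standardized first, so each gene contributes equally to the total.
    """
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    if scale_genes:
        sd = X.std(axis=0, ddof=1)
        keep = sd > 0  # drop constant genes, which cannot be scaled
        X = (X[:, keep] - X[:, keep].mean(axis=0)) / sd[keep]
    grand_mean = X.mean(axis=0)
    ss_total = np.sum((X - grand_mean) ** 2, axis=0)  # one value per gene
    ss_between = np.zeros(X.shape[1])
    for g in np.unique(groups):
        Xg = X[groups == g]
        ss_between += Xg.shape[0] * (Xg.mean(axis=0) - grand_mean) ** 2
    return ss_between.sum() / ss_total.sum()

# Hypothetical toy matrix: gene 0 differs by batch, gene 1 is a
# high-variance gene with no batch effect.
X_toy = np.array([
    [1.0, 10.0],
    [2.0, 30.0],
    [3.0, 20.0],
    [5.0, 10.0],
    [6.0, 30.0],
    [7.0, 20.0],
])
batch_toy = np.array(["a", "a", "a", "b", "b", "b"])

raw_frac = pooled_variance_explained(X_toy, batch_toy)
scaled_frac = pooled_variance_explained(X_toy, batch_toy, scale_genes=True)
print(f"unscaled: {raw_frac:.3f}  scaled: {scaled_frac:.3f}")
```

In this toy case the unscaled fraction is small because the noisy gene dominates the total, while the scaled fraction is the mean of the two per-gene R² values. Which one is appropriate depends on whether the data were already variance-stabilized (for example log-normalized) before this step.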

batch variance correction • 1.5k views

Can you link a reference for this?


I don't have any references / protocols. I am just taking inspiration from how % explained variance is calculated in PCA and linear regression.


"Some people do batch-effect correction and then show a bar plot of 'percent explained variance by batch'."

I was referring to that sentence. What you probably mean is the % variance explained by each principal component, no?


No. See, for example, here:

https://www.biorxiv.org/content/biorxiv/early/2020/10/28/2020.10.27.358283/F3.large.jpg

https://www.biorxiv.org/content/10.1101/2020.10.27.358283v1.full

% explained variance by batch is generally mentioned in papers on batch correction.

"We first considered the proportion of variance explained by treatment and batch effects before and after batch correction across all variables using pRDA. Efficient batch correction methods should generate data with a smaller proportion of batch associated variance and larger proportion of treatment variance compared to the original data."

It seems they are using limma's removeBatchEffect and ComBat. ComBat returns % explained variance by batch, but I don't understand how they calculate the total variance, because they first calculate the variance for each gene separately.
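To make the "across all variables" idea concrete, here is a rough sketch in the spirit of partial redundancy analysis, not the paper's or ComBat's actual implementation. It fits a linear model to every gene at once and reports, as a share of the total sum of squares over all genes, how much the fit improves when batch is added to a model that already contains treatment (and vice versa). All function names and the simulated data are hypothetical.

```python
import numpy as np

def one_hot(labels):
    """Indicator columns for a categorical variable (one column per level)."""
    labels = np.asarray(labels)
    levels = np.unique(labels)
    return (labels[:, None] == levels[None, :]).astype(float)

def rss(X, design):
    """Residual sum of squares, summed over all genes, after least squares."""
    coef, *_ = np.linalg.lstsq(design, X, rcond=None)
    return np.sum((X - design @ coef) ** 2)

def partial_fractions(X, treatment, batch):
    """Share of total variance attributable to each factor given the other."""
    X = np.asarray(X, dtype=float)
    intercept = np.ones((X.shape[0], 1))
    T = one_hot(treatment)[:, 1:]  # drop one level to avoid redundancy
    B = one_hot(batch)[:, 1:]
    ss_total = rss(X, intercept)   # summed per-gene SStotal
    rss_full = rss(X, np.hstack([intercept, T, B]))
    rss_no_batch = rss(X, np.hstack([intercept, T]))
    rss_no_treat = rss(X, np.hstack([intercept, B]))
    return {
        "batch": (rss_no_batch - rss_full) / ss_total,
        "treatment": (rss_no_treat - rss_full) / ss_total,
        "residual": rss_full / ss_total,
    }

# Hypothetical balanced design: 8 cells, 2 treatments x 2 batches, 5 genes.
rng = np.random.default_rng(0)
treatment = np.array(["ctl"] * 4 + ["trt"] * 4)
batch = np.array(["b1", "b1", "b2", "b2"] * 2)
X_sim = (
    rng.normal(size=(8, 5))
    + 2.0 * (treatment == "trt")[:, None]
    + 1.0 * (batch == "b2")[:, None]
)
fractions = partial_fractions(X_sim, treatment, batch)
print({k: round(v, 3) for k, v in fractions.items()})
```

In a balanced design like this toy one, the three shares add up to 1. In unbalanced designs batch and treatment overlap, so the shares no longer sum to 1 and the remainder is variance the two factors explain jointly.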

