Dear BioStars,
Aim: Create a cohort of synthetic normalized RNA-seq samples to serve as a global control cohort for differential expression analyses against test cases that have no controls or only uncertain ones.
Rationale: Running DESeq2 to contrast our full cohort (n = 1000s) against test cases (n = 10-100s) would be too slow. Our processes are automated, and this step would be a bottleneck.
Ideal outcome: Confidently create a synthetic database from a few parameters per gene (e.g. dispersion and mean) estimated from the real-world data.
Below I detail the steps we are currently using to create the synthetic dataset (kept largely code-free - we are after conceptual and statistical help).
We assume datasets have genes as rows and samples as columns.
Part 1 - Normalizing cohort database
- Convert raw counts to counts per million (helps with contrasting data from different batches/ experiments)
- Use DESeq2 to normalize all samples (trusted method)
- Extract normalized DESeq2 counts
- Pseudo-log the counts: log2(counts + 0.001)
- Cap outliers: if a value in a row exceeds 6 times the row mean, it is set to 6 times the row mean (a rough R sketch of Part 1 follows below)
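For concreteness, a rough R sketch of these steps (raw_counts, a genes x samples integer matrix, and sample_info, a matching sample table, are placeholder names; the design is intercept-only here since size factor estimation does not depend on the design):

library(edgeR)    # cpm()
library(DESeq2)

cpm_counts <- cpm(raw_counts)                                   # point 1: counts per million
dds <- DESeqDataSetFromMatrix(countData = round(cpm_counts),    # DESeq2 expects integer counts
                              colData   = sample_info,
                              design    = ~ 1)
dds      <- estimateSizeFactors(dds)                 # point 2: DESeq2 normalization
norm     <- counts(dds, normalized = TRUE)           # point 3: extract normalized counts
log_norm <- log2(norm + 0.001)                       # point 4: pseudo-log
log_norm <- pmin(log_norm, 6 * rowMeans(log_norm))   # point 5: cap at 6x the row mean
                                                     # (row-wise because the cap vector has length nrow)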
Part 2 - Getting Negative binomial parameters
- Multiply the logged normalised counts by 1000, then convert to integers
- Fit each gene/row individually with a negative binomial model
- Extract a k/dispersion and a mu/mean value per gene and divide each by 1000 (a sketch of these steps follows below)
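A rough sketch of this fitting step, continuing from the Part 1 sketch (MASS::fitdistr is one possible per-gene fitter; genes that cannot be fitted, e.g. because the logging produced negative values, come back as NA):

library(MASS)   # fitdistr()

int_counts <- round(log_norm * 1000)          # scale up and convert to integers

nb_params <- t(apply(int_counts, 1, function(x) {
  # wrap the fit: negative values or var <= mean can make the NB fit fail
  fit <- tryCatch(fitdistr(x, densfun = "negative binomial"),
                  error = function(e) list(estimate = c(size = NA, mu = NA)))
  fit$estimate[c("size", "mu")]
}))

k  <- nb_params[, "size"] / 1000              # per-gene dispersion (size), scaled back down
mu <- nb_params[, "mu"]   / 1000              # per-gene mean, scaled back down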
Part 3 - Creating the synthetic cohort
- Using rnbinom we construct negative-binomially distributed data: k = the per-gene dispersion, mu = the per-gene mean, and test_case is the test dataset we wish to contrast against the synthetic cohort. Exponential calculations are needed on the k and mu values to account for the logging in Part 1, point 4.
synthetic_cohort <- suppressWarnings(sapply(names(mu), function(gene) {
  # one vector of synthetic counts per gene, matched to the test-case sample number;
  # exp() assumes natural-log parameters, whereas Part 1 logged with log2
  rnbinom(n = ncol(test_case), size = exp(k[[gene]]), mu = exp(mu[[gene]]))
}))
# note: sapply() returns a samples x genes matrix here; transpose with t() for genes as rows
Part 4 - Performing Differential Expression Analysis
- Clean and combine the test_case (normalised in the same way as in part 1) and the synthetic cohort.
- Run a simple contrast between the test case and the synthetic cohort in DESeq2 (a sketch follows below).
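A minimal sketch of this contrast (combined_counts and coldata are placeholder names; coldata has a binary condition column with values Test_case / Synthetic_Cohort):

library(DESeq2)

coldata$condition <- factor(coldata$condition,
                            levels = c("Synthetic_Cohort", "Test_case"))   # synthetic cohort as reference

dds <- DESeqDataSetFromMatrix(countData = round(combined_counts),   # DESeq2 expects integer counts
                              colData   = coldata,
                              design    = ~ condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", "Test_case", "Synthetic_Cohort"))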
Current thoughts:
I am not sure about using the mean/mu values from the fitted negative binomial distributions. I think using mean values computed directly from the normalized data (in part 3) leads to more realistic values.
I do think the data might be getting over-normalised: logging, multiplying by 1000, and then exponentiating might be diluting the signal.
Any thoughts or advice on these steps would be very useful.
That sounds utterly unusual and non-standard. How many covariates do you have in the model? DESeq2, while I agree it is not super fast, is at least reasonably fast, and if you like you can always use limma as a linear approach to get faster estimates, or use glmGamPoi in DESeq2 for a speedup.
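For reference, a sketch of the glmGamPoi speedup mentioned above, assuming an existing DESeqDataSet dds with design ~ condition and the glmGamPoi package installed:

library(DESeq2)

# fitType = "glmGamPoi" hands dispersion estimation to glmGamPoi; the LRT is the recommended test with it
dds <- DESeq(dds, fitType = "glmGamPoi", test = "LRT", reduced = ~ 1)
res <- results(dds)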
ATpoint yes - I agree that it is indeed very non-standard. DESeq2 is used twice - at part 1, point 2 and part 4, point 2. In the first, the design is based on each individual sample. All samples are from the same tissues and were measured with very similar methods. Adding the individual experiments as a covariate could cause collinearity issues, I think. In the latter, we use a straight contrast between the synthetic cohort and the test case.
Part 1, point 2 - runId = unique sample name
Part 4, point 2 - condition column is binary: Test_case or Synthetic_Cohort
Moving on to the next part of your comment: it is not just the running speed of the contrasts, but also the time it would take to download the data. Furthermore, if we have tens of contrasts to make in one analysis, the time spent running DESeq2 would add up. Overall, using our current real-world global database directly would not be the optimal solution.
I cannot say I totally follow here, but always be aware that a reviewer could ask you to demonstrate that your method is robust. How are you going to address this? Usually, new methods need some benchmarking against established standards. You have my comment and the answer of Gordon Smyth (the driving force behind edgeR, limma and other software packages), which both discourage you from going down that route. I would really recommend following the advice to use limma-trend and extracting the contrasts you need. If you have some sort of confounding in your experimental design that is linear with your main condition, then no simulation or trick will remove that anyway.
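For reference, a minimal limma-trend sketch along the lines suggested above, assuming raw_counts (genes x samples) and a coldata data frame with a two-level condition factor (placeholder names):

library(edgeR)
library(limma)

dge    <- DGEList(counts = raw_counts)
dge    <- calcNormFactors(dge)                      # TMM normalization factors
logCPM <- cpm(dge, log = TRUE, prior.count = 3)     # log-CPM values for limma-trend

design <- model.matrix(~ condition, data = coldata)
fit    <- lmFit(logCPM, design)
fit    <- eBayes(fit, trend = TRUE)                 # the "trend" part of limma-trend
topTable(fit, coef = 2)                             # coefficient 2 = the condition effect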