Dear BioStars,
Aim: Create a cohort of synthetic normalized RNA-seq samples to serve as a global control cohort for differential expression analyses against test cases that have no controls or only uncertain ones.
Rationale: Running DESeq2 to contrast our full cohort (n = 1000s) against test cases (n = 10-100s) would be too slow. Our processes are automated, and this step would be a bottleneck.
Ideal outcome: Confidently create a synthetic database from a few parameters per gene (e.g. dispersion and mean) estimated from the real-world data.
Below I detail the steps we are currently using to create the synthetic dataset (kept largely code-free - we are after conceptual and statistical help).
We assume datasets have genes as rows and samples as columns.
Part 1 - Normalizing cohort database
- Convert raw counts to counts per million (helps with contrasting data from different batches/ experiments)
- Use DESeq2 to normalize all samples (trusted method)
- Extract normalized DESeq2 counts
- Pseudo-log the counts: log2(counts + 0.001)
- Cap outliers: if a value in a row exceeds 6 times the row mean, it is set to 6 times the row mean (a rough R sketch of Part 1 follows below)
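For concreteness, a rough R sketch of these steps (raw_counts, a genes x samples integer matrix, and sample_info, a matching sample table, are placeholder names; the design is intercept-only here since size factor estimation does not depend on the design):

library(edgeR)    # cpm()
library(DESeq2)

cpm_counts <- cpm(raw_counts)                                   # point 1: counts per million
dds <- DESeqDataSetFromMatrix(countData = round(cpm_counts),    # DESeq2 expects integer counts
                              colData   = sample_info,
                              design    = ~ 1)
dds      <- estimateSizeFactors(dds)                 # point 2: DESeq2 normalization
norm     <- counts(dds, normalized = TRUE)           # point 3: extract normalized counts
log_norm <- log2(norm + 0.001)                       # point 4: pseudo-log
log_norm <- pmin(log_norm, 6 * rowMeans(log_norm))   # point 5: cap at 6x the row mean
                                                     # (row-wise because the cap vector has length nrow)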
Part 2 - Getting Negative binomial parameters
- Multiply the logged normalised counts by 1000, then convert to integers
- Fit each gene/row individually with a negative binomial model
- Extract a k/dispersion and a mu/mean value per gene and divide each by 1000 (a sketch of these steps follows below)
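A rough sketch of this fitting step, continuing from the Part 1 sketch (MASS::fitdistr is one possible per-gene fitter; genes that cannot be fitted, e.g. because the logging produced negative values, come back as NA):

library(MASS)   # fitdistr()

int_counts <- round(log_norm * 1000)          # scale up and convert to integers

nb_params <- t(apply(int_counts, 1, function(x) {
  # wrap the fit: negative values or var <= mean can make the NB fit fail
  fit <- tryCatch(fitdistr(x, densfun = "negative binomial"),
                  error = function(e) list(estimate = c(size = NA, mu = NA)))
  fit$estimate[c("size", "mu")]
}))

k  <- nb_params[, "size"] / 1000              # per-gene dispersion (size), scaled back down
mu <- nb_params[, "mu"]   / 1000              # per-gene mean, scaled back down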
Part 3 - Creating the synthetic cohort
- Using rnbinom we construct negative-binomially distributed data: k = the per-gene dispersion, mu = the per-gene mean, and test_case is the test dataset we wish to contrast against the synthetic cohort. Exponential calculations are needed on the k and mu values to account for the logging in Part 1, point 4.
synthetic_cohort <- suppressWarnings(sapply(names(mu), function(gene) {
  # one vector of synthetic counts per gene, matched to the test-case sample number;
  # exp() assumes natural-log parameters, whereas Part 1 logged with log2
  rnbinom(n = ncol(test_case), size = exp(k[[gene]]), mu = exp(mu[[gene]]))
}))
# note: sapply() returns a samples x genes matrix here; transpose with t() for genes as rows
Part 4 - Performing Differential Expression Analysis
- Clean and combine the test_case (normalised in the same way as in part 1) and the synthetic cohort.
- Run a simple contrast between the test case and the synthetic cohort in DESeq2 (a sketch follows below).
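A minimal sketch of this contrast (combined_counts and coldata are placeholder names; coldata has a binary condition column with values Test_case / Synthetic_Cohort):

library(DESeq2)

coldata$condition <- factor(coldata$condition,
                            levels = c("Synthetic_Cohort", "Test_case"))   # synthetic cohort as reference

dds <- DESeqDataSetFromMatrix(countData = round(combined_counts),   # DESeq2 expects integer counts
                              colData   = coldata,
                              design    = ~ condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", "Test_case", "Synthetic_Cohort"))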
Current thoughts:
I am not sure about using the mean/mu values from the fitted negative binomial distributions. I think using mean values computed directly from the normalized data (in part 3) leads to more realistic values.
I do think the data might be getting over-normalised: logging, multiplying by 1000, and then exponentiating might be diluting the signal.
Any thoughts or advice on these steps would be very useful.
That sounds utterly unusual and non-standard. How many covariates do you have in the model? DESeq2, while I agree it is not super fast, is at least reasonably fast, and if you like you can always use limma as a linear approach to get faster estimates, or use glmGamPoi in DESeq2 for a speedup.
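For reference, a sketch of the glmGamPoi speedup mentioned above, assuming an existing DESeqDataSet dds with design ~ condition and the glmGamPoi package installed:

library(DESeq2)

# fitType = "glmGamPoi" hands dispersion estimation to glmGamPoi; the LRT is the recommended test with it
dds <- DESeq(dds, fitType = "glmGamPoi", test = "LRT", reduced = ~ 1)
res <- results(dds)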
ATpoint yes - I agree that it is indeed very non-standard. DESeq2 is used twice - at part 1, point 2 and part 4, point 2. In the first, the design is based on each individual sample. All samples are from the same tissues and were measured with very similar methods. Adding the individual experiments as a covariate could cause collinearity issues, I think. In the latter, we use a straight contrast between the synthetic cohort and the test case.
Part 1, point 2 - runId = unique sample name
Part 4, point 2 - condition column is binary: Test_case or Synthetic_Cohort
Moving on to the next part of your comment: it is not just the running speed of the contrasts, but also the time it would take to download the data. Furthermore, if we have tens of contrasts to make in one analysis, the time spent running DESeq2 would add up. Overall, using our current real-world global database directly would not be the optimal solution.
I cannot say I totally follow here, but always be aware that a reviewer could ask you to demonstrate that your method is robust. How are you going to address this? Usually, new methods need some benchmarking against established standards. You have my comment and the answer of Gordon Smyth (the driving force behind edgeR, limma and other software packages), which both discourage you from going down that route. I would really recommend following the advice to use limma-trend and extracting the contrasts you need. If you have some sort of confounding in your experimental design that is linear with your main condition, then no simulation or trick will remove that anyway.
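For reference, a minimal limma-trend sketch along the lines suggested above, assuming raw_counts (genes x samples) and a coldata data frame with a two-level condition factor (placeholder names):

library(edgeR)
library(limma)

dge    <- DGEList(counts = raw_counts)
dge    <- calcNormFactors(dge)                      # TMM normalization factors
logCPM <- cpm(dge, log = TRUE, prior.count = 3)     # log-CPM values for limma-trend

design <- model.matrix(~ condition, data = coldata)
fit    <- lmFit(logCPM, design)
fit    <- eBayes(fit, trend = TRUE)                 # the "trend" part of limma-trend
topTable(fit, coef = 2)                             # coefficient 2 = the condition effect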