Question

Normalize RNA-seq pseudobulk after matrix creation

2.2 years ago

Chironex ▴ 50

Hi,

I have a single-cell RNA-seq dataset consisting of 10 samples (5 clusters, 2 days) that I aggregated into a pseudobulk RNA-seq matrix for downstream analysis.

After aggregating the data, I generated a matrix (more than 25,000 genes; I show only ten):

tab[,1:10]
           Xkr4   Gm1992   Gm19938   Gm37381        Rp1      Sox17   Gm37587     Mrpl15     Lypla1      Tcea1
A_1 100.3983403 2.094996 8.0508725 0.4740132 11.7446134 50.5215017 0.8604795 541.788327 204.612026 866.087779
A_2   3.7714064 0.000000 0.0000000 0.0000000  0.0000000  2.9878470 0.0000000  20.446530  11.856091  36.950256
B_1  26.3195842 1.942903 0.4094375 0.0000000  0.0000000  9.5934363 0.0000000 246.942944 129.289099 470.376530
B_2   4.5693646 1.343612 2.4839935 0.0000000  1.9287930  1.3162788 0.0000000 131.049776  58.809368 202.904801
C_1   3.7508510 1.285488 1.2854877 0.0000000  0.0000000  1.7040939 0.0000000   6.820192   1.231842   8.273079
C_2  65.0411039 5.427881 3.6083771 0.0000000  0.6973232  2.9113581 0.0000000 391.903116 166.712614 553.954149
D_1   0.4635404 0.000000 0.0000000 0.0000000  0.0000000  0.0000000 0.0000000   3.897851   3.142519  10.284732
D_2  21.5049262 1.197321 3.1446334 0.0000000  0.4733959  0.7565153 0.0000000 229.970330 104.761129 346.445247
E_1  66.9866197 2.038161 5.2060244 0.0000000  6.3575404 28.5580183 1.3065041  91.232942  42.103075 127.664432
E_2   0.4813143 0.000000 0.0000000 0.0000000  0.0000000  5.1772589 0.0000000  18.813332   8.889335  13.623546

As you can see, the expression profile differs between samples, because the number of cells per sample is different:

NUMBER OF CELLS:

A_1  A_2  B_1  B_2  C_1  C_2  D_1  D_2  E_1  E_2 
1322   56  733  416   16 1004   19  637  226   30

How can I manually normalize the data to obtain values in the range 0 to 1? Thanks.

scRNA-seq pseudobulk r single-cell • 2.7k views

ADD COMMENT • link updated 18 months ago by Ram 45k • written 2.2 years ago by Chironex ▴ 50

Ram · Answer 1 · 2023-03-12

2.2 years ago

ATpoint 88k

In my view, having more cells is effectively no different from having more reads in bulk data, so the standard methods implemented in DESeq2 and edgeR will be just fine. That is at least how I use pseudobulks. As a side note, your count matrix is non-standard: the consensus in the field is to have samples/cells as columns and features/genes as rows, and this is the format the tools mentioned accept. If you want relative expression, you could run vst from DESeq2 followed by a Z-transformation (t(scale(t(vst)))). Don't normalize raw counts between 0 and 1; that's pointless, imo. What's the analysis goal?
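To make the orientation and the Z-transformation concrete, here is a minimal base R sketch. The toy values and the names `counts`, `log_cpm`, and `z` are made up for illustration, and log2 counts per million is used only as a simple stand-in for DESeq2's vst, which is a different transformation.

```r
# Toy pseudobulk sums in the layout from the question:
# samples in rows, genes in columns. All values are made up.
tab <- matrix(
  c(100, 540, 205, 866,
      4,  20,  12,  37,
     26, 247, 129, 470),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A_1", "A_2", "B_1"),
                  c("Xkr4", "Mrpl15", "Lypla1", "Tcea1"))
)

# Transpose to the conventional genes x samples layout.
counts <- t(tab)

# Library-size normalization: log2 counts per million with a pseudocount of 1.
# NOTE: a simple stand-in for DESeq2's vst, not the same transformation.
lib_size <- colSums(counts)
log_cpm <- log2(t(t(counts) / lib_size) * 1e6 + 1)

# Row-wise Z-transformation: each gene gets mean 0 and sd 1 across samples.
z <- t(scale(t(log_cpm)))
```

The double transpose is needed because scale() works on columns, while the genes sit in rows.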

If you're worried that the results might not be reliable because of the very different cell numbers, you could subsample to the lowest cell number, repeat the analysis, and then see whether the results from the full analysis can be reproduced.
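The subsampling check could be sketched as follows in base R. Everything here is hypothetical: the simulated `cell_counts` matrix (genes x cells), the `sample_id` labels, and the cell numbers. With real data, the per-cell counts would come from the single-cell object.

```r
set.seed(1)  # make the random subsampling reproducible

# Hypothetical toy data: a genes x cells count matrix and one sample label per cell.
n_cells <- c(A_1 = 40, A_2 = 8, B_1 = 25)
sample_id <- rep(names(n_cells), times = n_cells)
cell_counts <- matrix(rpois(5 * length(sample_id), lambda = 3), nrow = 5,
                      dimnames = list(paste0("gene", 1:5), NULL))

# Keep the same number of cells per sample: the size of the smallest sample.
n_min <- min(table(sample_id))
keep <- unlist(lapply(split(seq_along(sample_id), sample_id),
                      function(idx) idx[sample.int(length(idx), n_min)]))

# Sum counts over the retained cells of each sample (pseudobulk by sum),
# returning a genes x samples matrix.
pb_sub <- t(rowsum(t(cell_counts[, keep]), group = sample_id[keep]))
```

Repeating this with a few different seeds and rerunning the downstream analysis on `pb_sub` gives a rough idea of how stable the results are.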

ADD COMMENT • link 2.2 years ago by ATpoint 88k

Hi, thank you for your response. What I am trying to do is construct an MDS plot to show the similarity between clusters. Some colleagues told me that the samples must be normalized; otherwise the MDS plot reflects proximity based on the number of cells (the more cells, the higher the expression values). They also told me that when aggregating the data it would be more appropriate to use fun = "mean" instead of the sum:

pb <- aggregateData(seurart_obj,
  assay = "logcounts", fun = "mean", scale = FALSE,
  by = "sample_id")

ADD REPLY • link updated 18 months ago by Ram 45k • written 2.2 years ago by Chironex ▴ 50

I know the approach with the sum rather than the average, as the sum preserves the integer nature and distribution of the counts, which averages do not. See, for example, the OSCA workflow from Bioconductor: http://bioconductor.org/books/3.15/OSCA.multisample/multi-sample-comparisons.html#creating-pseudo-bulk-samples
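A small illustration of the difference, using made-up per-cell counts and base R's rowsum() rather than any particular pseudobulk package (the names `cell_counts`, `sample_id`, `pb_sum`, and `pb_mean` are hypothetical):

```r
# Hypothetical toy counts: 2 genes x 5 cells.
# Cells 1-3 belong to sample A, cells 4-5 to sample B.
cell_counts <- matrix(c(0, 2, 1, 0, 3, 1, 4, 0, 2, 5), nrow = 2,
                      dimnames = list(c("gene1", "gene2"), NULL))
sample_id <- c("A", "A", "A", "B", "B")

# Pseudobulk by sum: stays integer-valued, like counts from a bulk library.
pb_sum <- t(rowsum(t(cell_counts), group = sample_id))

# Pseudobulk by mean: divides by the number of cells, giving fractional values.
pb_mean <- sweep(pb_sum, 2, as.vector(table(sample_id)), "/")
```

Here `pb_sum` contains only whole numbers, whereas `pb_mean` contains values such as 4/3, which count-based models are not designed for.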

ADD REPLY • link 2.2 years ago by ATpoint 88k

I understand that using the sum instead of the average is recommended, and that is what I have done so far. The problem is that, working this way, I have not yet been able to manually produce a decent MDS plot that mirrors the one obtained with the DESeq2 package. So I cannot work out where the error in my normalization is.
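One thing that can be checked by hand is whether the distances behind the MDS plot are computed on raw sums or on library-size-normalized values. The sketch below uses made-up pseudobulk sums and log2 counts per million with classical MDS (cmdscale()). This is an assumption about a reasonable manual workflow, not a reproduction of what DESeq2 does internally.

```r
# Made-up pseudobulk sums (genes x samples). A_2 and B_2 have roughly the same
# composition as A_1 and B_1 but come from far fewer cells.
counts <- cbind(
  A_1 = c(1000, 500, 200, 60, 20),
  A_2 = c(  52,  24,  11,  3,  1),
  B_1 = c( 100, 400, 600, 20, 80),
  B_2 = c(   9,  42,  58,  2,  9)
)
rownames(counts) <- paste0("gene", 1:5)

# Log2 counts per million: removes most of the library-size (cell-number) effect.
log_cpm <- log2(t(t(counts) / colSums(counts)) * 1e6 + 1)

# Euclidean distances between samples, on raw sums and on normalized values.
d_raw <- dist(t(counts))
d_norm <- dist(t(log_cpm))

# Classical MDS on the normalized distances: one row of coordinates per sample.
mds <- cmdscale(d_norm, k = 2)
```

With these toy numbers, the raw distances pair A_2 with B_2 (the two small samples), while the log-CPM distances pair A_2 with A_1. Calling plot(mds, type = "n") followed by text(mds, labels = rownames(mds)) would draw the map.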

ADD REPLY • link 2.2 years ago by Chironex ▴ 50