Hi,
I have a single-cell RNA-seq dataset consisting of 10 samples (5 clusters, 2 days) that I aggregated into a pseudobulk RNA-seq matrix for downstream analysis.
After aggregating the data, I obtained the following matrix (more than 25,000 genes; only ten are shown here):
tab[,1:10]
           Xkr4   Gm1992   Gm19938   Gm37381        Rp1      Sox17   Gm37587     Mrpl15     Lypla1      Tcea1
A_1 100.3983403 2.094996 8.0508725 0.4740132 11.7446134 50.5215017 0.8604795 541.788327 204.612026 866.087779
A_2   3.7714064 0.000000 0.0000000 0.0000000  0.0000000  2.9878470 0.0000000  20.446530  11.856091  36.950256
B_1  26.3195842 1.942903 0.4094375 0.0000000  0.0000000  9.5934363 0.0000000 246.942944 129.289099 470.376530
B_2   4.5693646 1.343612 2.4839935 0.0000000  1.9287930  1.3162788 0.0000000 131.049776  58.809368 202.904801
C_1   3.7508510 1.285488 1.2854877 0.0000000  0.0000000  1.7040939 0.0000000   6.820192   1.231842   8.273079
C_2  65.0411039 5.427881 3.6083771 0.0000000  0.6973232  2.9113581 0.0000000 391.903116 166.712614 553.954149
D_1   0.4635404 0.000000 0.0000000 0.0000000  0.0000000  0.0000000 0.0000000   3.897851   3.142519  10.284732
D_2  21.5049262 1.197321 3.1446334 0.0000000  0.4733959  0.7565153 0.0000000 229.970330 104.761129 346.445247
E_1  66.9866197 2.038161 5.2060244 0.0000000  6.3575404 28.5580183 1.3065041  91.232942  42.103075 127.664432
E_2   0.4813143 0.000000 0.0000000 0.0000000  0.0000000  5.1772589 0.0000000  18.813332   8.889335  13.623546
As you can see, the overall expression level differs between samples, because each sample contains a different number of cells:
Number of cells:
 A_1  A_2  B_1  B_2  C_1  C_2  D_1  D_2  E_1  E_2
1322   56  733  416   16 1004   19  637  226   30
How can I manually normalize the data to obtain values in the range 0 to 1? Thanks.
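For reference, one common way to remove the cell-number effect is to divide each sample by its own total (its library size) and rescale, for example to counts per million (CPM). The Python sketch below is only an illustration under stated assumptions: it uses three samples and four genes copied from the table above, so the library sizes are computed from those four genes rather than from all 25,000+, and the function names `cpm` and `min_max` are made up for this example. The per-gene min-max step is the part that maps values into the 0 to 1 range.

```python
# Toy subset of the pseudobulk matrix from the post (rows = samples, columns = genes).
genes = ["Xkr4", "Mrpl15", "Lypla1", "Tcea1"]
pseudobulk = {
    "A_1": [100.3983403, 541.788327, 204.612026, 866.087779],  # 1322 cells
    "A_2": [3.7714064, 20.446530, 11.856091, 36.950256],       # 56 cells
    "C_1": [3.7508510, 6.820192, 1.231842, 8.273079],          # 16 cells
}


def cpm(values):
    """Divide a sample by its library size and rescale to counts per million."""
    library_size = sum(values)
    return [v / library_size * 1e6 for v in values]


def min_max(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant gene: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Step 1: library-size normalization, one sample at a time.
cpm_by_sample = {sample: cpm(values) for sample, values in pseudobulk.items()}

# Step 2 (optional): squeeze each gene into [0, 1] across samples.
samples = list(cpm_by_sample)
scaled_by_gene = {
    gene: min_max([cpm_by_sample[s][j] for s in samples])
    for j, gene in enumerate(genes)
}

for sample in samples:
    print(sample, [round(x) for x in cpm_by_sample[sample]])
for gene in genes:
    print(gene, [round(x, 3) for x in scaled_by_gene[gene]])
```

Min-max scaling is mostly useful for display purposes such as heatmaps; for distances between samples, library-size-normalized and log-transformed values are the more usual input.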
Hi, thank you for your response. What I am trying to do is build an MDS plot to show the similarity between clusters. Some colleagues told me that the samples must be normalized first; otherwise the MDS plot reflects the number of cells in each sample (the more cells, the higher the expression values). They also told me that when aggregating the data it would be more appropriate to use the function
fun = mean
instead of the sum.
I know the approach that uses the sum rather than the average, because the sum preserves the integer nature and distribution of the counts, which averages do not. See for example the OSCA workflow from Bioconductor: http://bioconductor.org/books/3.15/OSCA.multisample/multi-sample-comparisons.html#creating-pseudo-bulk-samples
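To make the sum-versus-mean point concrete, here is a tiny Python sketch with invented per-cell counts for a single gene (the sample names and numbers are hypothetical, not taken from the dataset above). Summing keeps whole-number counts, which is what count-based tools such as DESeq2 expect as input, whereas averaging produces fractions.

```python
# Hypothetical raw counts of one gene in the individual cells of two samples.
cells = {
    "many_cells": [3, 0, 1, 2, 0, 4],
    "few_cells": [1, 2],
}

# Pseudobulk by sum: stays an integer count, grows with the number of cells.
summed = {sample: sum(counts) for sample, counts in cells.items()}

# Pseudobulk by mean: independent of cell number, but no longer a count.
averaged = {sample: sum(counts) / len(counts) for sample, counts in cells.items()}

print(summed)
print(averaged)
```

The dependence of the sum on cell number is then dealt with afterwards by library-size normalization rather than by averaging.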
I understand that it is recommended to use the sum instead of the average, which is what I have done so far. The problem is that, working this way, I have not yet been able to manually produce a decent MDS plot that mirrors the one obtained with the DESeq2 package, so I cannot work out where the error in my normalization is.
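One way to check whether cell number is what drives the distances is to compare sample-to-sample Euclidean distances before and after library-size normalization plus a log transform, since a distance matrix of this kind is what a classical MDS takes as input. The Python sketch below is an illustration only: it reuses two samples and four genes from the table above, computes library sizes from just those four genes, and the helper names `log_cpm` and `euclidean` as well as the pseudocount of 1 are assumptions of this sketch. It does not reproduce DESeq2's own size-factor estimation or variance-stabilizing transformation.

```python
import math

# Two samples from the post with very different cell numbers (1322 vs 56),
# restricted to four genes: Xkr4, Mrpl15, Lypla1, Tcea1.
a_1 = [100.3983403, 541.788327, 204.612026, 866.087779]
a_2 = [3.7714064, 20.446530, 11.856091, 36.950256]


def log_cpm(values, pseudocount=1.0):
    """Library-size normalize to counts per million, then take log2."""
    library_size = sum(values)
    return [math.log2(v / library_size * 1e6 + pseudocount) for v in values]


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


raw_distance = euclidean(a_1, a_2)
normalized_distance = euclidean(log_cpm(a_1), log_cpm(a_2))

print("raw:", round(raw_distance, 1))
print("log-CPM:", round(normalized_distance, 3))
```

On these four genes the raw distance is dominated by the difference in total signal, while the log-CPM distance is small because the relative composition of the two samples is similar.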