I am currently working on a single-cell data analysis project and am facing a challenge regarding the aggregation of single-cell data into pseudobulks for input into the GSVA software. GSVA only accepts a gene × sample (subject) matrix, so pseudobulk profiles must be created from the single-cell data first. I have come across two different approaches to this aggregation step and I am unsure which one to use.
In a recent paper by Blanchard et al., pseudobulk profiles were aggregated after normalizing and log-transforming the data: the authors first computed normalized gene expression profiles using ACTIONet and then averaged them to obtain cell-type-level aggregated expression profiles per individual. On the other hand, a single-cell tutorial suggests aggregating raw counts first, followed by normalization and log transformation. The order matters because the Gaussian kernel I intend to use in GSVA expects continuous expression values on a logarithmic scale, such as RNA-seq log-CPMs, log-RPKMs, or log-TPMs.
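For concreteness, the first ordering (normalize and log-transform per cell, then average within each sample and cell type) could be sketched in plain scanpy/pandas as below. This is not the ACTIONet implementation from the paper, just an illustration of that ordering, and the "sample" and "cell_type" column names are assumptions about the metadata:

```python
import scanpy as sc

# Sketch of the normalize-first ordering (NOT the ACTIONet implementation):
# per-cell normalization and log transform, then averaging within each
# sample x cell-type combination.
# adata: AnnData object with raw counts in .X; "sample" and "cell_type"
# are assumed column names in adata.obs.
adata_norm = adata.copy()
sc.pp.normalize_total(adata_norm, target_sum=1e4)
sc.pp.log1p(adata_norm)

expr = adata_norm.to_df()  # cells x genes DataFrame of log-normalized values
groups = (
    adata_norm.obs["sample"].astype(str)
    + "_"
    + adata_norm.obs["cell_type"].astype(str)
)
pseudobulk_mean = expr.groupby(groups).mean()  # one row per sample x cell type
```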
I am unsure which approach to take. Should I normalize and log-transform the data before aggregating, or should I aggregate the raw counts first and then normalize? I would greatly appreciate any guidance or insights on this matter.
Thanks!
My pipelines are designed to run in Python, so would this be similar to running decoupler.get_pseudobulk() and then running scanpy.pp.normalize_total() and scanpy.pp.log1p(), using the decoupler and scanpy packages?
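Concretely, I imagine something along these lines. This is only a rough sketch of the aggregate-first ordering; the "sample" and "cell_type" column names and the filtering thresholds are placeholders, not values taken from the tutorial:

```python
import decoupler as dc
import scanpy as sc

# Sketch of the aggregate-first ordering: sum raw counts into pseudobulks,
# then normalize and log-transform the aggregated profiles.
# adata: AnnData with raw counts in .X; "sample" and "cell_type" are
# assumed adata.obs column names.
pdata = dc.get_pseudobulk(
    adata,
    sample_col="sample",     # subject / donor identifier
    groups_col="cell_type",  # one pseudobulk per cell type within each subject
    mode="sum",              # sum raw counts across cells
    min_cells=10,            # placeholder QC thresholds
    min_counts=1000,
)

# Normalization and log transform applied to the aggregated counts,
# yielding log-CPM values suitable for GSVA's Gaussian kernel.
sc.pp.normalize_total(pdata, target_sum=1e6)  # counts per million
sc.pp.log1p(pdata)

# pdata.to_df().T then gives a gene x pseudobulk-sample matrix for GSVA.
```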