I am using Seurat to analyze scRNA-seq data. I used the AggregateExpression() function for pseudobulk analysis, but I then realized that it might give me a biased result, because the aggregated expression may be proportional to the number of cells (cell counts): AggregateExpression() returns the sum of gene expression.
I am analyzing T cells. I found highly expressed genes in CD8 T effector memory and CD8 T exhausted cells (these subtypes make up a large proportion of my T cell data). I suspect this result comes from the cell counts. Is this right? If so, how should I handle the pseudobulk?
- I am worried that celltype1 and celltype2 have similar expression at the single-cell level but very different expression at the bulk level because of cell counts rather than any other reason.
Thanks a lot for reading my question.
Hi, Bastien Herve. Thank you for answering me again.
I did the preprocessing (normalization, etc.) on all cells and genes. I also ran the pseudobulk analysis on all cells and genes. After that, I selected the genes I am interested in.
Did you mean that this pseudobulk result is not biased by cell counts if I use all cells and genes in the normalization process?
The question is, what do you want to use your pseudobulk matrix for?
The aggregation function is not biased by the number of cells; it adds up the gene counts of the x cells in each category (I guess cell type in your context).
For example, let's say you are interested in 2 cell types, T cells and B cells.
With a single-cell matrix of raw counts like this:
After aggregation you will end up with this matrix; whatever the number of cells, the numbers will just grow accordingly:
Then you normalize by the total number of counts in each aggregated cell (meta cell). From now on, you treat each cell type as a single "cell".
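As a minimal sketch of the aggregate-then-normalize idea above, here is a plain Python illustration (not Seurat code). The gene names, cell groupings, and counts are all made up, and `aggregate` and `normalize` are hypothetical helper names. It sums raw counts per cell type, divides by each meta cell's total, and checks that doubling the number of T cells doubles the sums but leaves the normalized proportions unchanged.

```python
# Hypothetical raw counts: one dict per cell, keyed by gene.
# All names and numbers here are invented for illustration.
t_cells = [
    {"geneA": 2, "geneB": 1, "geneC": 0},
    {"geneA": 3, "geneB": 0, "geneC": 1},
]
b_cells = [
    {"geneA": 0, "geneB": 4, "geneC": 2},
]


def aggregate(cells):
    """Sum raw counts gene by gene over all cells of one type."""
    totals = {}
    for cell in cells:
        for gene, count in cell.items():
            totals[gene] = totals.get(gene, 0) + count
    return totals


def normalize(meta_cell):
    """Divide each gene's summed count by the meta cell's total count."""
    total = sum(meta_cell.values())
    return {gene: count / total for gene, count in meta_cell.items()}


t_sum = aggregate(t_cells)              # sums over 2 T cells
t_sum_doubled = aggregate(t_cells * 2)  # same cells, twice as many of them

t_norm = normalize(t_sum)
t_norm_doubled = normalize(t_sum_doubled)
b_norm = normalize(aggregate(b_cells))

print("T sums:", t_sum, "doubled:", t_sum_doubled)
print("Proportions unchanged by cell number:", t_norm == t_norm_doubled)
print("B proportions:", b_norm)
```

Real pipelines typically also apply a scale factor and a log transform after this division; that step is left out here to keep the cell-number argument visible.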
Note that if you do the aggregation of your counts on a subset of genes (the features parameter of AggregateExpression), you will not have the correct number of total counts in your meta cell, hence the issue I raised. Do your aggregation on all genes, and filter your genes of interest afterwards. You can remove cells, but not genes.
Thanks for your detailed explanation.
I used pseudobulk because I want to apply a gene set from bulk-seq data to my scRNA-seq data. I think you are saying that cell counts are meaningful too, so I don't need to worry about them.
I am worried about an issue like the one below.
At the single-cell level, expression of geneA is higher in CD8_naive than in CD8_exh. At the bulk level, however, expression of geneA is higher in CD8_exh than in CD8_naive. So if I check the expression of geneA in bulk, I might conclude that geneA is high in CD8_exh even though it is high in CD8_naive at the single-cell level.
Not if you normalize your new "bulk cells" (aka CD8_exh and CD8_naïve) by their total transcript counts. This is what AggregateExpression is doing after the aggregation.
CD8_exh total counts = 38
CD8_naïve total counts = 14.5
You divide each column by its total number of counts:
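A small sketch of that division in plain Python, using the two totals quoted above. The summed geneA counts (6 and 4) are hypothetical values chosen only for illustration; they are not taken from the example data.

```python
# Total counts per meta cell, as quoted in the thread.
total_counts = {"CD8_exh": 38, "CD8_naive": 14.5}

# Hypothetical summed geneA counts per meta cell (invented for illustration).
gene_a_sum = {"CD8_exh": 6, "CD8_naive": 4}

# Divide each meta cell's geneA sum by that meta cell's total counts.
gene_a_fraction = {
    cell_type: gene_a_sum[cell_type] / total_counts[cell_type]
    for cell_type in total_counts
}

for cell_type, fraction in gene_a_fraction.items():
    print(f"{cell_type}: sum={gene_a_sum[cell_type]}, fraction={fraction:.3f}")
```

With these made-up numbers the raw geneA sum is larger in CD8_exh, but after dividing by total counts the geneA fraction is larger in CD8_naive, so the larger meta cell does not automatically show higher expression.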
I never said that; counts should be integers.
I'm sorry for the late response. I understand what you said about 'AggregateExpression'.
But I think my intention was not conveyed well.
Can I interpret the pseudobulk result as applying the number of cell counts and transcripts? (e.g. CD8_exh: cell count 4, transcript count 38.) In this pseudobulk situation, CD8_exh has higher expression of 'geneA' than CD8_naive, right?
Thanks a lot again. I'm sorry that I didn't understand well and keep asking you questions.
What do you mean by "applying the number of cell counts and transcripts"? You can normalize by the number of cells if you want, rather than by the number of transcript counts. But you will lose the information about cell size (the number of transcripts a cell produces).
In the above example
The reason I keep mentioning cell numbers is that I'm afraid I'll misinterpret the geneA expression.
I got a gene set from bulk-seq data, and its enrichment score is higher in CD8_exh than in CD8_naive. I'm afraid this result arises because the cell count of CD8_exh is higher than that of CD8_naive. (I'm computing the gene-set enrichment score on the pseudobulk result.)
I don't know how to explain myself better. Please read again my comment on normalization by the total number of transcripts in each meta cell, 3 comments above.
I will look for more information and read your explanation again and again. I'm sorry for the frustration I caused... Thanks very much.