Pseudobulk analysis using AggregateExpression()
4 weeks ago
Jeyong • 0

I am using Seurat to analyze scRNA-seq data, and I used AggregateExpression() for pseudobulk analysis. It then occurred to me that this function might give a biased result: AggregateExpression() returns the sum of gene expression, so the result may be proportional to the number of cells in each group.

I am analyzing T cells. I found highly expressed genes in CD8 T effector memory and CD8 T exhausted cells (these subtypes make up a large proportion of my T cell data). I suspect this result comes from the cell numbers. Is this right? If so, how should I handle the pseudobulk data?

  • I am worried that celltype1 and celltype2 have similar expression at the single-cell level but a large difference in expression at the bulk level, caused by the number of cells rather than by any other reason.

Thanks a lot for reading my question.

seurat pseudobulk sum aggregateexpression • 823 views
4 weeks ago

I opened an issue on the Seurat GitHub about this a while ago, but I did not get any answer.

Normalization is done within each cell: each gene's count is scaled by that cell's total counts. What matters is therefore not the number of cells but the set of genes you pass to AggregateExpression().

I would advise you to call the function on all genes, then filter for your genes of interest.


Hi, Bastien Herve. Thank you for answering me again.

I did the preprocessing (normalization, etc.) on all cells and genes. I also ran the pseudobulk analysis on all cells and genes. After that, I selected the genes I am interested in.

Do you mean that the pseudobulk result is not biased by the number of cells if I use all cells and genes in the normalization step?


The question is, what do you want to use your pseudobulk matrix for?

The aggregation function is not biased by the number of cells; it adds up the gene counts of the x cells in each category (cell type, I guess, in your context).

For example, say you are interested in two cell types, T cells and B cells.

With a single-cell matrix of raw counts like this:

        Tcell1 | Bcell1 | Bcell2 | Tcell2 | Tcell3 | Tcell4
geneA      2   |    5   |    2   |    0   |    4   |    10
geneB      0   |    10  |    5   |    2   |    4   |    2
geneC      6   |    6   |    2   |    4   |    7   |    6

After aggregation you end up with the matrix below. Whatever the number of cells, the sums simply grow accordingly:

        Tcell | Bcell
geneA      16  |   7
geneB      8   |   15
geneC      23  |   8

Then you normalize by the total number of counts in each aggregated cell (meta cell): 16 + 8 + 23 = 47 for Tcell and 7 + 15 + 8 = 30 for Bcell. From this point on, you treat each cell type as a single "cell".

        Tcell | Bcell
geneA   0.34  | 0.23
geneB   0.17  | 0.5
geneC   0.49  | 0.27
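
The two steps (sum per cell type, then divide each column by its total) can be sketched in plain Python. This only illustrates the arithmetic on the toy matrix; it is not Seurat's implementation, and the names cell_type, aggregate and normalize are made up for the example.

```python
# Toy raw counts from the example: gene -> counts per cell, in column order.
cells = ["Tcell1", "Bcell1", "Bcell2", "Tcell2", "Tcell3", "Tcell4"]
counts = {
    "geneA": [2, 5, 2, 0, 4, 10],
    "geneB": [0, 10, 5, 2, 4, 2],
    "geneC": [6, 6, 2, 4, 7, 6],
}

def cell_type(cell_name):
    """Hypothetical helper: strip the trailing digits ('Tcell3' -> 'Tcell')."""
    return cell_name.rstrip("0123456789")

def aggregate(counts, cells):
    """Sum raw counts over all cells of the same type (pseudobulk)."""
    pseudobulk = {}
    for gene, row in counts.items():
        for cell, value in zip(cells, row):
            group = pseudobulk.setdefault(cell_type(cell), {})
            group[gene] = group.get(gene, 0) + value
    return pseudobulk

def normalize(pseudobulk):
    """Divide each pseudobulk column by its own total count."""
    result = {}
    for group, genes in pseudobulk.items():
        total = sum(genes.values())
        result[group] = {gene: value / total for gene, value in genes.items()}
    return result

pseudobulk = aggregate(counts, cells)
fractions = normalize(pseudobulk)

for group in pseudobulk:
    rounded = {gene: round(f, 2) for gene, f in fractions[group].items()}
    print(group, pseudobulk[group], rounded)
```

Each normalized column sums to 1, which is a quick sanity check on any table of this kind.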

Note that if you aggregate your counts on a subset of genes (the features parameter of AggregateExpression()), the meta cell will not have the correct total count, which is why I raised my issue. Aggregate on all genes and filter for your genes of interest afterwards. You can remove cells, but not genes.
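
A small sketch of why the order matters, using the summed Tcell counts from the toy example (16, 8, 23). It assumes the simple "divide by the column total" normalization described in this thread; the variable names are illustrative only.

```python
# Summed counts for one pseudobulk column (Tcell in the toy example).
tcell = {"geneA": 16, "geneB": 8, "geneC": 23}

def fractions(column):
    """Divide every count by the column total."""
    total = sum(column.values())
    return {gene: value / total for gene, value in column.items()}

genes_of_interest = ["geneA", "geneB"]

# Option 1: normalize on all genes, then keep the genes of interest.
all_genes = fractions(tcell)
all_then_filter = {gene: all_genes[gene] for gene in genes_of_interest}

# Option 2: drop genes first, then normalize. The total shrinks from 47 to 24,
# so the same raw count now looks like a larger share of the column.
subset = {gene: tcell[gene] for gene in genes_of_interest}
filter_then_normalize = fractions(subset)

print("geneA, normalize then filter:", round(all_then_filter["geneA"], 2))
print("geneA, filter then normalize:", round(filter_then_normalize["geneA"], 2))
```

The raw count of geneA is identical in both cases; only the denominator changes.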


Thanks for your detailed explanation.

I used pseudobulk because I want to apply a gene set from bulk-seq data to my scRNA-seq data. I think you say cell counts have means also. So I don't need to be worried about that.

I am worried about a case like the one below. [image]

At the single-cell level, expression of geneA is higher in CD8_naive than in CD8_exh. At the bulk level, however, expression of geneA is higher in CD8_exh than in CD8_naive. So if I check geneA expression in the bulk data, I might conclude that geneA is high in CD8_exh even though it is high in CD8_naive at the single-cell level.


Not if you normalize your new "bulk cells" (i.e. CD8_exh and CD8_naïve) by their total transcript counts. This is what AggregateExpression() does after the aggregation:

        CD8_exh | CD8_naïve
geneA   12      | 8
geneB   26      | 6.5

CD8_exh total counts = 38

CD8_naïve total counts = 14.5

You divide each column by its total number of counts:

        CD8_exh | CD8_naïve
geneA   0.32    | 0.55
geneB   0.68    | 0.45
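
As a sketch of why the number of cells cancels out, the snippet below takes the summed counts from the table above, normalizes each column by its total, and then repeats the normalization after multiplying one column by 10 (a hypothetical stand-in for "ten times as many identical cells"). The scale helper is an assumption made for the illustration, not part of any library.

```python
# Summed counts per pseudobulk column, copied from the table above.
pseudobulk = {
    "CD8_exh":   {"geneA": 12, "geneB": 26},
    "CD8_naive": {"geneA": 8,  "geneB": 6.5},
}

def fractions(column):
    """Divide every count by the column total."""
    total = sum(column.values())
    return {gene: value / total for gene, value in column.items()}

def scale(column, factor):
    """Hypothetical: pretend the group had `factor` times as many identical cells."""
    return {gene: value * factor for gene, value in column.items()}

exh = fractions(pseudobulk["CD8_exh"])
naive = fractions(pseudobulk["CD8_naive"])
exh_10x = fractions(scale(pseudobulk["CD8_exh"], 10))

# Raw sums: geneA looks higher in CD8_exh.
print(pseudobulk["CD8_exh"]["geneA"] > pseudobulk["CD8_naive"]["geneA"])
# After normalization: geneA is higher in CD8_naive.
print(exh["geneA"] < naive["geneA"])
# Ten times more cells leaves the normalized value unchanged.
print(abs(exh_10x["geneA"] - exh["geneA"]) < 1e-12)
```

All three checks print True: the raw sum grows with the number of cells, but the normalized fraction does not.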

"I think you say cell counts have means also"

I never said that; counts should be integers.


I'm sorry for the late response. I understand what you said about AggregateExpression().

But I think my intention was not conveyed well.

  • "I think you say cell counts have means also": by "cell counts" I meant the number of single cells, not transcripts. In this situation there are 4 CD8_exh cells and 1 CD8_naive cell.

Can I interpret the pseudobulk result as applying the number of cell counts and transcripts (e.g. CD8_exh: cell count 4, transcript count 38)? In this pseudobulk situation, CD8_exh has higher expression of geneA than CD8_naive, right?

Thanks a lot again. I'm sorry that I didn't understand well and keep asking you questions.


What do you mean by applying the number of cell counts and transcripts? You can normalize by the number of cells if you want, rather than by the number of transcript counts. But you will lose the information about cell size (the number of transcripts a cell produces).
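
The two normalizations can be compared side by side on the CD8 example. This is a sketch under stated assumptions: the summed counts come from the table earlier in the thread, the cell numbers (4 CD8_exh, 1 CD8_naive) are the ones given by the asker, and the function names are invented for the illustration.

```python
# Summed counts and cell numbers from the CD8 example in this thread.
sums = {
    "CD8_exh":   {"geneA": 12, "geneB": 26},
    "CD8_naive": {"geneA": 8,  "geneB": 6.5},
}
n_cells = {"CD8_exh": 4, "CD8_naive": 1}

def per_cell_mean(column, n):
    """Normalize by the number of cells: average count per cell."""
    return {gene: value / n for gene, value in column.items()}

def per_total_counts(column):
    """Normalize by the number of transcripts: share of the column total."""
    total = sum(column.values())
    return {gene: value / total for gene, value in column.items()}

by_cells = {g: per_cell_mean(col, n_cells[g]) for g, col in sums.items()}
by_counts = {g: per_total_counts(col) for g, col in sums.items()}

for group in sums:
    print(
        group,
        "| geneA mean per cell:", by_cells[group]["geneA"],
        "| geneA share of counts:", round(by_counts[group]["geneA"], 2),
    )
```

In this toy case both normalizations rank geneA higher in CD8_naive than in CD8_exh, unlike the raw sums (12 versus 8).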


In the example above:

  • Number of cells: geneA expression in CD8_exh is the sum over 4 CD8_exh cells, and geneA expression in CD8_naive is the sum over 1 CD8_naive cell. So I think the number of cells affects gene expression in pseudobulk.
  • Number of transcripts: as you said, the transcript counts affect the normalization step in pseudobulk.

The reason I keep mentioning cell numbers is that I'm afraid I'll misinterpret the geneA expression.

  • geneA expression at the single-cell level: CD8_exh < CD8_naive
  • geneA expression at the bulk level: CD8_exh > CD8_naive

I got a gene set from bulk-seq data, and its enrichment score is higher in CD8_exh than in CD8_naive. I'm afraid this result arises because there are more CD8_exh cells than CD8_naive cells. (I'm computing the gene-set enrichment score on the pseudobulk result.)


I don't know how to explain it better. Please read again my comment, three comments above, on normalization by the total number of transcripts in each meta cell.


I will look for more information and read your explanation again and again. I'm sorry for the frustration I caused. Thank you very much.
