Before doing pseudobulk, should we get rid of genes that are only expressed in a small number of cells?
4 months ago
mropri ▴ 160

Hi,

I asked a question about pseudobulk normalization earlier. Since then I have run differential expression between two conditions for my cell type of interest. I aggregated the counts for each gene across all cells per sample and then ran differential expression with DESeq2. Once I had the pseudobulk matrix, I filtered out genes where more than 90% of the values were less than 5.

My results included genes like PTPRC and others with a log2 fold change of 2 or greater. But when I look at how many cells have counts greater than 0 for PTPRC (or for other genes with a log2 fold change of 2 or greater), the number is small compared to the total cell count (as shown in the image). So this got me wondering: how relevant could this be biologically? And would it be better to remove these genes by setting a threshold, so that a gene has to be present in a minimum number of cells before pseudobulking?

I ask because small changes can come out as significant when counts are low (for example, a gene at 3 counts in the control sample vs. 6 counts in disease), whereas for genes expressed at a higher level, small or even medium-sized changes in counts will not be as significant or have as large a log fold change (for example, 2000 counts in control vs. 3500 counts in disease). The gene expressed at higher counts might have more biological relevance than a gene going from 3 to 6 counts.

I would like some guidance on this, and on whether removing these genes by some criterion is worthwhile. I know each biological question and dataset is different, but I am just trying to get general guidance. Lastly, in the image, the first two rows are control samples, while the rest are disease samples. Appreciate any help.

[image]
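To make the two count examples in the question concrete, here is a small sketch of the arithmetic. The counts (3 vs. 6, 2000 vs. 3500) come from the question. The noise estimate is an assumption: it treats counts as pure Poisson samples, which understates real biological variability, and it is not how DESeq2 models dispersion.

```python
import math

def log2_fold_change(control, disease):
    """Plain log2 ratio of two counts (no shrinkage, no normalization)."""
    return math.log2(disease / control)

def poisson_relative_noise(count):
    """Coefficient of variation of a Poisson count: sqrt(n) / n = 1 / sqrt(n).
    Assumption: pure sampling noise only, used here just for intuition."""
    return 1.0 / math.sqrt(count)

low_lfc = log2_fold_change(3, 6)         # doubling at low counts
high_lfc = log2_fold_change(2000, 3500)  # 1.75x at high counts

low_noise = poisson_relative_noise(3)
high_noise = poisson_relative_noise(2000)

print(f"3 -> 6:       log2FC = {low_lfc:.3f}, relative noise ~ {low_noise:.2f}")
print(f"2000 -> 3500: log2FC = {high_lfc:.3f}, relative noise ~ {high_noise:.3f}")
```

The low-count gene has the larger fold change (1.0 vs. roughly 0.81), but its counts are also far noisier relative to their size, which is why a large log2 fold change on its own says little about how well supported the change is.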

DESeq2 scRNAseq Pseudobulk • 830 views
4 months ago

That depends partly on the biology of the experiment you are working with. PTPRC is CD45, and cells positive for that marker could indicate a small number of immune cells in your samples. Obviously, if this is not a possibility in your experiment, then you can think about alternative reasons for the signal.


Thank you for the help. I am confident these cells are not immune cells, and this is bolstered by the small number of cells expressing any counts for PTPRC or CD34 (which I have not shown). I think in this case it would be good, before pseudobulking, to keep only genes that are expressed in 10 or maybe even 20 percent of cells, to highlight the ones that are more prominent within this cell type. This should help find differentially expressed genes that are more biologically relevant.


It's important to note that DESeq2 doesn't need you to prefilter any genes and it's generally better to leave stuff in and then subsequently filter the results. You can create a filter after differential expression testing for specific cell numbers / LFC etc.
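A rough sketch of what such a post-hoc filter could look like. Everything here is illustrative: the gene names, counts, and result values are made up, and the thresholds (20% of cells, padj < 0.05, |log2FC| >= 1) are assumptions to adjust. The `log2FoldChange` and `padj` keys mirror DESeq2's result column names; in practice the results would be exported from R rather than typed in.

```python
# Hypothetical toy data: raw counts per cell for each gene (one entry per cell).
cell_counts = {
    "GENE_A": [0, 0, 0, 0, 0, 0, 0, 0, 0, 3],  # detected in 1 of 10 cells
    "GENE_B": [5, 2, 0, 7, 1, 3, 0, 4, 6, 2],  # detected in 8 of 10 cells
    "GENE_C": [0, 1, 0, 0, 2, 0, 1, 0, 0, 0],  # detected in 3 of 10 cells
}

# Hypothetical differential expression results for the same genes.
de_results = [
    {"gene": "GENE_A", "log2FoldChange": 2.4, "padj": 0.001},
    {"gene": "GENE_B", "log2FoldChange": 1.3, "padj": 0.010},
    {"gene": "GENE_C", "log2FoldChange": -1.1, "padj": 0.200},
]

def fraction_expressing(counts):
    """Fraction of cells with a nonzero count for one gene."""
    return sum(1 for c in counts if c > 0) / len(counts)

def filter_results(results, counts_by_gene, min_fraction, max_padj=0.05, min_abs_lfc=1.0):
    """Keep genes that pass the detection-rate, padj, and |log2FC| thresholds."""
    kept = []
    for row in results:
        frac = fraction_expressing(counts_by_gene[row["gene"]])
        if (frac >= min_fraction
                and row["padj"] < max_padj
                and abs(row["log2FoldChange"]) >= min_abs_lfc):
            kept.append({**row, "fraction_expressing": frac})
    return kept

kept = filter_results(de_results, cell_counts, min_fraction=0.20)
print([row["gene"] for row in kept])
```

With these toy numbers, GENE_A is dropped for being detected in too few cells despite its large fold change, GENE_C is dropped on padj, and only GENE_B survives. Because the test was run on all genes, the thresholds can be changed afterwards without rerunning DESeq2.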


This is very helpful. I can run DESeq2 on all genes and afterwards filter out the ones that only appear in a small percentage of cells. Then I can run GO or GSEA analysis on genes that are both differentially expressed and expressed in at least a certain percentage of cells. Thank you!


The question I have now is this: if I run differential expression on all genes and then filter, versus filtering genes first and then running differential expression, the log2 fold changes and p-values will differ because of the different size factors, dispersion estimates, and multiple hypothesis testing correction. Which would you recommend: differential expression on all genes followed by filtering, or filtering first and then differential expression?
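The multiple-testing part of that difference is easy to see in isolation. Below is a minimal sketch of the Benjamini-Hochberg adjustment applied once to a full set of p-values and once to a subset. The p-values are invented, and the assumption that exactly the first four genes pass an expression filter is purely illustrative. This only isolates the correction step; it does not reproduce the changes in size factors or dispersion estimates.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for eight genes.
raw_p = [0.001, 0.01, 0.03, 0.04, 0.2, 0.5, 0.7, 0.9]

# Case 1: test all genes, adjust across all eight.
padj_all = benjamini_hochberg(raw_p)

# Case 2: assume only the first four genes pass a pre-filter, adjust across four.
padj_filtered = benjamini_hochberg(raw_p[:4])

n_sig_all = sum(p < 0.05 for p in padj_all)
n_sig_filtered = sum(p < 0.05 for p in padj_filtered)
print(n_sig_all, n_sig_filtered)
```

The same raw p-values give fewer significant genes when adjusted across the larger set, because each p-value is scaled by the number of tests. Note that DESeq2's `results()` already applies independent filtering on mean normalized counts by default, so part of this trade-off is handled even without a manual pre-filter.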


You're still selecting the top N genes in the end. Does it matter that much if you get 2353 differential genes instead of 2536? The top 50/100/200 genes will still be similar. If you're doing something like GSEA, the enrichment is still based on a continuous statistic and permutation. My advice would be to run one form of the analysis and look at the kind of data you get out at the end. Does it make sense? Are you able to find significant differences between the groups, and do they seem to have a biological basis? Then you can re-evaluate. The good thing about these methods is that they don't take a long time to run.
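One quick way to check this on your own data is to compare the top N genes from the two runs. A minimal sketch, assuming each run has been reduced to a list of gene names ordered from most to least significant; the gene names and orderings below are placeholders, not real results.

```python
def top_n_overlap(ranking_a, ranking_b, n):
    """Fraction of the top-n genes shared between two ranked gene lists."""
    top_a = set(ranking_a[:n])
    top_b = set(ranking_b[:n])
    return len(top_a & top_b) / n

# Hypothetical rankings: all genes tested vs. genes pre-filtered before testing.
ranked_all_genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
ranked_prefiltered = ["G2", "G1", "G4", "G3", "G7", "G5"]

overlap_top3 = top_n_overlap(ranked_all_genes, ranked_prefiltered, 3)
overlap_top5 = top_n_overlap(ranked_all_genes, ranked_prefiltered, 5)
print(f"top 3: {overlap_top3:.2f}, top 5: {overlap_top5:.2f}")
```

If the overlap stays high at the N you actually plan to use downstream, the choice between filtering before or after testing is unlikely to change the conclusions much.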


Sounds good, thank you for your help

