Hi All,
I have been using DESeq2 to perform differential gene expression analysis between two groups but I do not have house keeping genes in the data set. Is it still possible to do DESeq calculations without the house keeping genes? and I do not have all the genes expressed across all subjects/samples.
I have the following error message calculating size factors:
dds <- DESeq(dds)
estimating size factors
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, :
every gene contains at least one zero, cannot compute log geometric means
Any help would be appreciated.
Thank you.
Gajender
DESeq2, by default, does not use or need housekeeping genes.
Bizarrely, I have come across this error before, but it's very rare. It occurs literally when each of your transcripts contains at least 1 zero. To detect these in a dataset, I wrote the following complex one-liner in R:
First make sure that you remove transcripts that just don't have any counts:
Then:
Explanation (the logic can be tough):
[1], First convert the entire data-frame to TRUE or FALSE for 0 or non-zero
[2], It then applies the table function per row, which gives TRUE and FALSE tallies per gene
[3], It then checks if any tally equals the total number of samples - as we've already eliminated genes with all 0 values, the only condition that can meet ncol(MyData) is the FALSE (non-zero) condition. This produces a further TRUE or FALSE for each gene and condition.
[4], Finally, if any TRUE values are present, then we know that at least 1 row has all non-zeros, and therefore we can proceed
I'd say that's a little long for a one-liner ;-)
I have a 2 liner, if it is of interest. Lets say df is your counts dataframe and your entire expressed gene across samples. Remove based on rowSums and filter.
Just because you can proceed with only a handful of genes with no zero values does not mean you should. The poster needs to know if the problem is a few dropped samples, or a whole bunch. There might be simpler ways to assess the QC of the fastq or the alignment. Even eyeballing the size of the fastq files might tell the poster what they need to know.
I agree to that but OP need to clarify if the QC reports of fastq was problematic or not or let's say the alignment. Problems can be a lot but if the OP is starting with exploratory analysis on counts then filtering is not to be ruled out. Only if it reduces the expression to a lot then it is big reason to worry else not. This is fine provided OP clarifies the FASTQC and alignment reports. Since it was based on DESeq2 and genes expressed so we suggested such
Thank you very much for sharing the script. That's a great help.
Another question - is it possible that some genes are turned on exclusively in healthy but not in diseased subjects and vice-versa. Is there a way to look for such genes as well?
Yes, I have seen that in breast cancer where, for example, a non-coding RNA can show very high expression in cancer but no (or virtually nil) expression in healthy controls. These would appear through differential expression analysis and have very low statistically significant adjusted P values and high absolute log [base 2] fold changes.
For sure, I see that! Thank you for answering my question.
Great - have a nice day / evening
It is a scenario which happens for non-coding RNAs but also true for protein-coding mRNA transcripts as well if the library is not done well or library enrichment is not in accordance with the kind of regions one is looking for. Another problem is are you quantifying across whole genome or only transcriptome? Likely scenario is for the whole genome, you will have a lot of genes/transcripts that will not have any expression across most samples. It is also seen, at least I have seen when I quantify for the only transcriptome and still, there are dropouts for 5-8% of genes across all samples which I filter out. That can be not only biological but also technological and technical dropouts.