Hi everyone,
We performed a large bulk RNAseq experiment on an iPSC - line. Cells were treated and harvested in 96 well plates and automatically processed following the SmartSeq2 protocol by a core unit. We have roughly 60 conditions randomly assigned throughout the plate and multiple control wells / plate in a blocked experimental design (Each sample only appears once at the same position / plate). We have four replicates/sample.
Pre-processing: bcl2fastq -> STAR -> HTSeq. FASTQC shows high quality reads for all samples.
For the analysis I only include coding genes, remove genes with rowsum < 500 (with a total of ~ 360 samples) which leads to a matrix with ~14k genes. I use DESeq2 for differential expression analysis. I include surrogate variables from SVA as covariates to the model (SVA, n_sv = 4), to get rid of an initially present batch effect.
After running DESeq2 I investigate the MA-plots and make the observation that conditions with increased cell death (cells were counted in 96 well plates before harvested fro SmartSeq) show an abnormal MA plot:
Other samples with unchanged viability show a balanced, expected MA plot
I took the following steps to investigate the situation:
I extracted the signature from the MA-plots by "manual gating" and could confirm the the genes that appear as the upregulated population are mostly the same across different apoptotic conditions. Pathway annotation shows the "Apoptosis" hallmark pathway from msigdb significantly enriched in this gene population.
I thought about possible contamination, but the apoptotic conditions match what we expect and are randomly distributed over the 96well plates, which makes a contamination extremely unlikely.
I looked at the sample to sample correlation plots. interestingly I do see slight deviations of the "manually extracted" group of consistently upreglated genes, but this phenomenon does not correlate with viability (or the presence of the abnormal MA plot). Batch correction nearly completely gets rid of this observation.
I ran DESeq2 without surrogate variables which also leads to a similar MA-plot pattern'
Some additional thoughts:
- The cell line we are using can be 'activated' and exist in different states. Could a different state of activity in a cell induce such a MA-plot pattern?
- Considering the fact that this is an iPSC-line, could potential de-differentiation account for such an MA-plot pattern?
My question:
Is there a valid biological explanation for this observation?
What happens when you remove the SVA covariates?
I did rund DESeq2 without surrogate variables, the MA-plot abnormalities were still present.
Exvessive logFC with low baseMean is usually due to poor or no prefiltering. I personally find rowsum/rowmean-based filtering suboptimal. Do it group-aware, see for example https://support.bioconductor.org/p/9152700/#9152701
Thanks for your input! I will definetly look into that. But as you can see from the MA plot, not only low baseMean genes are affected here. Hence I don't think I'll be able to get rid of this just by adjusting the pre-filter.