I know similar topics have been posted before (rna seq replicates not clustering). However, I want to prompt some discussion on this topic with the plots and examples given here.
I have triplicates for three conditions (EV, Mut, and WT) in mouse cells. I followed the standard workflow for analyzing RNA-seq data: alignment (STAR) -> raw counts (HTSeq with intersection-strict mode) -> TMM normalization in edgeR -> VSN to 'dissociate the dependence of the variance on the mean intensity' -> use the VSN values for exploratory analysis (QC).
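For concreteness, that normalization/transformation step looks roughly like the sketch below. The counts file name, sample layout, and the exact way TMM scaling is combined with VSN are placeholder assumptions, not my exact code:

```r
## Rough sketch of the TMM + VSN QC step described above; the counts file,
## sample order, and condition labels are placeholders.
library(edgeR)
library(vsn)

counts <- as.matrix(read.delim("htseq_counts.txt", row.names = 1))  # genes x samples
condition <- factor(rep(c("EV", "Mut", "WT"), each = 3))

dge <- DGEList(counts = counts, group = condition)
dge <- calcNormFactors(dge, method = "TMM")            # TMM normalization

## Variance-stabilizing transform of the TMM-scaled values, for QC only.
## justvsn() returns glog2-scale values whose variance should no longer
## depend strongly on the mean.
norm_counts <- cpm(dge, normalized.lib.sizes = TRUE)   # TMM-scaled CPM
vsn_mat <- justvsn(norm_counts)
meanSdPlot(vsn_mat)                                    # check the mean-SD trend is flat
```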
I did both PCA and pairwise-correlation clustering to find 'outliers'.
- PCA plot:
- pairwise-correlation heatmap:
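Roughly, plots like these can be produced as in the sketch below, continuing from the `vsn_mat` and `condition` objects above. The top-500 variable genes and Spearman correlation are illustrative choices, not necessarily what I used:

```r
## Sketch of the exploratory plots; assumes vsn_mat and condition from above.
library(pheatmap)

## PCA on samples (rows after transposing), restricted to the most variable genes
rv  <- apply(vsn_mat, 1, var)
top <- order(rv, decreasing = TRUE)[seq_len(min(500, length(rv)))]
pca <- prcomp(t(vsn_mat[top, ]), center = TRUE, scale. = FALSE)
pct <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

plot(pca$x[, 1], pca$x[, 2],
     col = as.integer(condition), pch = 19,
     xlab = paste0("PC1 (", pct[1], "%)"),
     ylab = paste0("PC2 (", pct[2], "%)"))
text(pca$x[, 1], pca$x[, 2], labels = colnames(vsn_mat), pos = 3, cex = 0.7)

## Pairwise sample-sample correlation heatmap
cor_mat <- cor(vsn_mat, method = "spearman")
pheatmap(cor_mat,
         annotation_col = data.frame(condition, row.names = colnames(vsn_mat)))
```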
From the PCA plot, I can tell that one sample from each condition (wt_rep2, ev_rep1, and mut_rep1) is an obvious outlier. The pairwise-correlation heatmap confirms those outliers. However, the remaining samples do not cluster clearly by condition. Yes, these data are thoroughly messed up.
Now here are three options for how I could proceed (with questions):
- Delete the outliers and use the remaining samples for DE analysis. Since I only have 3 replicates per condition, can I really run a DE analysis with just two samples per group?
- Treat the outlier/non-outlier split as a batch effect and incorporate it as a covariate in the design matrix (~ condition + batch), even though I have no idea what condition-independent factor could have caused these samples to be outliers. Is this valid? (See the sketch after this list.)
- Ignore the QC and go ahead with the DE analysis. But then what is the point of doing exploratory QC at all?
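For option 2, the design would look something like the sketch below. To be clear, the `batch` factor here is made up, simply labelling the three flagged outliers, since no real batch variable has been recorded:

```r
## Sketch only: 'batch' is a hypothetical grouping of the three outliers
## (ev_rep1, mut_rep1, wt_rep2) versus the rest, not a recorded variable.
batch <- factor(c("B", "A", "A",   # EV:  rep1 flagged as outlier
                  "B", "A", "A",   # Mut: rep1 flagged as outlier
                  "A", "B", "A"))  # WT:  rep2 flagged as outlier
design <- model.matrix(~ batch + condition)

## edgeR route, continuing from the DGEList above
dge <- estimateDisp(dge, design)
fit <- glmQLFit(dge, design)
qlf <- glmQLFTest(fit, coef = "conditionMut")  # Mut vs the reference level (EV)
topTags(qlf)
```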
Just to help others: is it likely that there is a batch effect for the samples to the left of the PCA bi-plot? How does your PCA bi-plot look if you just use 'my' code: A: PCA plot from read count matrix from RNA-Seq?
PC1-5 scatterplots: https://ibb.co/nB1Bvp
PC1-3 bi-plots: https://ibb.co/iLKaMU
I would not go by CPM. I would analyse the samples in DESeq2 and produce regularised log (rlog) counts. Prior to normalisation, I would remove transcripts whose mean raw count is <10 or <15. Then regenerate the plots.
If you proceed to the DE analysis with the current samples, then I imagine that many transcripts would be flagged by Cook's distance outlier filtering because of their large variance. Your PC1 explains >40% of the variance, which is quite substantial.
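Roughly, that would look like the sketch below (object names are placeholders carried over from the sketches above; the <10 cutoff and `blind = TRUE` are just illustrative choices):

```r
## Sketch of the DESeq2 / rlog route; 'counts' and 'condition' stand in for
## the poster's actual matrix and sample table.
library(DESeq2)

coldata <- data.frame(condition = condition, row.names = colnames(counts))
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ condition)

## Remove transcripts with a low mean raw count before transforming
dds <- dds[rowMeans(counts(dds)) >= 10, ]

## Regularised log transform for the exploratory plots
rld <- rlog(dds, blind = TRUE)
plotPCA(rld, intgroup = "condition")

## After the DE step, Cook's distances can be inspected per sample
dds   <- DESeq(dds)
cooks <- assays(dds)[["cooks"]]
boxplot(log10(cooks), las = 2, main = "Cook's distances per sample")
```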
This is not CPM; it is VSN output, which is like rlog in DESeq2.
Why is 'cpm' in your PCA bi-plot titles? VSN is not like rlog in DESeq2.
It is just a name; I labelled the VSN output 'vsnCPM'. The plots were made from the VSN output. VSN is a variance-stabilizing method, which is what rlog also aims to do, but rlog is less sensitive to sample size.
I did remove lowly expressed genes before normalization. Actually, I tried PCA on the top 50% of genes, and the results are very similar. I have very low depth for these samples: after mapping and filtering, I have only 2M ~ 9M (3M on average) reads left per sample. When filtering lowly expressed genes, I only required that the top 3 samples each have at least 1 mapped read, which filtered out 40% of the genes called. Do you think the results would change if I increased my expression filtering cutoff?
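For reference, a stricter and library-size-aware filter could look like the sketch below, using edgeR's `filterByExpr` with default settings (an assumption; the defaults may need tuning for such shallow libraries):

```r
## Sketch of a stricter filter that accounts for library size rather than
## a fixed raw-count cutoff; 'dge' and 'condition' come from the earlier sketch.
library(edgeR)

keep <- filterByExpr(dge, group = condition)
table(keep)                                     # how many genes survive the filter
dge_filt <- dge[keep, , keep.lib.sizes = FALSE]
dge_filt <- calcNormFactors(dge_filt, method = "TMM")
```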
You have a small sample size and shallow sequencing, so it is no surprise there is a lot of noise. Do the "outliers" correlate with sequencing depth?
Well, with only three samples per condition, I cannot really speak of correlation. I do observe that the outliers have relatively higher read counts. Here are the read counts per sample:
What is the % variation explained by your PC1?
Sequencing depth does not appear to be the sole reason, as Th2KO_wt_rep1 also has high depth but groups fine with the other samples.
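A quick, more formal check would be something like the sketch below (assuming the `dge` and `pca` objects from the earlier sketches):

```r
## Sketch: does library size track the first principal component?
libsize <- colSums(dge$counts)
plot(libsize, pca$x[, 1],
     xlab = "Library size (total counts)", ylab = "PC1", pch = 19)
text(libsize, pca$x[, 1], labels = names(libsize), pos = 3, cex = 0.7)
cor.test(libsize, pca$x[, 1], method = "spearman")
```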
If you have still not sorted this out, can you show your code and also clearly state the source of the samples and any errors that occurred during sample processing in the wet lab? You never clearly stated this, from what I can see. One cannot introduce a batch variable without a logical reason, of course.