I have RNA-seq data for 20 samples covering 2 conditions (control, treatment) and 2 sexes (male, female). I am very new to RNA-seq analysis and am trying to find the DEGs using DESeq2. Since I want the normalization to be calculated across all samples, I compute the rld from all of them and then use contrasts to find the DEGs for 1. treatment vs. control in males and 2. treatment vs. control in females.
For the QC, the PCA plot separates the male and female samples, but control and treatment are not separated very well on PC2 for the male sample C1 (Figure 1).
Then I decided to plot the PCA for the male samples only. Again, PC1 does not separate the control and treatment male samples, but PC2 more or less does (Figure 2).
My questions are:
What numbers on the PCA plot (x and y axes) decide the separation? I read on a website that samples at PC1 > 0 are outliers. Is that true?
Can I just look at Figure 1, remove male C1, and continue the DEG analysis with 9 samples? Or should I definitely plot Figure 2?
If I need to consider Figure 2, can I rely only on PC2, which separates control and treatment, and continue the DEG analysis? Or should I remove the C1, tr4, and tr5 samples and run the DEG analysis on the remaining 7 samples?
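On the question of what the axis numbers mean: the values on a PCA plot are sample scores along each principal component, centred at zero. The sign of a score therefore says nothing about outlier status on its own. The percentage of variance each PC explains and the relative distances between samples are more informative. A minimal sketch in base R on simulated data (all names and values are made up for illustration and are not the data from this post):

```r
# Simulated data only: 500 "genes" x 10 samples on a log-like scale.
set.seed(1)
n_genes <- 500
condition <- rep(c("control", "treatment"), each = 5)
expr <- matrix(rnorm(n_genes * 10), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0(rep(c("C", "tr"), each = 5), 1:5)))
# Shift the first 100 genes in the treatment samples to create a group effect.
expr[1:100, condition == "treatment"] <- expr[1:100, condition == "treatment"] + 3

# prcomp() expects samples in rows; it centres each gene at zero by default.
pca <- prcomp(t(expr))

# Percent of total variance explained by each component.
percent_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
print(round(percent_var[1:2], 1))

# Scores are centred, so each PC averages to (numerically) zero across samples:
# a sample at PC1 > 0 is simply on one side of the mean.
print(colMeans(pca$x[, 1:2]))

plot(pca$x[, 1], pca$x[, 2], col = factor(condition), pch = 19,
     xlab = sprintf("PC1 (%.1f%%)", percent_var[1]),
     ylab = sprintf("PC2 (%.1f%%)", percent_var[2]))
text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x), pos = 3)
```

A PC that explains only a few percent of the variance carries less weight than its position on the plot suggests, so it is worth reading the axis percentages before deciding that a sample is unusual.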
Hi Kevin,
Thanks for your response. I believe it is not a sample mix-up (but I am going to check it again). I do not know whether it is a male with Klinefelter syndrome. How can I check that?
For the normalization, I actually made a third column in my sample table that combines sex and condition into one group.
Then I created the dds object as:
Then I used contrasts for further analysis:
The PCA plot comes from the code above. I think that making the "group" column includes sex in the design and is similar to a ~sex+condition design. Is that right? Is the code above what you meant by including sex?
By segregating male | female, do you mean running the analysis for the male and female samples separately because of the major difference between them?
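For reference, a combined-group setup of this kind can be sketched as follows. This is an illustration on DESeq2's simulated example data, not the code from this thread: the column names `sex`, `condition`, `group` and the level names are assumptions. A four-level `~ group` design is not identical to `~ sex + condition`: it is equivalent to a model with a sex-by-condition interaction, so the treatment effect is allowed to differ between males and females, whereas `~ sex + condition` assumes one shared treatment effect.

```r
library(DESeq2)

set.seed(1)
# Simulated counts: 1000 genes x 20 samples (placeholder for real data).
dds <- makeExampleDESeqDataSet(n = 1000, m = 20)

# Hypothetical sample annotation: 5 samples per sex/condition combination.
dds$sex <- factor(rep(c("male", "female"), each = 10))
dds$condition <- factor(rep(rep(c("control", "treatment"), each = 5), times = 2))
dds$group <- factor(paste(dds$sex, dds$condition, sep = "_"))
design(dds) <- ~ group

# ~ group estimates one mean per sex/condition cell (4 coefficients);
# ~ sex + condition estimates only 3, forcing a shared treatment effect.
cd <- as.data.frame(colData(dds))
n_coef_group <- ncol(model.matrix(~ group, cd))
n_coef_additive <- ncol(model.matrix(~ sex + condition, cd))

# Fit once on all samples, then pull out the sex-specific comparisons.
dds <- DESeq(dds)
res_male <- results(dds, contrast = c("group", "male_treatment", "male_control"))
res_female <- results(dds, contrast = c("group", "female_treatment", "female_control"))
summary(res_male)
```

Fitting all samples in one model means size factors and dispersions are estimated from the full dataset, while each contrast still compares only the two named groups.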
Ah, I see... sex is already included in group. When you look at the results of the differential expression comparison, do they make sense to you? Out of curiosity, take a look at the output of this:
The results actually make sense with the 10 samples, but when I remove C1, I get twice as many DEGs.
Changing to blind=FALSE, I get the following PCA plot:
C1, tr4, and tr5 look somewhat different from the other samples.
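As background on the `blind` argument: with `blind = TRUE`, the transformation re-estimates dispersions while ignoring the experimental design, whereas `blind = FALSE` uses the design when estimating dispersions, so that real group differences are not treated as noise. A minimal sketch on DESeq2's simulated example data (the sample count and the `condition` column come from the example dataset, not from this thread):

```r
library(DESeq2)

set.seed(1)
# Simulated counts: 1000 genes x 10 samples, colData column `condition` (A/B).
dds <- makeExampleDESeqDataSet(n = 1000, m = 10)

# Same transformation, with and without knowledge of the design.
rld_blind <- rlog(dds, blind = TRUE)
rld_design <- rlog(dds, blind = FALSE)

# returnData = TRUE gives the PC scores instead of drawing the plot,
# with the explained variance stored in the "percentVar" attribute.
pca_df <- plotPCA(rld_design, intgroup = "condition", returnData = TRUE)
percent_var <- round(100 * attr(pca_df, "percentVar"), 1)
print(percent_var)
print(head(pca_df))
```

If the same samples stand apart under both settings, the pattern is less likely to be a side effect of the transformation.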
Thanks - in that case, it just seems like a genuine result with regard to that sample (C1), unless there was indeed a mix-up. As I don't know your area of study, I cannot comment much further; however, occasionally, controls are not quite what they seem!
There is no right or wrong answer here. You can choose to leave it in your dataset or exclude it. Either way, you have to record what you do.
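One way to document the decision is to run the analysis both ways and report both results. The sketch below uses DESeq2's simulated example data, with `sample1` standing in for the questionable sample. The 0.05 cutoff and the helper `n_sig` are assumptions for illustration only.

```r
library(DESeq2)

set.seed(1)
# Simulated counts with some true differences between conditions A and B.
dds <- makeExampleDESeqDataSet(n = 1000, m = 10, betaSD = 1)

# Analysis with every sample.
dds_all <- DESeq(dds)

# Analysis with one sample excluded; "sample1" is a stand-in name.
dds_drop <- dds[, colnames(dds) != "sample1"]
dds_drop <- DESeq(dds_drop)

# Hypothetical helper: number of genes with adjusted p-value below alpha.
n_sig <- function(d, alpha = 0.05) {
  res <- results(d, alpha = alpha)
  sum(res$padj < alpha, na.rm = TRUE)
}

deg_counts <- c(all_samples = n_sig(dds_all), without_sample1 = n_sig(dds_drop))
print(deg_counts)
```

Reporting both counts, together with the reason for any exclusion, lets readers judge how sensitive the conclusions are to that one sample.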
Thank you Kevin, I checked and it is not a sample mix-up. I will continue with further analysis and see what makes more sense based on the results.