Hi all,
I am working with human RNA-seq data (24 patients, 20 controls) and used DESeq2 to find differentially expressed genes.
The analysis is corrected for cell-type composition (using CIBERSORT, then PCA on the estimated cell-type proportions).
Here is the code that I used:
library(DESeq2)

# PC1 and PC2 (from the PCA on cell-type proportions) are included as covariates
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                  directory = folder,
                                  design = ~ Plate + RIN + Sex + Age + condition + PC2 + PC1)
dds <- estimateSizeFactors(dds)
# keep genes with a count of at least 10 in at least 20 samples
keep <- rowSums(counts(dds) >= 10) >= 20
dds <- dds[keep, ]
colData(dds)$condition <- relevel(colData(dds)$condition, ref = "Control")
dds <- DESeq(dds)
resultsNames(dds)
# coef must match one of the names printed by resultsNames(dds)
resLFC <- lfcShrink(dds, coef = "condition_patints_vs_Control", type = "apeglm")
vsd <- vst(dds, blind = FALSE)
Here is my PCA plot:
library(ggplot2)

data <- plotPCA(vsd, intgroup = c("condition", "Sex"), returnData = TRUE)
percentVar <- round(100 * attr(data, "percentVar"))
ggplot(data, aes(PC1, PC2, color = condition, shape = Sex)) +
  geom_point(size = 3) +
  labs(x = paste0("PC1: ", percentVar[1], "% variance"),
       y = paste0("PC2: ", percentVar[2], "% variance"))
Using the following cut-off:
res <- resLFC[abs(resLFC$log2FoldChange) >= 1 & (resLFC$padj < 0.05), ]
Before removing the outliers:
Number of upregulated genes is: 251
Number of downregulated genes is: 253
But after removing those three outliers in the PCA:
Number of upregulated genes is: 14
Number of downregulated genes is: 9
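For reference, here is a minimal sketch of how up/down counts like these can be tallied with that cut-off. It uses a made-up toy data frame in place of `resLFC` (the values are purely illustrative, not my results), and it wraps the filter in `which()` on the assumption that some adjusted p-values may be `NA`, so those rows are dropped instead of propagating as `NA` rows.

```r
# Toy stand-in for resLFC; all values are invented for illustration only.
toy <- data.frame(
  log2FoldChange = c(2.1, -1.5, 0.3, 1.2, -2.0, 1.8),
  padj           = c(0.001, 0.020, 0.010, NA, 0.200, 0.040),
  row.names      = paste0("gene", 1:6)
)

# Same cut-off as above; which() discards comparisons that evaluate to NA.
sig <- toy[which(abs(toy$log2FoldChange) >= 1 & toy$padj < 0.05), ]

# Split the significant genes by direction of change.
n_up   <- sum(sig$log2FoldChange > 0)
n_down <- sum(sig$log2FoldChange < 0)
cat("Number of upregulated genes is:", n_up, "\n")
cat("Number of downregulated genes is:", n_down, "\n")
```

Running the same tally on each version of the analysis (with and without the three samples) keeps the comparison consistent.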
Could you please advise whether it is normal to lose so many genes after removing outliers? Should I keep those samples or remove them?
What could explain such a large difference in the number of differentially expressed genes? (Those 3 samples come from the same sites and they are all patients.)
PS: sample size after removing the 3 outliers: 21 patients, 20 controls.
I checked the alignment log files for those three samples. Their overall alignment rate was above 90%, like the rest of the samples. (I used HISAT2 for alignment and htseq-count for generating the count files.)
Thank you in advance!
Also, there is another thing:
When I don't include PC1 and PC2 as covariates (i.e., without adjusting for cell-type composition), removing the outliers does not change the number of significant genes nearly as much.
(The cut-off is the same as before.)
Without PC1 and PC2 as covariates:
Before removing the outliers - without PC1,PC2 as covariates:
Number of upregulated genes is: 392
Number of downregulated genes is: 254
But after removing those three outliers - without PC1,PC2 as covariates:
Number of upregulated genes is: 524
Number of downregulated genes is: 251
considering PC1,PC2 as covariates:
Before removing the outliers - considering PC1,PC2 as covariates:
Number of upregulated genes is: 345
Number of downregulated genes is: 242
But after removing those three outliers - considering PC1,PC2 as covariates:
Number of upregulated genes is: 14
Number of downregulated genes is: 9
This really confuses me. Any ideas, please?
Hi, this may be important: I was also wondering about your formula. Could you explain why you included them? They are not experimental variables, so maybe they should not be included?
Do you mean PC1 and PC2 in my design formula?
If so, how should I adjust my differentially expressed genes for cell-type proportions? What should my design formula look like to include PC1 and PC2 as covariates?
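To make the question concrete, here is a small sketch of what a design with PC1 and PC2 as covariates expands to. The sample table is entirely made up (8 hypothetical samples, with `patients` assumed as the non-reference level name and Plate and RIN left out for brevity), and base R's `model.matrix()` is used only to show the resulting columns. It illustrates the formula and is not a recommendation for the real data.

```r
set.seed(1)

# Hypothetical sample table; every value here is invented for illustration.
toy <- data.frame(
  condition = factor(rep(c("Control", "patients"), each = 4),
                     levels = c("Control", "patients")),  # Control is the reference
  Sex = factor(rep(c("F", "M"), times = 4)),
  Age = c(34, 51, 47, 62, 39, 58, 44, 66),
  PC1 = rnorm(8),  # stand-ins for the cell-type composition PCs
  PC2 = rnorm(8)
)

# Covariates first, variable of interest last.
design <- ~ Sex + Age + PC1 + PC2 + condition

# Each covariate becomes a column; condition is estimated after adjusting for them.
mm <- model.matrix(design, data = toy)
print(colnames(mm))
```

In DESeq2 the same kind of formula is passed as `design=`, and the condition coefficient is then picked by name from `resultsNames(dds)`, as in the code in my first post.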