Hello All,
I am working on RNAseq data for which one I have the Phenotype (DPN or PDPN), Gender (female or Male), Age and the batch (1st_round or 2nd_round ):
ID PiNS.ID Phenotype PiNS Gender Age batch
PINS_0112 112 PDPN PINS_0112 Female 64 1st_round
PINS_0171 171 DPN PINS_0171 Male 74 2nd_round
Basically, I want to get the differential gene expression regarding the Phenotype (DPN or PDPN). However, I did a PCA and I have a strong batch effect. So, I would like to design my analysis to get the differential gene expression for the Phenotype but also considering the batch effect (by removing it), Gender and the Ages.
So far, I designed my analysis as follow:
dds <- DESeqDataSetFromMatrix(countData = countData,colData = colData, design = ~ batch + Phenotype)
dds$Phenotype <- factor(dds$Phenotype, levels = c("DPN","PDPN"))
dds <- DESeq(dds, full=design(dds))#, reduced = ~ batch)
First, I am not sure if it is the correct way to remove/consider the batch effect? Secondly, I am not sure how to write my design to also consider the Gender and the Age!
Many thanks in advance for your help.
Kind regards, Pierre
What is your
colData
?Hi Carlo,
Thanks a lot for you answer. Really sorry, here you go:
The 2nd_round starts at the half of my design.tsv.
Thanks, Pierre
Can you post the full colData? Is age supposed to be categorical (like young/old) or continuous?
Sorry ATpoint, it seems that my answer did not appear, I reposts it!
So here is the full colData:
I got the data like that! Many thanks for your help! I am not in the field of RNAseq, I struggle a bit...
Pierre
Hi all,
I did some modifications on my design.tsv. Instead of having Age as a continuous variable, I transformed it to be categorical (I got an Error regarding the numerical value):
Then, I modified my previous code above to design my analysis to get the differential gene expression for the Phenotype but also considering the batch effect, Gender and the Ages:
After running the code, I got this warning message :
Firstly, I am not sure to understand why I got this warning message. Is it because of my design?
If not, I was wondering if my design is the correct way to consider the batch effect, Gender and Age to identify differential gene expression regarding the Phenoype (PDPN vs DPN)?
At the end, I got 31 genes that have different expressions. However, for example, when I plot the top 6 of those genes (one example):
This gene (and the others) does not appear differentially expressed between the Phenotype (PDPN vs DPN) but more randomly across the samples. So, I am not sure if it's correct and it's because I don't have different genes expressed between my Phenotype or it is because I did not design my analysis correctly!
Any help or advice will be much appreciated. Many thanks in advance. Pierre
Hi ! Your code and design look good. I think it is a good idea to use categorial rather than continuous Age factor. In the manual, it is stated that continuous covariates should only be used where a constant fold change is expected for each unit of the covariate, which probably is not the case with age.
Concerning the warning, it happens when you have rows with many zeros across your samples. You could perhaps pre-filter those, as describe here. Or as suggested in the warning, increase the number of iteration of the
nbinomWaldTest
function (maxit = 1000
for instance).Once you fix this, if you still have only a few DEG, then either there is a very small phenotype effect compared to residual variance OR sequencing depth is insufficient to measure robust gene expression changes. For further diagnostic, performing a PCA will allow to assess the overall portion of variability explained by the Phenotype effect and a MA-plot might help you assess if you have sufficient read count.
Hi Carlo,
Many thanks for your reply! I did the modification as suggested:
design
As you can see I have played with the maxit value (to increase it) because I am still getting the same warning message:
My code is not correct or someone knows what it did not solve the warning message?
Thanks again for your help, much appreciated. Pierre
I do not have in-depth knowledge of what
maxit
is actually doing but I think your filter is basically not doing much other than removing all-zero rows as you are only requesting that at least two samples have more than 10 counts but you have almost 150 samples, so you could still have like 98% of samples having zeros. I would either increase the filtering, say that at least, e.g. 10 or 20% per group must have counts above 10, or you can use thefilterByExpr
function inedgeR
which does filtering with respect to the sample size and design with defaults that seem to be generally well accepted. But first I would just try quick'n'dirty to require 10% of samples have some counts, which is roughly 15 samples here rather than just two, and then play with this theshold a bit and see whether it solves the issue:Hi ATpoint,
Many thanks for your reply!
I did what you suggested, but unfortunately I am still getting the same warning message!
Not sure what to do more! I will try to find something!
Thanks again for your help, Pierre