Hello!
I am quite new to bioinformatics, so I hope my question will be clear enough.
I am trying to run a DESeq2 analysis on 25 bovine tumor samples. Among them I have two technical replicates of my only control (I know this is not ideal), and most of my "treated" samples have one technical replicate too. Before any DESeq2 analysis I had to drop a few samples because the quality of the RNA-seq was not good enough.
design = ~Group
Overview of colData:
row.names  sample   Group
sample1    sample1  treated
sample2    sample1  treated
sample3    sample2  control
sample4    sample2  control
I tried two different approaches: either start the DESeq analysis without specifying that I had technical replicates (dds), or use the collapseReplicates function based on the colData sample column to merge the reads (ddsCollapsed).
dds <- DESeqDataSetFromMatrix(countData = matrix, colData = colData, design = design)
ddsCollapsed <- collapseReplicates(dds, groupby = colData(dds)$sample, renameCols = TRUE)
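For readers unfamiliar with it, collapseReplicates sums the counts of columns that share a grouping level. A minimal base-R sketch of that idea, using a made-up toy matrix (the gene names and count values are hypothetical and are not my data):

```r
# Toy count matrix: 3 genes x 4 sequencing runs (values are made up).
counts <- matrix(
  c( 10, 12,  5,  7,
      0,  1,  3,  2,
    100, 90, 40, 60),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("sample1", "sample2", "sample3", "sample4"))
)

# Biological sample each run belongs to (mirrors the `sample` column of colData).
sample_id <- c("sample1", "sample1", "sample2", "sample2")

# rowsum() sums rows by group, so transpose to runs x genes, sum per
# biological sample, then transpose back to genes x samples.
collapsed <- t(rowsum(t(counts), group = sample_id))
collapsed
```

The result has one column per biological sample, which is roughly what ddsCollapsed holds in its count matrix; this sketch ignores the colData bookkeeping that collapseReplicates also performs.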
My problem lies in the DESeq analysis:
DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 6787 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
DESeq(ddsCollapsed)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 24224 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
I am working with the bovine Ensembl annotation, which contains ~24660 entries...
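To put the two refit counts in proportion, here is a quick calculation. It assumes the approximate annotation size of 24660 as the denominator, which may overstate the number of genes actually tested if some have zero counts:

```r
# Numbers copied from the DESeq() messages and the annotation size quoted above.
n_genes <- 24660                       # approximate, taken from the annotation
refit   <- c(dds = 6787, ddsCollapsed = 24224)

# Percentage of genes flagged for outlier replacement in each run.
pct_refit <- round(100 * refit / n_genes, 1)
pct_refit
```

So roughly a quarter of the genes are refit without collapsing, and nearly all of them after collapsing.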
I was really surprised by the number of outliers. Moreover, the MA plots from those two analyses do not look good (I have attached the one for ddsCollapsed to this post):
I have already read the supplementary material about Cook's distance.
So my questions are the following:
1. Do I have to worry about such a high number of outliers? Is it common? What could be the reasons leading to those numbers?
2. If yes to (1), what can I do to overcome this problem?
3. An unrelated question: is it possible to put missing values (NA) in the colData table? I tried and got this error:
Error in t(hatmatrix %*% t(y)) :
"error in evaluating the argument 'x' in selecting a method for function 't': Error in hatmatrix %*% t(y) : non-conformable arguments"
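One plausible reading of this error, assuming the NA values sit in a column used by the design (here Group), is that the design matrix ends up with fewer rows than there are samples. A base-R sketch with a hypothetical sample table (the run names and values are invented for illustration) shows the mismatch and one possible workaround:

```r
# Hypothetical sample table with one missing Group value.
col_data <- data.frame(
  sample = c("sample1", "sample1", "sample2", "sample2"),
  Group  = factor(c("treated", NA, "control", "control")),
  row.names = c("run1", "run2", "run3", "run4")
)

# Under the default na.action, model.matrix() drops rows containing NA,
# so the design matrix has fewer rows than the sample table.
design_matrix <- model.matrix(~ Group, data = col_data)
nrow(col_data)
nrow(design_matrix)

# Possible workaround: keep only samples with a known Group before
# building the dataset.
keep <- !is.na(col_data$Group)
col_data_complete <- col_data[keep, , drop = FALSE]
# The count matrix would need the same column subset, e.g. counts[, keep].
```

Whether this is really what triggers the message above is a guess on my part; NA values in columns that are not part of the design may behave differently.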
Thanks for reading this long post! Any advice would be appreciated! Vincent
Thanks, those were exactly the answers I needed!