I am working on an RNA-seq experiment with 50 patients. Since both edgeR and DESeq2 assume a negative binomial distribution to infer differentially expressed genes, I would like to test whether this assumption holds for my dataset. Does that make sense? Moreover, what would be the fastest way to do so?
Thanks in advance
There is no quick and easy way to check this beyond the steps outlined by rpolicastro - following the QC steps in edgeR and DESeq2 to check the mean-variance trend estimates.
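For what it's worth, a minimal sketch of those checks, assuming a raw integer count matrix `counts` (genes x samples) and a `group` factor of conditions (both hypothetical names):

```r
library(edgeR)
library(DESeq2)

## edgeR: estimate dispersions and inspect the BCV (square root of the
## dispersion) against average log-CPM
design <- model.matrix(~ group)
y <- DGEList(counts = counts, group = group)
y <- y[filterByExpr(y), , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)
y <- estimateDisp(y, design)
plotBCV(y)                 # gene-wise dispersions around the fitted trend

## DESeq2: the dispersion-versus-mean plot serves the same purpose
dds <- DESeqDataSetFromMatrix(counts, data.frame(group = group), design = ~ group)
dds <- DESeq(dds)
plotDispEsts(dds)          # gene-wise, fitted and shrunken dispersion estimates
```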
You could also look at the mean-variance relationship for individual genes - under the negative binomial the variance should be a quadratic function of the mean (variance = mean + d*mean^2, where d is the dispersion). Or fit negative binomial models using something like glm.nb from the MASS R package and run a goodness-of-fit test on the result; but a GOF test only really makes sense when you compare it against an alternative, and there aren't really any good alternatives here.
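If you do want to try the per-gene route, here is a rough sketch of the glm.nb idea with a deviance goodness-of-fit test, again using the hypothetical `counts` and `group` objects from above:

```r
library(MASS)

lib_size <- colSums(counts)
gene <- counts[1, ]        # pick one reasonably expressed gene

## NB GLM with a library-size offset, so counts are modelled on a per-library scale
fit <- glm.nb(gene ~ group + offset(log(lib_size)))

## Deviance goodness-of-fit: a very small p-value suggests a poor NB fit,
## though as noted above this is hard to interpret without a concrete alternative
pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)
```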
There are good theoretical reasons why RNA-seq data should fit a negative binomial distribution. The most likely explanation for an RNA-seq dataset not fitting would be unaccounted-for heterogeneity within your samples: sub-groups within your conditions that behave differently from one another. A PCA analysis (as mentioned by rpolicastro) might help you identify this.
With 50 samples, if they are from different individuals, particularly genetically different individuals, you almost certainly have some extra variance. Lots of variance is fine, as long as it is unstructured and randomly distributed, but you will have a problem if it forms groups. There are several tools for removing unobserved variance structure from large RNA-seq datasets, such as svaseq (from the sva Bioconductor package). I recommend you look into those.
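For reference, a sketch of how svaseq is typically run, assuming a filtered count matrix `counts` and a `condition` factor (hypothetical names):

```r
library(sva)

pheno <- data.frame(condition)                      # sample metadata
mod   <- model.matrix(~ condition, data = pheno)    # full model: effect of interest
mod0  <- model.matrix(~ 1, data = pheno)            # null model: intercept only

sv <- svaseq(as.matrix(counts), mod, mod0)

## sv$sv holds the estimated surrogate variables; add them as covariates to
## the edgeR/DESeq2 design so the hidden structure is adjusted for during testing
```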
See this Twitter thread from @MikeLove, one of the DESeq2 authors.
Got an RNA-seq dataset with 50, 100, 200+ samples? Plug it into a differential expression tool and hope for the best? No! You need to consider QC, EDA, and modeling technical variation, or else risk generating spurious results. A thread on papers, methods, and best practices: pic.twitter.com/p7Zn61QjHw
I suspect that there are several factors affecting my data - not artifacts, but true biological factors. Is there a way to investigate their contribution? Moreover, it is plausible that there is no simple explanation. Would it be reasonable to perform unsupervised clustering to divide the samples into groups for a better understanding?
There are more advanced techniques that will try to deconvolve sources of variance, but if you know what those sources might be, the simplest thing is to include them in the design formula for the model. Beware of the Table 2 fallacy though: while you can add extra explanatory factors to a model in addition to the one being studied, in an observational study (i.e. one where exposures are not determined by the experimenter) it is dangerous to attribute different strengths of causation to the different factors in a multiple regression.
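In DESeq2, for example, that just means listing the known factors in the design formula, conventionally with the variable of interest last (the column names `sex`, `batch`, `condition` and the levels `treated`/`control` here are hypothetical):

```r
library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ sex + batch + condition)
dds <- DESeq(dds)

## test the condition effect while adjusting for the other factors
res <- results(dds, contrast = c("condition", "treated", "control"))
```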
Follow the QC steps in the edgeR or DESeq2 manuals and those will give you a good idea about the quality of your input.
With a cohort that large, the PCA analysis of your counts described in the manuals will also be of particular interest to you.
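Something along the lines of the PCA section of the DESeq2 vignette, assuming a DESeqDataSet `dds` built as in the manual with a `condition` column in its sample data (hypothetical names):

```r
library(DESeq2)

## variance-stabilise the counts before PCA
vsd <- vst(dds, blind = TRUE)

plotPCA(vsd, intgroup = "condition")   # colour by any suspected grouping variable

## or with base R, to look beyond the first two components
pca <- prcomp(t(assay(vsd)))
summary(pca)
```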
I'll check them, thanks for the suggestion