Evaluating RNASeq counts distribution
1
0
Entering edit mode
2.1 years ago
Gama313 ▴ 130

I am working on a 50 patients RNASeq experiment. Since both EdgeR and Deseq2 assumes a Negative Binomial distribution to infer differentially expressed genes, I would like to test whether this assumpion is applicable to my dataset. Does it makes sense? Morerover, which could be the fastest way to do so? Tanks in advance

negative differential RNASeq expression binomial • 1.1k views
ADD COMMENT
1
Entering edit mode

Follow the QC steps in the edgeR or DESeq2 manuals and those will give you a good idea about the quality of your input.

With a cohort that large the PCA analysis of your counts as described in the manuals will be of particular interest for you also.

ADD REPLY
0
Entering edit mode

I ll check them, thanks for the suggestion

ADD REPLY
2
Entering edit mode
2.1 years ago

There is no quick and easy way to check this beyond those outlined by rpolicastro - following the QC steps in edgeR and DESew2 to check the mean-variance trend estimates.

You could also look at mean-variance relationships for indevidual genes - they should follow a quadratic relationship (mean + d*mean^2). Or fit negative binomial models using something like glm.nb from the MASS R package, and run a goodness of fit test on the result, but generally when you run a GOF test, it only really makes sense if you compare it to an alternative, and there arn't really any alternatives.

There are good theoretical reasons why RNAseq data should fit a negative binomial distribution. The mostly likely explaintation for an RNAseq dataset not fitting would be unacounted for hetrogenetity within your samples: Sub-groups within your condiditons that behave differently from one another. A PCA analysis (as mentioned by rpolicastro ) might help you identify this.

With 50 samples, if they are from different indeviduals, particularly if they are genetically different indevidauls, you almost certainly have some extra variances. Lots of variance is fine, as long as it is unstructured and randomly distributed. But you will have a problem if it forms groups. There are several tools for removing unobserved variance structure from large RNAseq datasets, such as SVA-seq. I recommend you look into those.

See this twitter thread from @MikeLove one of the the DEseq2 authors.

ADD COMMENT
0
Entering edit mode

I suspected that there are several factors affecting my data. Not artifacts, but true biological factors. Is there a way to investigate their contribition? Moreover, it is plausible that there is no simple explanation. It would be reasonable performing unsupervised clustering to dived groups for a better understanding?

ADD REPLY
0
Entering edit mode

There are more advanced techniques that will try to deconvolve sources of variance, but if you know what they might be, the simplest thing is to include them in the design formula for the model. Beware of the table 2 falacy though: while you can add extra explanitory factors to a model in addition to the one being studied, it is dangerous in an observational study (i.e. one where exposures are not determined by the experiementor) to attribute different strengths of causaltion to different model factors when conducting multiple regression.

ADD REPLY
0
Entering edit mode

Thank you for your help!

ADD REPLY

Login before adding your answer.

Traffic: 1916 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6