Question

Evaluating RNASeq counts distribution

0

Entering edit mode

2.4 years ago

Gama313 ▴ 130

I am working on a 50 patients RNASeq experiment. Since both EdgeR and Deseq2 assumes a Negative Binomial distribution to infer differentially expressed genes, I would like to test whether this assumpion is applicable to my dataset. Does it makes sense? Morerover, which could be the fastest way to do so? Tanks in advance

negative differential RNASeq expression binomial • 1.2k views

ADD COMMENT • link 2.4 years ago by Gama313 ▴ 130

1

Entering edit mode

Follow the QC steps in the edgeR or DESeq2 manuals and those will give you a good idea about the quality of your input.

With a cohort that large the PCA analysis of your counts as described in the manuals will be of particular interest for you also.

ADD REPLY • link 2.4 years ago by rpolicastro 13k

0

Entering edit mode

I ll check them, thanks for the suggestion

ADD REPLY • link 2.4 years ago by Gama313 ▴ 130

score 2 · Answer 1 · 2022-11-17

There is no quick and easy way to check this beyond those outlined by rpolicastro - following the QC steps in edgeR and DESew2 to check the mean-variance trend estimates.

You could also look at mean-variance relationships for indevidual genes - they should follow a quadratic relationship (mean + d*mean^2). Or fit negative binomial models using something like glm.nb from the MASS R package, and run a goodness of fit test on the result, but generally when you run a GOF test, it only really makes sense if you compare it to an alternative, and there arn't really any alternatives.

There are good theoretical reasons why RNAseq data should fit a negative binomial distribution. The mostly likely explaintation for an RNAseq dataset not fitting would be unacounted for hetrogenetity within your samples: Sub-groups within your condiditons that behave differently from one another. A PCA analysis (as mentioned by rpolicastro ) might help you identify this.

With 50 samples, if they are from different indeviduals, particularly if they are genetically different indevidauls, you almost certainly have some extra variances. Lots of variance is fine, as long as it is unstructured and randomly distributed. But you will have a problem if it forms groups. There are several tools for removing unobserved variance structure from large RNAseq datasets, such as SVA-seq. I recommend you look into those.

See this twitter thread from @MikeLove one of the the DEseq2 authors.

Got an RNA-seq dataset with 50, 100, 200+ samples? Plug it into a differential expression tool and hope for the best? No! You need to consider QC, EDA, and modeling technical variation, or else risk generating spurious results. A thread on papers, methods, and best practices: pic.twitter.com/p7Zn61QjHw
— Michael Love (@mikelove) April 11, 2022