Hi there,
I need to do some statistical analyses of my transcriptomes.
I have a table with 4 columns: gene ID (categorical), expression level (numeric), individual, and species. I have 2 different species and 5 individuals per species. For each individual I have more than 20000 genes (some of them are more highly expressed than others).
What I want to know is whether there are differences in expression level between the species. My data don't follow a Gaussian distribution.
To analyse my data I ran:
wilcox.test(Exp ~ Species, data = data)
and got:
Wilcoxon rank sum test with continuity correction
data: Exp by Species
W = 8573700000, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
According to these results there should be a significant difference in expression level between the species. BUT:
- I am not sure if the analysis is appropriate for this dataset.
- Is there any way to take the gene ID into account (as a random factor)?
Thank you so much in advance.
Your data probably follow a negative binomial distribution. This is normal and expected. Check out the common differential expression analysis pipelines, such as DESeq2, edgeR or limma/voom. All are well-documented.
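To make that suggestion a bit more concrete, here is a minimal sketch of how a long table like the one in the question might be reshaped and passed to DESeq2. It is an illustration under assumptions, not a complete pipeline: the toy data are simulated, the column names GeneID and Individual are made up (only Exp and Species appear in the question), and Exp is assumed to hold raw, un-normalised counts. These tools fit one model per gene rather than pooling all genes into a single test, which is one common way of accounting for gene identity.

```r
# Toy data in the long layout described in the question.
# GeneID and Individual are assumed column names; Exp and Species are from the post.
set.seed(1)
data <- expand.grid(
  GeneID = paste0("gene", 1:200),
  Individual = paste0("ind", 1:10),
  stringsAsFactors = FALSE
)
data$Species <- ifelse(data$Individual %in% paste0("ind", 1:5), "A", "B")
data$Exp <- rnbinom(nrow(data), mu = 100, size = 2)  # simulated raw counts

# Genes in rows, individuals in columns: the layout count-based tools expect.
counts <- with(data, tapply(Exp, list(GeneID, Individual), sum))
storage.mode(counts) <- "integer"

# One row of sample information per individual, in the same order as the columns.
coldata <- unique(data[, c("Individual", "Species")])
rownames(coldata) <- coldata$Individual
coldata <- coldata[colnames(counts), , drop = FALSE]
coldata$Species <- factor(coldata$Species)

# Per-gene negative binomial test; skipped if DESeq2 is not installed.
if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts,
                                        colData = coldata,
                                        design = ~ Species)
  dds <- DESeq2::DESeq(dds)
  res <- DESeq2::results(dds)  # one log2 fold change and adjusted p-value per gene
  print(res[order(res$padj)[1:6], ])  # six genes with the smallest adjusted p-values
}
```

With real data, the simulated table would be replaced by your own, and the number of independent replicates is the 5 individuals per species, not the 20000+ rows per individual that the pooled Wilcoxon test treats as independent observations.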
Which type of expression data do you have (RNA-seq, microarray, etc.)?
Sorry, I didn't say. It is RNA-seq.
I think you are looking for differential expression analysis. Check chapters 6 and 7 of the Bioconductor 2018 Workshop for more details.