I am currently working on a research project involving RNA-seq data analysis and have encountered a statistical challenge on which I would appreciate your insights and suggestions.
I have two sets of RNA-seq data (normalised for batch effects and log-transformed) that I need to compare, and both have skewed distributions. The first is skewed to the right (with most values above 7 on a log2 scale), while the values in the second are higher but skewed to the left. These skewed distributions violate not only the assumption of normality but also the assumption of symmetry made by traditional paired tests such as the paired Wilcoxon signed-rank test.
It's worth mentioning that these two datasets represent different genes, and my goal is not a differential expression analysis but rather a comparative study. I want to assess the difference in expression between two specific genes within the same experimental condition. Therefore, packages such as edgeR and DESeq2 don't really fit my needs.
With over 300 samples in my dataset, I am looking for robust statistical methods or alternative approaches that can handle skewed data distributions and allow for a meaningful comparison between the two datasets.
I would greatly appreciate any insights or recommendations you might have regarding suitable statistical techniques or creative solutions to tackle this challenge.
Are these datasets completely independent of each other? Please describe the datasets in more detail.
No. Each row is a sample and each column is a gene, so the RNA-seq values of the two genes in a given row come from the same sample. The values of the two genes are therefore related.
Here is a better breakdown of the dataset:
My goal is to compare the expression level of the 2 genes across all samples given that they are paired.
So basically gene 1 vs gene 2. I would just do a Wilcoxon test. The Wilcoxon test does not assume normality, and I personally think that expression levels of different genes should not be compared anyway, because differences can be technical, for example GC bias, mappability, etc.
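For illustration, here is a minimal sketch of the paired Wilcoxon signed-rank test suggested above, written in plain Python with a normal approximation (reasonable for 300+ samples). The data are synthetic and the names `signed_rank_test`, `gene1` and `gene2` are made up for this example; they are not from the original dataset. In practice a library routine such as `scipy.stats.wilcoxon(gene1, gene2)` would normally be used instead. Note that the test ranks the paired differences, so what matters is the distribution of gene2 minus gene1 within each sample, not the shape of each gene's distribution on its own.

```python
import math
import random
from statistics import median


def signed_rank_test(x, y):
    """Two-sided paired Wilcoxon signed-rank test (normal approximation,
    tie-corrected variance, no continuity correction).
    Returns (W+, z, p-value)."""
    # Paired differences; zero differences carry no sign and are dropped.
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank |d|, giving tied values the average of the ranks they span.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    # W+ is the sum of ranks belonging to positive differences.
    w_plus = sum(r for r, diff in zip(ranks, d) if diff > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w_plus, z, p


# Synthetic stand-in for the real data: 300 samples, two genes per sample.
rng = random.Random(0)
n_samples = 300
sample_effect = [rng.gauss(0, 0.5) for _ in range(n_samples)]  # shared per sample
gene1 = [7 + rng.expovariate(1.0) + s for s in sample_effect]   # right-skewed
gene2 = [11 - rng.expovariate(1.0) + s for s in sample_effect]  # higher, left-skewed

w_plus, z, p = signed_rank_test(gene2, gene1)
median_diff = median(b - a for a, b in zip(gene1, gene2))
print(f"W+ = {w_plus:.1f}, z = {z:.2f}, p = {p:.3g}")
print(f"median paired difference (gene2 - gene1) = {median_diff:.2f} log2 units")
```

With several hundred samples even a small shift will produce a tiny p-value, so it is worth reporting an effect size such as the median paired difference alongside it. The caveat above still applies: a difference between two genes can reflect technical factors rather than biology.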