I am trying to perform a Kruskal-Wallis test on RNA-Seq data, which has 20 genes of interest and looks like the following:
Gene_id Con_1 Con_2 Con_3 Mut_1 Mut_2 Mut_3
Gene_001 -0.173575646 0.519571535 -5.87735812 2.023648932 1.94668789 1.56102541
Gene_002 -0.185999458 -0.118197772 0.129667462 -0.071623581 0.249618688 -0.003465339
Gene_003 3.486046831 -5.693834334 -1.088664148 0.009948141 3.682020477 -0.395516967
Only the first 3 rows are shown to demonstrate what the data look like. The first column is the gene ID, columns 2-4 (Con, i.e. WT) are three independent biological replicates of the control, and columns 5-7 (Mut) are three independent biological replicates of the mutation as treatment. The data have been log-transformed.
Here are my questions:
Q #1. A couple of posts on this forum suggested transposing and/or melting the data in R. How exactly should I do that?
Q #2. Given that control and treatment have three replicates each, how should the Kruskal-Wallis test be performed? Should I average the three replicates for control and the three for mutation? In other words, what would be the best way to perform the Kruskal-Wallis test on RNA-seq data that has replicates?
Any help/suggestion would be appreciated. Thank you.
You can do something like this. I'm using your data above; Gene_004-Gene_009 are duplicates of the rows you posted.
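The code from this answer did not survive in the thread as shown here. As a rough stand-in, the following is a minimal sketch in Python (not R, and not bk11's code) of the two steps being asked about: melting the wide table into long records, then running one Kruskal-Wallis test per gene on the individual replicates rather than on their averages. The names `melt`, `kruskal_p`, and `p_values` are made up for illustration, and the p-value formula assumes exactly two groups and no tied values.

```python
import math
from collections import defaultdict

# The three example rows from the question (log-transformed values).
WIDE = """\
Gene_id Con_1 Con_2 Con_3 Mut_1 Mut_2 Mut_3
Gene_001 -0.173575646 0.519571535 -5.87735812 2.023648932 1.94668789 1.56102541
Gene_002 -0.185999458 -0.118197772 0.129667462 -0.071623581 0.249618688 -0.003465339
Gene_003 3.486046831 -5.693834334 -1.088664148 0.009948141 3.682020477 -0.395516967
"""


def melt(wide_text):
    """Wide table -> long records (gene, group, value), one per measurement."""
    header, *rows = [line.split() for line in wide_text.strip().splitlines()]
    records = []
    for row in rows:
        gene = row[0]
        for sample, value in zip(header[1:], row[1:]):
            group = sample.split("_")[0]  # "Con_1" -> "Con"
            records.append((gene, group, float(value)))
    return records


def kruskal_p(groups):
    """Chi-square-approximation p-value; ASSUMES two groups and no ties."""
    assert len(groups) == 2, "sketch only handles two groups (df = 1)"
    pooled = sorted(v for values in groups.values() for v in values)
    assert len(set(pooled)) == len(pooled), "sketch assumes no tied values"
    rank = {v: i + 1 for i, v in enumerate(pooled)}
    n = len(pooled)
    h = 12 / (n * (n + 1)) * sum(
        sum(rank[v] for v in values) ** 2 / len(values)
        for values in groups.values()
    ) - 3 * (n + 1)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(max(h, 0.0) / 2))


# Collect the long records per gene and per group, keeping every replicate.
by_gene = defaultdict(lambda: defaultdict(list))
for gene, group, value in melt(WIDE):
    by_gene[gene][group].append(value)

# One test per gene.
p_values = {gene: kruskal_p(groups) for gene, groups in by_gene.items()}
for gene, p in p_values.items():
    print(gene, round(p, 8))
```

The replicates are kept as separate observations within each group; averaging them first would leave one value per group and nothing to rank.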
bk11, thank you very, very much! It worked! I sincerely appreciate the example code with detailed comments. If it is OK, may I ask what res does in calculating the Kruskal-Wallis test? Is it basically looping the test over all genes, with the groups defined in groups?
Yes, you are correct, that is what it is meant to do.
Thank you very much once again for your help!
I find it curious that the p-values for the different genes all turned out to be one of two values, 0.04953461 and 0.27523352. Why is that?
For demo purposes, I created Gene_004-Gene_009 as duplicates of Gene_001-Gene_003.
I understand that, but it does not fully explain why Gene_002 has the same p-value as Gene_003. That is what stood out to me when looking at the data: the values for the two genes seem so radically different.
I would recommend generating these examples with some sort of random data. An example that shows an unexpected pattern can be very distracting when troubleshooting or answering a question: people may be diverted into suspecting some sort of methodological error along the way.
In both Gene_002 and Gene_003, the lowest, second-lowest, and second-highest values are controls. Because the test uses only the ranks, identical rank patterns give identical p-values. That it's so easy to get the same output suggests to me that this is not a helpful test for n = 6.
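To make that concrete, here is a small Python sketch (my own illustration, assuming no tied values and the usual chi-square approximation with 1 degree of freedom) that enumerates every way three of the six ranks can belong to the controls. The function names are hypothetical.

```python
import math
from fractions import Fraction
from itertools import combinations

N_CON, N_TOTAL = 3, 6  # three controls among six samples, as in the thread


def kw_statistic(con_ranks):
    """Exact Kruskal-Wallis H for the given control ranks (no ties assumed)."""
    mut_ranks = [r for r in range(1, N_TOTAL + 1) if r not in con_ranks]
    sum_sq = sum(Fraction(sum(g) ** 2, len(g)) for g in (con_ranks, mut_ranks))
    return Fraction(12, N_TOTAL * (N_TOTAL + 1)) * sum_sq - 3 * (N_TOTAL + 1)


def chi2_p(h):
    """Upper tail of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(float(h) / 2))


# Group the 20 possible control-rank sets by the statistic they produce.
arrangements = {}
for con_ranks in combinations(range(1, N_TOTAL + 1), N_CON):
    arrangements.setdefault(kw_statistic(con_ranks), []).append(con_ranks)

for h in sorted(arrangements, reverse=True):
    print(f"H = {float(h):.4f}  p = {chi2_p(h):.8f}  "
          f"({len(arrangements[h])} of 20 rank arrangements)")
```

Only five distinct p-values are possible with three versus three samples, so unrelated genes landing on the same value is expected rather than suspicious. Genes 002 and 003 both have control ranks (1, 2, 5), which is one of four arrangements sharing the same statistic.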
What is the purpose of the test being conducted here? Are you trying to get one p-value per gene? In that case, why use a Kruskal-Wallis test rather than a Mann-Whitney U test, if you insist on doing a non-parametric test? (Although see ATPoint's post below: you don't really have the power for a non-parametric test when testing per gene.)
Or is the idea to do some sort of gene-set test and get one overall p-value?
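On the power point, a quick sketch in Python of an exact two-sided Mann-Whitney U (Wilcoxon rank-sum) test by brute-force enumeration, assuming no ties. `exact_mwu_p` is a name I made up for illustration; the values are the Gene_001 and Gene_002 rows from the question.

```python
from itertools import combinations


def exact_mwu_p(con, mut):
    """Exact two-sided rank-sum p-value by enumeration; ASSUMES no ties."""
    pooled = sorted(con + mut)
    assert len(set(pooled)) == len(pooled), "sketch assumes no tied values"
    rank = {v: i + 1 for i, v in enumerate(pooled)}
    n1, n = len(con), len(pooled)
    expected = n1 * (n + 1) / 2  # expected control rank sum under H0
    observed = abs(sum(rank[v] for v in con) - expected)
    # Every possible set of control ranks is equally likely under H0.
    sums = [sum(c) for c in combinations(range(1, n + 1), n1)]
    extreme = sum(abs(s - expected) >= observed for s in sums)
    return extreme / len(sums)


gene_001 = ([-0.173575646, 0.519571535, -5.87735812],
            [2.023648932, 1.94668789, 1.56102541])
gene_002 = ([-0.185999458, -0.118197772, 0.129667462],
            [-0.071623581, 0.249618688, -0.003465339])

print("Gene_001", exact_mwu_p(*gene_001))
print("Gene_002", exact_mwu_p(*gene_002))
```

Gene_001 is perfectly separated (all controls below all mutants), yet only 2 of the 20 arrangements are that extreme, so the exact p-value cannot go below 0.1 with three versus three samples, and that is before any multiple-testing correction.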