Question

Kruskal-Wallis rank test between different subsets of data table

5.8 years ago

dzisis1986 ▴ 70

Hello, I have a table like this:

chr  start    end    con_1_1   con_1_2   con_1_3  con_2_1  con_2_2  con_2_3
    1      1   7512 0.45180723 0.21982759 0.06666667 0.4105960 0.1024735 0.2284710
    1  13169  20070 0.07142857 0.77631579 0.90434783 0.1363636 0.8985507 0.6033058
    1  36598  37518 0.13750000 0.43300248 0.09113300 0.9612403 0.1233596 0.7459016
    1  37512  40365 0.64940239 0.95954693 0.46091644 0.7251656 0.1325648 0.4121901
    1  40359  48801 0.09504132 0.96491228 0.15428571 0.6388889 0.5165165 0.8050847
    1  77084  83129 0.91773779 0.28978224 0.56115108 0.9587302 0.5469256 0.6995614
    1  83123  87907 0.86226415 0.05175159 0.93600000 0.8953975 0.5000000 0.8991597
    1  87901  90973 0.08943089 0.08850365 0.60000000 0.3804809 0.8990385 0.9858824
    1 101231 108778 0.11898734 0.40900735 0.08300781 0.7094156 0.4553571 0.2787356
    1 108792 109423 0.12676056 0.24483776 0.56803456 0.4175824 0.3546196 0.5549451

My data cover 2 conditions with 3 replicates each. For each row, I would like to run a Kruskal-Wallis rank sum test, which means that in each row the 3 con1 values are tested against the 3 con2 values. At the end I want a final table with chr, start, end, and one column holding the test result for each row.

This is what I tried, but it is very slow.

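newdata <- read.csv("table.txt", header = TRUE, sep = "\t")
len <- nrow(newdata)
# The group labels are the same for every row, so build them once
flabel <- factor(c(rep("con1", 3), rep("con2", 3)))
for (j in 1:len) {

  # Take row j (the original indexed with len, which tested the last row every time)
  data <- newdata[j, ]
  # Columns 4-6 are con1, columns 7-9 are con2
  data1 <- c(data[, 4], data[, 5], data[, 6], data[, 7], data[, 8], data[, 9])
  datav <- data.frame(flabel, data1)
  test <- kruskal.test(data1 ~ flabel, data = datav)
  print(test$p.value)

}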

Any suggestions for running the Kruskal-Wallis test on each row in a faster way?

Thank you

R statistics kruskal-walls test • 3.5k views

ADD COMMENT • link updated 5.8 years ago by gbl1 ▴ 80 • written 5.8 years ago by dzisis1986 ▴ 70

Are all "con_1_1" reads from the same individual, replicate, or whatever, and are the rows different fragments of the same thing (chromosome, etc.)?

ADD REPLY • link 5.8 years ago by gbl1 ▴ 80

Ram · Answer 1 · 2019-02-01

5.8 years ago

gbl1 ▴ 80

Hi, I think you are not looking for the Kruskal-Wallis test but for the Mann-Whitney test. The R function is wilcox.test().

So, for the first line, you would have to run something like:

wilcox.test(as.numeric(newdata[1, 4:6]), as.numeric(newdata[1, 7:9]))

If it does not work, let me know and I will try it ;)
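
To run the same test on every row without an explicit for loop, one option is apply() over the six value columns. This is only a sketch: it assumes columns 4-6 hold con1 and columns 7-9 hold con2, as in the question's table, and it uses the first three rows of that table as a stand-in for the full file. The helper name row_p and the column name p_value are made up for illustration.

```r
# First three rows of the table from the question, as a stand-in for table.txt
newdata <- data.frame(
  chr     = c(1, 1, 1),
  start   = c(1, 13169, 36598),
  end     = c(7512, 20070, 37518),
  con_1_1 = c(0.45180723, 0.07142857, 0.13750000),
  con_1_2 = c(0.21982759, 0.77631579, 0.43300248),
  con_1_3 = c(0.06666667, 0.90434783, 0.09113300),
  con_2_1 = c(0.4105960, 0.1363636, 0.9612403),
  con_2_2 = c(0.1024735, 0.8985507, 0.1233596),
  con_2_3 = c(0.2284710, 0.6033058, 0.7459016)
)

# Hypothetical helper: v is one row of the six value columns as a numeric vector
row_p <- function(v) wilcox.test(v[1:3], v[4:6])$p.value

# apply() turns the value columns into a numeric matrix and calls row_p per row
newdata$p_value <- apply(newdata[, 4:9], 1, row_p)

# Final table: coordinates plus one p-value per row
result <- newdata[, c("chr", "start", "end", "p_value")]
print(result)
```

With real data, read the file as in the question and keep only the last three statements.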

ADD COMMENT • link updated 5.8 years ago by Ram 44k • written 5.8 years ago by gbl1 ▴ 80

Why not Kruskal-Wallis? Is Mann-Whitney suggested because I have 3 values for each condition? What is the difference? How would I use this or another test in the for loop on a big data set with more than 10000 lines?

ADD REPLY • link 5.8 years ago by dzisis1986 ▴ 70

Mann-Whitney/Wilcoxon is for the comparison of two groups. Kruskal-Wallis extends it to 3 or more groups. If you want to compare two conditions, then Mann-Whitney is the right test.
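
To make the relationship concrete, here is a small sketch using the first row of the question's table. With only two groups, kruskal.test() reports the same p-value as the normal approximation of wilcox.test() without continuity correction, whereas the default wilcox.test() call on such small samples without ties uses the exact distribution. The variable names are chosen here for illustration.

```r
# First row of the table from the question
con1 <- c(0.45180723, 0.21982759, 0.06666667)
con2 <- c(0.4105960, 0.1024735, 0.2284710)

values <- c(con1, con2)
group  <- factor(rep(c("con1", "con2"), each = 3))

# Kruskal-Wallis on two groups (chi-squared approximation, 1 degree of freedom)
p_kw <- kruskal.test(values, group)$p.value

# Mann-Whitney with the normal approximation and no continuity correction
p_mw_approx <- wilcox.test(con1, con2, exact = FALSE, correct = FALSE)$p.value

# Mann-Whitney with the exact distribution (the default for small samples without ties)
p_mw_exact <- wilcox.test(con1, con2)$p.value

print(c(kruskal = p_kw, mw_approx = p_mw_approx, mw_exact = p_mw_exact))
```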

ADD REPLY • link 5.8 years ago by Jean-Karim Heriche 27k

As explained, you are comparing 2 groups of data, and each of those groups needs at least 4 values to get a p-value < 5% with MWW (Mann-Whitney-Wilcoxon)… Oops, I'm afraid you won't get any relevant results. When I know I will be doing non-parametric tests, I always take 5 samples/measurements.
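
The limit on sample size can be checked directly. The sketch below feeds wilcox.test() the most extreme arrangement possible (every value in one group below every value in the other) and compares it with the count of possible group assignments; min_p is a helper name made up for this illustration.

```r
# Most extreme case with 3 vs 3 values: complete separation of the groups
p3 <- wilcox.test(1:3, 4:6)$p.value
print(p3)  # 0.1

# Same with 4 vs 4 values
p4 <- wilcox.test(1:4, 5:8)$p.value
print(p4)  # 0.02857143

# Smallest possible two-sided exact p-value for n vs n values without ties:
# 2 of the choose(2n, n) equally likely group assignments are this extreme
min_p <- function(n) 2 / choose(2 * n, n)
print(min_p(3:5))
```

So with 3 replicates per condition, no row can reach p < 0.05 with the exact two-sided test, whatever the values are.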

ADD REPLY • link 5.8 years ago by gbl1 ▴ 80

So what kind of non-parametric or other test do you suggest in my case? It is fixed that I will have 3 samples for each condition.

ADD REPLY • link 5.8 years ago by dzisis1986 ▴ 70

I advise you to replicate your experiment and pool the data… To my knowledge, there is no test able to analyse your data.

If one exists, it might be of the "exact test" type… I only know Fisher's exact test, which is like a chi-square test, so it is not appropriate for your case… As biologists, we are used to doing only 3 replicates… It is a terrible way of working. It could be enough for a Student's t-test, but I have never actually met a case where a Student's t-test was appropriate (even if everyone uses it).

Maybe there is a possibility… What are your conditions and measurements, and what do start and end mean? Is there anything you could pair?

ADD REPLY • link 5.8 years ago by gbl1 ▴ 80

Hi, thank you for your advice. I mostly agree with you, but in my case we have ranked counts in fragments, so start and end are the coordinates of each fragment. For each fragment, I want to compare the ranks in the replicates of one condition with those of the other condition.

ADD REPLY • link 5.8 years ago by dzisis1986 ▴ 70

Fragment of?

Are all con_1_1 values related to the same individual?

Is there correlation?

ADD REPLY • link 5.8 years ago by gbl1 ▴ 80

No, each line is a fragment. So for each fragment we have 2 conditions with 3 replicates, and for each line we want to run a non-parametric test on those 2 groups (3 values each). In the end, the table above will have one extra column holding the test result for each line!

ADD REPLY • link 5.8 years ago by dzisis1986 ▴ 70

Replicate the experiment in the same way, pool the data, and do an MWW test… I wonder what your fragments are?

ADD REPLY • link 5.8 years ago by gbl1 ▴ 80

Fragments are coordinates in the genome where reads are found. I use the read counts after mapping and then calculate the ranks. The result of the ranking is a table like the one above, and I want to perform a statistical test on it.

ADD REPLY • link 5.8 years ago by dzisis1986 ▴ 70

Why are your "ranks" decimal?

I am trying to understand; there might be a solution, but I have to be sure I get what you are doing…

ADD REPLY • link 5.8 years ago by gbl1 ▴ 80

The ranks come from using R's rank function to normalize the count data. It is a way of normalizing NGS data in order to get a result between 0 and 1. The bigger the rank, the more counts.
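
One way such a rank normalization might look is sketched below. The exact scaling used for the table is not stated in the thread, so dividing by the number of fragments is an assumption, and the counts vector is hypothetical.

```r
# Hypothetical read counts for five fragments in one sample
counts <- c(12, 0, 45, 7, 30)

# Rank the counts, then scale by the number of fragments so values fall in (0, 1]
# (assumed scaling; the thread does not say how the ranks were rescaled)
normalized <- rank(counts) / length(counts)
print(normalized)  # 0.6 0.2 1.0 0.4 0.8
```

The fragment with the most reads gets the value 1, which matches the statement that a bigger rank means more counts.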

ADD REPLY • link 5.8 years ago by dzisis1986 ▴ 70