Question

Consistency of group memberships across variables

8.2 years ago

mforde84 ★ 1.4k

Hi,

I did some similarity network fusion with mRNA and miRNA data, and I'm generating a variety of candidate clusterings, each with between 2 and 5 clusters. I'm interested in testing for membership similarity between multiple categorical variables, in particular those memberships that predict the same optimal number of clusters.

For instance, let's say that two categorical variables with 3 levels have the following membership:

Group1
1
1
2
2
3
3

Group2
2
2
1
1
3
3

I want to test how consistently samples group together across these variables. The labels (1, 2, 3) are arbitrary and strictly qualitative. In this instance, it would be a perfect match, because label 1 in Group1 maps exactly onto label 2 in Group2 and vice versa, while label 3 maps onto itself.

Is there a statistical test that I can apply to test this? I had read that chi-square might be appropriate, but I'm still a little fuzzy on how to interpret it in my application, since I don't think it accounts for the semantic equivalence between 1 and 2 in the different groups.

Any suggestions?

membership • 1.9k views

updated 8.2 years ago by Jean-Karim Heriche 27k • written 8.2 years ago by mforde84 ★ 1.4k

Well? Does anyone have any suggestions? I mean, come now, this isn't Stack Exchange, guys.

8.2 years ago by mforde84 ★ 1.4k

The simplest thing to do is to make a stacked bar plot of your data and look at how the groups are distributed. You should recode your samples to avoid semantic issues. I don't think any statistic will 'help' you in this matter. At this point your data seem to be purely frequencies across a small number of groups and a small number of samples...
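
One way to do the recoding, as a minimal sketch: relabel each variable by the order in which its labels first appear, so that two variables describing the same partition end up with identical codes. The helper name `relabel_by_first_appearance` is made up for illustration, and the data are the Group1/Group2 values from the question. Note that this only detects an exact match; it does not quantify partial agreement.

```python
def relabel_by_first_appearance(labels):
    """Map each label to the order in which it first appears (0, 1, 2, ...)."""
    mapping = {}
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
    return [mapping[label] for label in labels]


# Example memberships from the question (samples are in the same order in both).
group1 = [1, 1, 2, 2, 3, 3]
group2 = [2, 2, 1, 1, 3, 3]

recoded1 = relabel_by_first_appearance(group1)
recoded2 = relabel_by_first_appearance(group2)

# Identical recoded vectors mean the two variables define the same partition.
same_partition = recoded1 == recoded2
print(recoded1, recoded2, same_partition)
```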

8.2 years ago by theobroma22 ★ 1.2k

Sounds reasonable. If I could recode them properly, I could even do a contingency table.
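
For what it's worth, a contingency table does not strictly need the labels to be recoded first, since it only counts how often each pair of labels co-occurs. Below is a rough sketch using only the standard library; `contingency_table` and `is_perfect_match` are hypothetical helper names, and the data are the example memberships from the question.

```python
from collections import Counter


def contingency_table(a, b):
    """Cross-tabulate two label vectors; rows follow a's labels, columns b's."""
    counts = Counter(zip(a, b))
    rows = sorted(set(a))
    cols = sorted(set(b))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


def is_perfect_match(table):
    """True if every row and every column has exactly one nonzero cell."""
    rows_ok = all(sum(1 for v in row if v > 0) == 1 for row in table)
    cols_ok = all(sum(1 for v in col if v > 0) == 1 for col in zip(*table))
    return rows_ok and cols_ok


group1 = [1, 1, 2, 2, 3, 3]
group2 = [2, 2, 1, 1, 3, 3]

rows, cols, table = contingency_table(group1, group2)
for label, row in zip(rows, table):
    print(label, row)
print("perfect match:", is_perfect_match(table))
```

In a perfect match, each Group1 label lands in a single Group2 column regardless of what the labels are called, so the table is a permuted diagonal.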

8.2 years ago by mforde84 ★ 1.4k

score 1 · Answer 1 · 2017-07-03

Your problem amounts to measuring similarity between sets. There are plenty of similarity measures for sets (e.g. Jaccard index), and you can get a p-value for the overlap between two sets using the hypergeometric distribution.
As for the semantic relationship, only you can tell how to account for it since we have no information on this. The standard way of dealing with semantic relations is through ontologies.
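
As an illustration of the two measures mentioned above, here is a minimal sketch that treats each cluster as a set of sample indices. It assumes samples are numbered 0 to 5 in the order given in the question, and that the p-value of interest is the upper tail of the hypergeometric distribution (the probability of an overlap at least as large as the observed one when one set is drawn at random from all samples). The function names are made up for this example, and it compares one pair of clusters at a time.

```python
from math import comb


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def hypergeom_overlap_pvalue(a, b, population_size):
    """P(overlap >= observed) if b were drawn at random, without replacement."""
    a, b = set(a), set(b)
    k = len(a & b)            # observed overlap
    K, n, N = len(a), len(b), population_size
    upper = min(K, n)         # largest overlap that is possible
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Samples 0 and 1 form cluster 1 in Group1 and cluster 2 in Group2.
cluster_in_group1 = {0, 1}
cluster_in_group2 = {0, 1}
n_samples = 6

similarity = jaccard(cluster_in_group1, cluster_in_group2)
p_value = hypergeom_overlap_pvalue(cluster_in_group1, cluster_in_group2, n_samples)
print(similarity, p_value)
```

With only six samples the tail probability cannot get very small, so the p-value is mainly informative for larger cohorts.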