A Rank-Weighted Similarity Score
a3cel2 ▴ 50 · 11.9 years ago

I want to compare two runs of a similar experiment that ranks genes by an arbitrary score. Any standard correlation metric (e.g. Spearman rank correlation) says the two runs are not very similar. However, this is because the experiment is set up in such a way that the top scores are measured very confidently, while further down the list the results become mostly noise. So let's say I had 1000 genes measured and only the top ~30 or so from each run are the ones I care about.

Is there any similarity metric I could use that gives a higher penalty for differences in rank between high-confidence genes (so a gene that is rank 10 in one experiment and rank 100 in the other is punished heavily), with the penalty dropping off as the ranks being compared get lower (so I don't particularly care if something is rank 500 in one experiment and rank 1000 in the other)?

I could simply do a top-N overlap approach, but this seems a bit simplistic, and I don't know how to pick N in an unbiased fashion.

matted 7.8k · 11.9 years ago

One approach would be to find an empirical significance threshold, identify "positive" genes by that criterion, and assess overlap between positive gene sets in a standard way (e.g. hypergeometric test).
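
For example, here is a minimal Python sketch of that hypergeometric overlap test; the gene universe size, toy hit lists, and threshold are placeholders rather than anything from the original question:

    # Hypergeometric test for the overlap of two "positive" gene sets.
    # The gene universe and hit lists are toy data; in practice hits_a and
    # hits_b would be the genes passing your significance cutoff in each run.
    from scipy.stats import hypergeom

    N = 1000                                        # genes measured in both runs
    hits_a = {f"gene{i}" for i in range(0, 40)}     # positives in run A (toy)
    hits_b = {f"gene{i}" for i in range(10, 55)}    # positives in run B (toy)

    overlap = len(hits_a & hits_b)
    # P(overlap >= observed) when drawing len(hits_b) genes out of N,
    # of which len(hits_a) are "successes"
    p = hypergeom.sf(overlap - 1, N, len(hits_a), len(hits_b))
    print(f"overlap = {overlap}, hypergeometric p = {p:.3g}")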

If you're feeling more technical, you could look into IDR ("irreproducible discovery rate"), a concept/statistical framework developed by the ENCODE project. It deals with this exact question of how to compare ranked lists of scores (assigned to genomic regions). See here and here for more details.

Ryan Thompson ★ 3.6k · 11.9 years ago

You can make CAT plots using the BioConductor package matchBox. These plots essentially plot top-N overlap for all values of N.
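
matchBox is an R/Bioconductor package; purely to illustrate the idea behind a CAT (correspondence-at-the-top) curve, here is a rough Python sketch of top-N overlap computed for every N, using made-up rankings (this is not the matchBox implementation):

    # CAT curve: for every N, the fraction of genes shared between the top N
    # of two rankings. Toy random rankings stand in for real runs.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    genes = np.array([f"gene{i}" for i in range(1000)])
    rank_a = rng.permutation(genes)          # ranked gene list from run A (toy)
    rank_b = rng.permutation(genes)          # ranked gene list from run B (toy)

    ns = range(1, len(genes) + 1)
    cat = [len(set(rank_a[:n]) & set(rank_b[:n])) / n for n in ns]

    plt.plot(list(ns), cat)
    plt.xlabel("N (size of top list)")
    plt.ylabel("proportion of genes in common")
    plt.title("CAT plot (toy data)")
    plt.show()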

I was just wondering: in the vignette (http://www.bioconductor.org/packages/release/bioc/vignettes/matchBox/inst/doc/matchBox.pdf), is the correlation between dataSetA.t and dataSetB.t higher than that between dataSetA.t and dataSetC?

11.9 years ago

I don't think there is much of a difference between having to define N arbitrarily (which amounts to giving confidence 1 to the top N genes and confidence 0 to all the others) and having to define a confidence (or noise) measure that is close to 1 for the top genes and decreases towards 0 as you move down to the bottom of your experiment.

Correct me if I'm wrong, but I am under the impression that you do not have such a score available, so you would have to define it arbitrarily.

If you do have such a confidence score (let's say a p-value of some sort), I think you are looking for some kind of weighted rank correlation.
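
As one possible sketch of that idea (not a standard package function): compute a weighted Pearson correlation of the two rank vectors, with per-gene weights derived from the p-values so that disagreements among the confident genes dominate the score. The -log10(p) weighting and the toy data below are assumptions, just one choice among many:

    # Weighted rank correlation: Pearson correlation of the two rank vectors
    # with per-gene weights, so well-measured (low p-value) genes count most.
    # Toy data throughout; the -log10(p) weighting is one choice among many.
    import numpy as np
    from scipy.stats import rankdata

    def weighted_corr(x, y, w):
        """Weighted Pearson correlation of x and y with non-negative weights w."""
        x, y, w = map(np.asarray, (x, y, w))
        w = w / w.sum()
        mx, my = np.sum(w * x), np.sum(w * y)
        cov = np.sum(w * (x - mx) * (y - my))
        return cov / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))

    # Toy experiment: run B's scores are a noisy copy of run A's.
    rng = np.random.default_rng(0)
    n = 1000
    score_a = rng.normal(size=n)
    score_b = score_a + rng.normal(scale=0.5, size=n)
    rank_a = rankdata(-score_a)        # rank 1 = best gene in run A
    rank_b = rankdata(-score_b)        # rank 1 = best gene in run B

    # Stand-in p-values: confident at the top of run A, noise at the bottom.
    pvals = rank_a / n
    weights = -np.log10(pvals)         # small p-value -> large weight

    print("unweighted Spearman-like:", weighted_corr(rank_a, rank_b, np.ones(n)))
    print("confidence-weighted:     ", weighted_corr(rank_a, rank_b, weights))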

Actually, I do have a p-value available - so this is perfect, thank you!
