A Rank-Weighted Similarity Score
11.8 years ago
a3cel2 ▴ 50

I want to compare two runs of a similar experiment that ranks genes by an arbitrary score. Any standard correlation metric (e.g. Spearman rank correlation) says the two runs are not very similar. However, this is because the experiment is set up in such a way that the top scores are measured very confidently, but after a while the results become mostly noise. So let's say I had 1000 genes measured and only the top ~30 or so from each run are the ones I care about.

Is there any similarity metric I could use that gives a higher penalty for differences in rank between high-confidence genes (so I would punish it a lot if a gene is rank 10 in one experiment and rank 100 in the other), but where the penalty drops off as the ranks being compared get lower (so I don't particularly care if something is rank 500 in one experiment and rank 1000 in the other)?

I could simply do a top-N overlap approach, but this seems a little simplistic and I don't know how to pick N in an unbiased fashion.

11.8 years ago
matted 7.8k

One approach would be to find an empirical significance threshold, identify "positive" genes by that criterion, and assess overlap between positive gene sets in a standard way (e.g. hypergeometric test).
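For illustration, here is a minimal Python sketch of that idea (not part of the original answer; the gene counts, thresholds and positive sets are made-up placeholders), using scipy.stats.hypergeom to ask whether the overlap between the two positive sets is larger than expected by chance:

```python
# Hypergeometric test for the overlap of two "positive" gene sets.
# All numbers below are hypothetical; substitute your own sets.
from scipy.stats import hypergeom

n_genes = 1000                          # total genes measured in both runs
positives_a = set(range(40))            # genes passing the threshold in run A
positives_b = set(range(10, 55))        # genes passing the threshold in run B
overlap = len(positives_a & positives_b)

# P(overlap >= observed) when drawing |positives_b| genes out of n_genes,
# of which |positives_a| are marked as positive in run A.
p_overlap = hypergeom.sf(overlap - 1, n_genes,
                         len(positives_a), len(positives_b))
print(f"overlap = {overlap}, hypergeometric P = {p_overlap:.2e}")
```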

If you're feeling more technical, you could look into IDR ("irreproducible discovery rate"), a concept/statistical framework developed by the ENCODE project. It deals with this exact question of how to compare ranked lists of scores (assigned to genomic regions). See here and here for more details.

11.8 years ago
Ryan Thompson ★ 3.6k

You can make CAT (correspondence-at-the-top) plots using the Bioconductor package matchBox. These plots essentially show the top-N overlap between the lists for all values of N.
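To make the idea concrete, here is a rough Python sketch of a CAT curve computed by hand (an illustration of the concept only, not matchBox itself; the gene lists are simulated):

```python
# Correspondence-at-the-top (CAT) curve: for every list size N, the fraction
# of genes shared by the top N of two ranked lists.
import numpy as np
import matplotlib.pyplot as plt

def cat_curve(ranked_a, ranked_b):
    """Proportion of common items among the top N of both lists, for all N."""
    seen_a, seen_b = set(), set()
    curve = []
    for n, (a, b) in enumerate(zip(ranked_a, ranked_b), start=1):
        seen_a.add(a)
        seen_b.add(b)
        curve.append(len(seen_a & seen_b) / n)
    return np.array(curve)

# Simulated example: two runs whose top ~30 genes agree, noise below that.
rng = np.random.default_rng(1)
genes = [f"gene{i}" for i in range(1000)]
run_a = genes                                           # ordering from run A
run_b = genes[:30] + list(rng.permutation(genes[30:]))  # run B: top 30 reproducible

curve = cat_curve(run_a, run_b)
plt.plot(np.arange(1, len(curve) + 1), curve)
plt.xlabel("List size N")
plt.ylabel("Proportion of top-N genes in common")
plt.title("CAT plot")
plt.show()
```

Looking at where the curve drops off also gives a data-driven feel for a sensible N, rather than picking one arbitrarily.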


I was just wondering: in the vignette (http://www.bioconductor.org/packages/release/bioc/vignettes/matchBox/inst/doc/matchBox.pdf), is the correlation between dataSetA.t and dataSetB.t higher than that of dataSetA.t vs. dataSetC?

11.8 years ago

I don't think there is much of a difference between having to define N arbitrarily (which is like giving confidence 1 to the top N genes and confidence 0 to all the lower ones) and having to define a confidence measure (or noise measure) that is very high for the top genes (confidence around 1) and decreases towards 0 as you move down to the bottom of your experiment.

Correct me if I'm wrong, but I am under the impression that you do not have such a score available, so you would have to define it arbitrarily.

If you do have such a confidence score (let's say a p-value of some sort), I think you are looking for some kind of weighted rank correlation.
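For example, SciPy's weightedtau is one such top-weighted rank correlation: its penalty for rank disagreements shrinks as you move down the lists. A minimal sketch with simulated scores (the exponential decay over ~30 genes is just an assumed choice echoing your top-30 estimate, not a recommendation):

```python
# Top-weighted rank correlation with scipy.stats.weightedtau.
# Scores are simulated: both runs share a signal that is strong at the top
# of the list and drowns in noise further down.
import numpy as np
from scipy.stats import spearmanr, weightedtau

rng = np.random.default_rng(0)
n_genes = 1000
signal = np.linspace(10, 0, n_genes)
noise_sd = np.linspace(0.1, 5, n_genes)       # noise grows down the list
run_a = signal + rng.normal(scale=noise_sd)
run_b = signal + rng.normal(scale=noise_sd)

# Ordinary Spearman treats every rank swap equally.
rho, _ = spearmanr(run_a, run_b)

# weightedtau's default weigher is hyperbolic, 1/(rank + 1): swapping ranks
# 10 and 100 costs far more than swapping ranks 500 and 1000.
# (Elements are ranked by decreasing score, so higher score = more important.)
tau_hyp, _ = weightedtau(run_a, run_b)

# A steeper, assumed weigher concentrating the weight on roughly the top ~30.
tau_top, _ = weightedtau(run_a, run_b, weigher=lambda r: np.exp(-r / 30))

print(f"Spearman rho:                 {rho:.3f}")
print(f"Weighted tau (hyperbolic):    {tau_hyp:.3f}")
print(f"Weighted tau (exp decay ~30): {tau_top:.3f}")
```

The weigher is just a decreasing function of rank, so a confidence score such as a p-value can be folded in the same way, e.g. by deriving the ranking (or the weights) from -log10(p).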


Actually, I do have a p-value available - so this is perfect, thank you!
