Question

What does benchmark method exactly mean?


8.3 years ago

statfa ▴ 790

Hi,

My dictionary says, "benchmark means a level of quality that can be used as a standard when comparing other things". So a benchmark method is a standard method that can be used when you want to compare other methods.

As I have seen, some packages such as "edgeR", "DESeq", "EBSeq", etc. are benchmark methods for detecting DE genes. When you don't have access to simulated data and you want to compare two methods, can you assume that "edgeR" or any other benchmark method is the standard and use it to decide which of your methods is better? I mean, can we compare two models with "edgeR" and choose the closest one to edgeR as a better model?

I would also like to know how you decide which model is better when you have a real dataset. Is it the percentage of genes found by each method that overlap with edgeR's results? I know another thing we can do is check how many of the head genes (index genes) or housekeeping genes are found by each model. What else can you recommend, please?

Thank you.

benchmark methods edgeR DEG RNA-seq • 2.8k views


score 2 · Accepted Answer · 2017-03-07


8.3 years ago

Matteo Schiavinato ★ 3.7k

Hi, first of all I would recommend reading this paper: http://www.nature.com/nbt/journal/v32/n9/full/nbt.3000.html

Benchmarking is very popular in modern bioinformatics because new tools become available faster than anyone can assess which ones work better. Another open question is whether any single tool works best, or whether it simply depends on the data you have. In any case, benchmarking gives end users a reference to hold on to when trying to make sense of results.

can we compare two models with "edgeR" and choose the closest one to edgeR as a better model?

No. This is too simplistic and will lead you to errors. A good approach would be to test the data with more than two algorithms and take either the intersection of the results of all N algorithms or the union of the pairwise intersections (that is, genes found by at least two tools). Remember that there is no perfect way: try several approaches and see what you get.
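As a rough illustration of the two consensus rules just described, here is a minimal Python sketch. The tool names and gene IDs are made up; in practice each set would hold the genes one tool calls significant at the same threshold.

```python
from itertools import combinations

# Hypothetical DE gene calls from three tools (made-up names and gene IDs).
calls = {
    "toolA": {"g1", "g2", "g3", "g4"},
    "toolB": {"g2", "g3", "g4", "g5"},
    "toolC": {"g3", "g4", "g6"},
}

def strict_consensus(calls):
    """Genes called by every tool: the intersection of all N result sets."""
    return set.intersection(*calls.values())

def pairwise_consensus(calls):
    """Genes called by at least two tools: the union of all pairwise intersections."""
    shared = set()
    for a, b in combinations(calls.values(), 2):
        shared |= a & b
    return shared

print(sorted(strict_consensus(calls)))    # genes supported by all tools
print(sorted(pairwise_consensus(calls)))  # genes supported by two or more tools
```

The strict set is always a subset of the pairwise set, so the choice between them is a trade-off between fewer false positives and fewer missed genes.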

how do you decide which model is better when you have a real dataset?

I don't think there is a standardized way to achieve this. Wet-lab confirmation of in silico results is obviously the best thing; otherwise a sufficiently low adjusted p-value can be enough. It always depends on what you want to prove and how strongly you want to prove it. I think this is a field where you actually have to make a decision yourself, because there is no band on the gel telling you that what you have is correct.



Thanks a lot for your comprehensive explanation. My concern is that I have to compare two statistical models for my thesis, and since I don't have the housekeeping genes, I can't decide which model works better. That's why I suggested using a benchmark method as a standard for comparing the other two models. If I could simulate data, it would be easy to compare the models, but I don't know how to simulate data.

A good approach would be to test the data with more than 2 algorithms and either take the intersection of results of the N algorithms used or the union of the intersections between at least 2 tools

If I understand you correctly, you are saying I can find the DE genes common to two or more benchmark methods, use those genes as the index genes, and check how many of them are detected by each of my models, right? That seems good.

written 8.3 years ago by statfa ▴ 790


This is an idea, but it is quite elaborate and may not pay off more than other approaches. What I would do is use DESeq2, edgeR and limma (just three examples, use what you think is best!), trying to use the same parameters in each so as not to alter the outcome (that is, same alpha threshold, same null hypothesis, and so on). The intersection of the three programs' outputs is most likely to be a set of true positives!

A second level of positives is represented by the intersections between pairs of programs. For example, DESeq2 and edgeR may identify a set of common genes that limma does not; they are still confirmed by two programs, so you could keep them (if the adjusted p-value is sufficiently low).

A third level is composed of the genes that only one program identified. Be careful here: they might be the most significant ones! Some programs tend to overnormalize and others to undernormalize, and this can work out differently with different data. This means that a set of genes that only (for example) limma identifies might still contain true positives. With very low adjusted p-values, you might as well keep some of them, even if you can't trust them as pure gold.

In the end, perhaps filter your list of identified differentially expressed genes by converting the counts to TPM and checking whether the values you get are plausible. http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/
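The three levels above could be sketched in Python roughly as follows. Everything here is an assumption for illustration: the adjusted p-values are invented, and the stricter cutoff for single-tool genes (`STRICT_ALPHA`) is a hypothetical choice, not a recommended value.

```python
# Hypothetical adjusted p-values per tool (made-up genes and values).
padj = {
    "DESeq2": {"g1": 0.001, "g2": 0.03, "g3": 0.20, "g4": 0.0001},
    "edgeR":  {"g1": 0.002, "g2": 0.04, "g3": 0.01, "g4": 0.30},
    "limma":  {"g1": 0.004, "g2": 0.60, "g3": 0.30, "g4": 0.40},
}

ALPHA = 0.05          # same significance threshold applied to every tool
STRICT_ALPHA = 0.001  # assumed stricter cutoff for genes found by one tool only

def tier_genes(padj, alpha=ALPHA, strict_alpha=STRICT_ALPHA):
    """Group genes by how many tools call them significant at alpha."""
    support = {}  # gene -> {tool: adjusted p-value} for significant calls only
    for tool, table in padj.items():
        for gene, p in table.items():
            if p < alpha:
                support.setdefault(gene, {})[tool] = p

    tiers = {"all_tools": [], "two_or_more": [], "single_tool_strict": []}
    for gene, hits in sorted(support.items()):
        if len(hits) == len(padj):
            tiers["all_tools"].append(gene)           # level 1: every tool agrees
        elif len(hits) >= 2:
            tiers["two_or_more"].append(gene)         # level 2: at least two tools, not all
        elif min(hits.values()) < strict_alpha:
            tiers["single_tool_strict"].append(gene)  # level 3: one tool, very low p-value
    return tiers

print(tier_genes(padj))
```

Genes that only one tool calls and that miss the stricter cutoff are dropped here; whether to keep them is a judgment call that depends on the data.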
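For reference, a minimal Python sketch of the counts-to-TPM conversion for a single sample. It assumes you have a gene length in base pairs for every gene (many pipelines use an effective length instead); the counts and lengths below are made up.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to TPM for one sample.

    counts: dict mapping gene -> read count
    lengths_bp: dict mapping gene -> gene length in base pairs (assumed known)
    """
    # Step 1: normalise each count by gene length in kilobases (reads per kilobase).
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    total = sum(rpk.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Step 2: rescale so the values sum to one million within the sample.
    return {gene: value * 1e6 / total for gene, value in rpk.items()}

# Made-up example values.
counts = {"g1": 100, "g2": 300, "g3": 700}
lengths = {"g1": 1000, "g2": 1500, "g3": 1000}

tpm = counts_to_tpm(counts, lengths)
for gene, value in tpm.items():
    print(gene, round(value, 1))
```

Because length normalisation happens before the per-million scaling, TPM values always sum to one million within a sample, which makes it easy to spot genes whose expression is too low to be trusted.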