Question

P-hacking or not in pan-cancer analysis

6.1 years ago

Wenhu_Cao ▴ 100

Hi guys,

I have read a few pan-cancer analysis papers, really big papers in CNS journals, and I am confused about the way they handle the data.

Normally, the data come from TCGA or similar databases. These data were obviously collected without any scientific hypothesis beforehand, just dumped from large batches of sequencing runs and arrays (surely with careful selection, QC, normalization, etc.). What these papers normally do is first look for statistical differences across all samples, cancer types, genes, etc. Then they 'zoom in' and compare different subsets of samples, cancer types, genes or other things of interest, in order to find subtler statistical differences and more interesting phenomena. Finally, they build a story around the results.

My question is: isn't that a violation of statistical test assumptions? Aren't those comparisons multiple comparisons? Should we really analyze data after we see them and without any scientific hypothesis in advance?

Confused...

pan-cancer p-value genome • 1.1k views

updated 4.1 years ago by Jean-Karim Heriche 27k • written 6.1 years ago by Wenhu_Cao ▴ 100

Could you post links to these published manuscripts?

6.1 years ago by Kevin Blighe 88k

score 1 · Answer 1 · 2020-11-20

Should we really analyze data after we see them and without any scientific hypothesis in advance?

In general, the answer to this question would be no. Very often experiments are carried out with the aim of testing a specific hypothesis, and even when they are not, experiments are not carried out in a vacuum of knowledge. There's plenty of context to an experiment that can inform what is relevant. However, there is another type of experiment, going back to the roots of biology, which consists of characterizing a system. These are essentially observational studies. I believe some genomics/systems biology studies fall into this category while others fall into the first.

Isn't that a violation of statistical test assumptions? Aren't those comparisons multiple comparisons?

There's nothing wrong with exploring data to generate testable hypotheses, but the problem is that many papers stop there and those hypotheses are never tested. Because of this, people in other fields of biology often regard genomics and bioinformatics in general as producing noise. If all you have as evidence is a p-value, you're not going to convince many people that you're onto something interesting. Why? Because there's no prescribed relationship between statistical significance and biological relevance. It is fair to use statistics to decide where to focus efforts and resources, but to be credible, p-value-derived hypotheses need to be experimentally tested.

To come back to the cancer genomics field, I find the situation analogous to high-throughput screening. You collect data and find hits statistically, but then you validate at least some of these hits in targeted, independent experiments. No screening paper would be published without that last part, but in many genomics studies the hit validation part is missing.
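To make the multiple-comparisons worry and the "screen, then validate" idea concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration, not taken from any of the papers discussed: the data are pure noise (no gene truly differs between groups), the test is a simple two-sided z-test with known unit variance, and the gene count, group size, function names and the 0.05 cutoff are arbitrary choices. It counts how many "hits" an uncorrected screen produces, how many survive a Benjamini-Hochberg correction, and how many replicate in a second, independent simulated cohort.

```python
import random
from statistics import NormalDist, mean


def two_sided_z_pvalue(group_a, group_b):
    """Two-sided z-test for a difference in means, assuming unit variance in both groups."""
    se = (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_null_pvalues(n_genes, n_per_group, rng):
    """One p-value per gene; both groups come from the same N(0, 1), so every null is true."""
    pvals = []
    for _ in range(n_genes):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        pvals.append(two_sided_z_pvalue(a, b))
    return pvals


def benjamini_hochberg(pvals, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure at FDR level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under its BH threshold.
        if pvals[idx] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


rng = random.Random(0)  # fixed seed so the run is repeatable

# "Screen": test every gene in a discovery cohort with no real signal.
discovery = simulate_null_pvalues(n_genes=2000, n_per_group=20, rng=rng)
raw_hits = [i for i, p in enumerate(discovery) if p < 0.05]
bh_hits = benjamini_hochberg(discovery, alpha=0.05)

# "Validation": re-test only the raw hits in an independent cohort.
validation = simulate_null_pvalues(n_genes=2000, n_per_group=20, rng=rng)
replicated = [i for i in raw_hits if validation[i] < 0.05]

print("uncorrected hits (p < 0.05):", len(raw_hits))
print("hits after Benjamini-Hochberg:", len(bh_hits))
print("raw hits replicating in independent cohort:", len(replicated))
```

With 2000 tests on noise you should expect roughly 5% of genes to pass an uncorrected 0.05 cutoff, while very few or none survive correction or replicate. Real pan-cancer data are of course not pure noise and their tests are correlated, so treat this only as an illustration of why a p-value from a screen is a starting point rather than evidence by itself.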