Forum: Is "training", fine-tuning, or overfitting on "external independent validation datasets" considered cheating or scientific misconduct?
2
10 weeks ago
ivicts ▴ 10

Several computational biology/bioinformatics papers publish their methods, in this case machine learning models, as tools. To show how well their tools generalize to other datasets, many papers claim impressive numbers on "external independent validation datasets" even though they have "tuned" their parameters on those very datasets. What they report is therefore a best-case scenario that will not generalize to new data, which matters especially when the method is released as a tool. Someone can claim a better metric than the state of the art simply by overfitting to the "external independent validation datasets".

Suppose a model gets AUC = 0.73 on the independent validation data while the current best method reports AUC = 0.80. The authors then "tune" their model on that validation data until it reaches AUC = 0.85 and publish that number. At that point the dataset is no longer an "independent external validation set", because the hyperparameters had to be changed for the model to work well on it. If the model is released as a tool, end users cannot retune the hyperparameters to recover that performance on their own data. What is really being published is a best-case proof of concept; it should not be published as a tool, and calling it a tool is disingenuous.

Would this be considered "cheating" or "scientific misconduct"?

If it is not cheating, then the easiest way to beat the best method is to pick our own "independent external validation set", tune our model on it, and compare against another method that was never tuned on that dataset. That way, we can always beat the best method.

I know that overfitting to benchmarks is common in ML papers, but ML papers rarely present their method as a tool that generalizes and that has been validated on "external independent validation datasets".
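To make the scenario concrete, here is a purely hypothetical, synthetic sketch (my own toy example using scikit-learn, not taken from any real paper) of what tuning against the so-called external set does: a hyperparameter is chosen by repeatedly scoring on that set, and the selected score is then compared against data the selection never touched.

```python
# Hypothetical, synthetic illustration only: "tuning" a hyperparameter against
# the same dataset that is later reported as the "external validation set".
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=30, n_informative=5,
                           random_state=0)
# Split into training data, the so-called "external validation set",
# and data that truly never influences any choice.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=1)
X_ext, X_new, y_ext, y_new = train_test_split(X_rest, y_rest, test_size=0.5, random_state=2)

best_auc_ext, best_model = -np.inf, None
for depth in (2, 3, 5, 8, None):                      # repeated "peeks" at the external set
    model = RandomForestClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    auc_ext = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])
    if auc_ext > best_auc_ext:                        # keep whatever looks best on it
        best_auc_ext, best_model = auc_ext, model

auc_new = roc_auc_score(y_new, best_model.predict_proba(X_new)[:, 1])
print(f"AUC reported on the tuned 'external' set: {best_auc_ext:.3f}")
print(f"AUC on data the tuning never saw:         {auc_new:.3f}")
```

The first number is optimistically biased by the selection; how large the gap is depends on the dataset size and on how many configurations were tried, but the reported "external" performance is no longer an unbiased estimate of what a new user will see.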

tools scientific misconduct overfitting • 1.5k views
2

“scientific misconduct” presumes some sort of intention to commit fraud. If someone does it knowingly and intentionally and hides the fact they did it and the effect is severe, then one could make an argument for it. But for the most part, I’d rather call it inaccurate.

For bioinformatics, if someone sets, say, a numerical parameter that causes their metric to appear better than the state-of-the-art but that parameter was determined after testing a ton of parameter values on that dataset, then the way they tested it is inaccurate.

0

Yes, but they report performance on the "external independent validation set" when it is not really independent, since they have tested on it repeatedly. Isn't that fraud, especially when other papers genuinely report results on an "external independent validation set" without tuning? In that case the difference between the state of the art and everything else is just hyperparameter fine-tuning, not that someone's method is actually better.

1

Again, doing something incorrectly does not mean engaging in misconduct.

0

Yeah, but suppose someone claims their method achieves AUC = 0.95 on external data, and when you run it on your own data you only get AUC = 0.75, so you cannot reproduce their result on external data. Would that be misconduct?

Or take the earlier example: a model gets AUC = 0.73 on the independent validation data, the best method reports AUC = 0.80, and the authors "tune" their model on the validation data until it reaches AUC = 0.85 so the paper can be published. Is that scientific misconduct, or is it just inaccurate?

When does doing something incorrectly become misconduct? If the inaccuracy is intentional, does it become misconduct?

0

No, that is not misconduct. Misconduct and doing something incorrectly are two completely different things.

Doing something incorrectly and claiming AUC 0.95 could be an innocent mistake. Yes, the paper should be corrected (or retracted if a correction is not possible). But no misconduct.

I've already explained what the line for misconduct would be in my initial comment.

Innocent until proven guilty. Making a mistake is not a crime. As another example, claiming a significant p-value when the true p-value is not significant is, in many cases, not misconduct (if it were, you'd have to prosecute thousands of professors in biology or medicine).

As was already stated, it's hard to have this discussion based purely on hypotheticals, because you're omitting context (was it carelessness? was it ignorance? was it deliberate? etc.).

0

Oh, okay, so basically you are saying it is not misconduct, but it is a kind of cheating that should not be done? Even if the paper is later corrected, it will already have been published in a high-impact journal, right?

And in the p-value hacking case you describe, does the paper need to be retracted? Or suppose the datasets used to train the models were not consistent: there is a newer version of the dataset that is "easier" (because of a different random seed and slightly different code), and the authors used the two versions interchangeably while presenting them as a single dataset in the paper. Is that misconduct?

Of course, in this case the context is deliberate and intentional, done to get the paper into a high-impact journal.

0

If you're deliberately cheating to get into a high impact journal, then yes, it's misconduct.

If, even with a correction, the main conclusion can't hold true, then that's when retraction is recommended.

For p-value hacking, again, that's why my original comment said "if the effect is severe". If your type I error is slightly inflated because you did some peeking, it's bad practice, but the paper won't be retracted. Unless, of course, based on that study you conclude that vaccines cause autism or something crazy like that.
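To put a rough number on the peeking point, here is a toy simulation (my own made-up setup, nothing to do with any real study): test after every batch of samples under a true null and stop as soon as p < 0.05.

```python
# Rough toy simulation of optional stopping ("peeking") under a true null:
# the two groups have identical means, yet stopping at the first p < 0.05
# pushes the realized type I error above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_batches, batch_size, alpha = 2000, 10, 20, 0.05

false_positives = 0
for _ in range(n_sims):
    a = np.empty(0)
    b = np.empty(0)
    for _ in range(n_batches):
        a = np.concatenate([a, rng.normal(size=batch_size)])
        b = np.concatenate([b, rng.normal(size=batch_size)])
        if stats.ttest_ind(a, b).pvalue < alpha:   # peek after every batch
            false_positives += 1
            break

print(f"nominal alpha: {alpha:.2f}")
print(f"realized type I error with peeking: {false_positives / n_sims:.3f}")
```

With ten peeks at a nominal 5% level the realized type I error comes out well above 5%, which is exactly the kind of inflation I mean by "bad practice" rather than misconduct.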

0

Is the story I described, where someone deliberately fine-tunes on an external independent validation set in order to get into a high-impact journal, considered misconduct? The context here is intentional.

Because it would be very easy to get into high-impact journals that way: we pick our own "independent external validation set", tune our model on it, and compare against another method that was never tuned on that dataset. This way, we can always beat the best method.

0

Yes, that would likely be misconduct.

0

Okay, great... at least we agree on something after going back and forth for a while.

0
10 weeks ago
LChart 4.3k

It depends almost entirely on the claim that's being made. Mostly the claims treat "external datasets" as benchmarks, and even in ML benchmark performance is understood to be a pretty noisy indicator of superiority, except in cases where relative improvements in the high double-digits are achieved. I don't think anyone makes the claim that superiority on benchmarks A, B, C implies that there will be superiority on a newly-constructed yet-to-be-seen benchmark. And, in some sense, all models are "fine tuned" to benchmarks - in the sense that if the model you come up with performs worse on a benchmark, you modify it or simply scrap the underlying idea.

As an aside - tools don't have a concept of "generalizability" -- models do. You can't really talk about alignment as generalizing or failing to generalize. It'd be helpful if you could be more specific about precisely what kinds of "tools" you are referring to.

0

"tools" in my mind in this question is an ML model that is trained to predict something for bioinformatic purposes. Yes, but ML papers never claim about "external independent validation datasets". They just call it a test set. I feel that bioinformatics claims that "external independent validation datasets" as a proxy of generalization when other people use their tools on their data. But, this cannot be the proxy if it is "fine-tuned". Fine-tuned means the best-case scenario, not a proxy of generalization.

0
10 weeks ago
Mensur Dlakic ★ 28k

To show how well their tools generalize to other datasets, many papers claim impressive numbers on "external independent validation datasets" even though they have "tuned" their parameters on those very datasets. What they report is therefore a best-case scenario that will not generalize to new data, which matters especially when the method is released as a tool.

To the best of my knowledge, hardly anyone does this. Why would anyone accept models that do not generalize well?

Many researchers train on one set of data and get an estimate of the model's performance, usually by cross-validation. This is where external independent data comes in: those models are then evaluated on external validation data that was not seen during training. The idea is that a cross-validated AUC = 0.98 on the training data means very little if the same model gets AUC = 0.73 on independent validation data; that would be a clear case of overfitting.
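A minimal sketch of that workflow (synthetic data and scikit-learn, purely for illustration, not from any specific paper): hyperparameters are chosen by cross-validation on the training split only, and the external data is scored exactly once after the model is frozen.

```python
# Minimal synthetic sketch of the protocol described above (not from any
# specific paper): cross-validate on the training data only, then score the
# frozen model a single time on external data it never influenced.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, n_features=40, n_informative=6,
                           random_state=0)
X_train, X_ext, y_train, y_ext = train_test_split(X, y, test_size=0.3, random_state=0)

# All hyperparameter choices happen inside cross-validation on the training split.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1, 10]},
                      scoring="roc_auc", cv=5).fit(X_train, y_train)

print(f"cross-validated AUC on training data: {search.best_score_:.3f}")
# One evaluation on the external set; nothing is changed after this point.
auc_ext = roc_auc_score(y_ext, search.predict_proba(X_ext)[:, 1])
print(f"AUC on external validation data:      {auc_ext:.3f}")
```

If the external AUC collapses relative to the cross-validated one, that is the overfitting signal described above; going back and re-tuning until the external number looks good would turn that dataset into just another development set.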

Having said that, there are authors who do not apply proper training and validation practices, but that is usually out of ignorance rather than malice. This I know firsthand, having reviewed some papers where this was the case. There are likely to be at least some reviewers who also don't know the best practice, so a substandard paper might slip through here and there. Still, you are making it sound as if this is fairly frequent in the literature, and that is not the case.

0

How would reviewers know whether a dataset is a genuine "external independent validation set" or one the model was "fine-tuned" on? Again: a model gets AUC = 0.73 on the independent validation data, the best method reports AUC = 0.80, so the authors "tune" the model on the validation data until it reaches AUC = 0.85 and publish that. Is this scientific misconduct?

1

Any deliberate data fabrication is scientific misconduct, yet I don't know of anyone who does what you brought up in your hypotheticals. How would you even know about it? Most reputable journals would not publish a minuscule improvement in evaluation results unless it was achieved by some kind of methodological improvement.

You keep mentioning tuning and fine-tuning. Publishing in a professional journal is not about pushing a model forward by 0.5% by exhaustively trying many sets of parameters. If anyone actually used the validation data for training, without reporting it, in order to obtain a better validation score later, that would be fraud. Again, how do you know that has happened?

I don't think anyone can debate this with you in depth while talking only in hypotheticals.

0

Yes, but no one knows whether an improvement comes from the methodology or from fine-tuning, right? The exact same model can perform better if it is tuned specifically on the test set.

No one will fine-tune a model for an extra 0.5%; it will probably be tuned until there is a 10% improvement, or at least something better than the state of the art. I am not talking about using the validation data for training. I mean "tuning" the hyperparameters against the "independent external validation set": if the model does not work well on it, you change the hyperparameters until it beats the state of the art. At that point the test dataset is not really an "independent external validation set", because the hyperparameters had to be changed for the model to work well on that data.

0

This is literally how it works in ML. Model architectures that are trained on a given dataset and tested on validation tasks do not get submitted for publication unless they perform better on the tasks. There is a very strong publication censorship bias, and this is very well understood, which is why there are multiple different performance metrics, the idea being that it's hard to find hyperparameters that coincidentally work well across multiple tasks.

It should be noted that even without any hyperparameter tuning by the authors, this censorship still exists: it operates at the level of the submission and peer-review process, not at the level of any individual author. One lab chooses lambda=2 and doesn't publish; another lab has the same idea, chooses lambda=1/2, and publishes.

The solution is to have more, and more varied, validation tasks, rather than moral hyperventilation.

0

Yes, I know, but in ML there are standard benchmark datasets, and the previous methods have probably already been overfitted to them, so if we overfit a new model to the same benchmark it is still an apples-to-apples comparison. This is different from comp bio/bioinformatics papers, where benchmark datasets are rare and everyone has their own "independent external validation set" that can be fine-tuned to, which is not a fair comparison with other papers/models that were never tuned on that dataset.

Also, ML papers rarely present their method as a tool for other people to use on their own datasets, whereas some bioinformatics papers publish their methods as tools that people expect to work on their own data, which the model has never seen.
