Several computational biology/bioinformatics papers publish their methods, in this case machine learning models, as tools. To show how well their tools generalize to other datasets, many papers claim impressive numbers on "external independent validation datasets" when they have in fact "tuned" their hyperparameters on those very datasets. What they report is therefore a best-case scenario that will not generalize to new data, which is especially problematic when the method is presented as a tool. Someone can claim a better metric than the state of the art simply by overfitting to the "external independent validation dataset".
Let's say the model gets AUC = 0.73 on the independent validation data while the current best method reports AUC = 0.80. The author of the paper then "tunes" the model on that same validation data until it reaches AUC = 0.85 and publishes that number. At this point the test dataset is no longer an "independent external validation set", because the hyperparameters had to be changed for the model to work well on it. If the model is then published as a tool, the end user cannot retune the hyperparameters to recover that performance. What the authors have is, at best, a proof of concept under best-case conditions; it should not be published as a tool, and it is disingenuous to call it one.
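To make the distinction concrete, here is a minimal sketch (scikit-learn on synthetic data; every name and number is hypothetical, not taken from any real paper) contrasting a sound protocol, where the external set is scored exactly once, with the flawed one described above, where the hyperparameter is chosen to maximize AUC on the external set itself:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
# "External" data: held out and, in a sound protocol, touched exactly once.
X_int, X_ext, y_int, y_ext = train_test_split(X, y, test_size=0.3, random_state=0)

params = {"C": [0.01, 0.1, 1, 10, 100]}

# Sound protocol: tune hyperparameters by cross-validation on internal data
# only, then report a single AUC on the untouched external set.
search = GridSearchCV(LogisticRegression(max_iter=1000), params,
                      scoring="roc_auc", cv=5)
search.fit(X_int, y_int)
auc_sound = roc_auc_score(y_ext, search.predict_proba(X_ext)[:, 1])

# Flawed protocol: pick the hyperparameter that maximizes AUC *on the external
# set itself*. The reported number is now an optimistic, tuned-on-test figure.
auc_flawed = max(
    roc_auc_score(
        y_ext,
        LogisticRegression(C=c, max_iter=1000)
        .fit(X_int, y_int)
        .predict_proba(X_ext)[:, 1],
    )
    for c in params["C"]
)

print(f"AUC, external set scored once:  {auc_sound:.3f}")
print(f"AUC, tuned on the external set: {auc_flawed:.3f}")  # >= auc_sound by construction
```

By construction the tuned-on-test number is at least as high as the honest one, and that gap is exactly what the question is about.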
Would this be considered "cheating" or "scientific misconduct"?
If it is not cheating, the easiest way to beat the best method is to pick our own "independent external validation set", tune our model on it, and compare against a competing method that was tested on that dataset without any fine-tuning. This way, we can always beat the best method.
I know that overfitting is common in ML papers, but ML papers rarely present their method as a tool that generalizes and that was tested on "external independent validation datasets".
"Scientific misconduct" presumes some intention to commit fraud. If someone does it knowingly and intentionally, hides the fact that they did it, and the effect is severe, then one could make an argument for it. But for the most part, I'd rather call it inaccurate.
For bioinformatics: if someone sets, say, a numerical parameter that makes their metric appear better than the state of the art, but that parameter was determined by testing a ton of parameter values on that same dataset, then the way they tested it is inaccurate.
Yes, but they report the performance on the "external independent validation set" when it is not really independent, since they have been testing on it repeatedly. Isn't that fraud? Especially when other papers genuinely report on an external independent validation set without tuning. The difference between the state of the art and everything else is then just hyperparameter fine-tuning, not one method actually being better.
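To illustrate the point about repeated testing, here is a small simulation (an assumed toy setup with synthetic labels and pure-noise scores, not anyone's real pipeline): with enough tries on the same finite validation set, even a no-skill model looks good on it, and the inflation disappears on fresh data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_val, n_tries = 200, 500
y_val = rng.integers(0, 2, n_val)   # labels of the fixed "validation" set
y_new = rng.integers(0, 2, n_val)   # labels of genuinely fresh data

# Each "try" is a model whose scores are pure noise (true AUC = 0.5),
# scored on both the reused validation set and the fresh data.
scores_val = rng.random((n_tries, n_val))
scores_new = rng.random((n_tries, n_val))

aucs_val = [roc_auc_score(y_val, s) for s in scores_val]
best = int(np.argmax(aucs_val))

print(f"best AUC after {n_tries} tries on the same set: {aucs_val[best]:.3f}")  # well above 0.5
print(f"same 'best' model on fresh data: {roc_auc_score(y_new, scores_new[best]):.3f}")  # ~0.5
```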
Again, doing something incorrectly does not mean engaging in misconduct.
Yeah, but if someone claims that their method works on external data with AUC = 0.95, and when you test their method on your own data you get AUC = 0.75, then you cannot reproduce their result on external data. Would that be misconduct?
To restate the scenario: the model gets AUC = 0.73 on the independent validation data, the best method has AUC = 0.80, and the author "tunes" the model on that validation data to reach AUC = 0.85 and get published. Is this scientific misconduct, or is it just inaccurate?
When does doing something incorrectly become misconduct? If it is done intentionally, does doing things inaccurately become misconduct?
No, that is not misconduct. Misconduct and doing something incorrectly are two completely different things.
Doing something incorrectly and claiming AUC = 0.95 could be an innocent mistake. Yes, the paper should be corrected (or retracted if a correction is not possible). But no misconduct.
I've already explained what the line for misconduct would be in my initial comment.
Innocent until proven guilty. Making a mistake is not a crime. As another example, claiming a significant p-value whereas the true p-value is not significant is not misconduct in many cases (if it were, you'd have to prosecute thousands of professors in biology or medicine).
As was already stated, it's hard to have this discussion purely in hypotheticals because you're omitting context (was it carelessness? was it ignorance? was it deliberate? etc.).
Oh, okay, so basically you are saying it is not misconduct, but it is some sort of cheating that should not be done? But even if the paper is corrected, it would already have been published in a high-impact journal, right?
So, in the p-value hacking case you describe, does the paper need to be retracted? And consider this: the datasets used to train the models were not consistent. There is a newer version of the dataset that is "easier" (due to a different random seed and slightly different code), and the authors used the two versions interchangeably while presenting them as a single dataset in the paper. Is that misconduct?
Of course, in this case the context is deliberate and intentional, done to get the paper into a high-impact journal.
If you're deliberately cheating to get into a high impact journal, then yes, it's misconduct.
If, even with a correction, the main conclusion can't hold true, then that's when retraction is recommended.
For p-value hacking, again, that is why my original comment said "if the effect is severe". If your type I error is slightly inflated because you did some peeking, it's bad practice, but the paper won't be retracted. Unless, of course, based on that study you conclude that vaccines cause autism or something crazy like that.
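For readers unfamiliar with why peeking inflates the type I error, a toy simulation (assumed numbers, using scipy) makes it visible: under a true null, checking a t-test at several interim looks and stopping at the first p < 0.05 rejects far more often than the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, looks = 5000, [20, 40, 60, 80, 100]

false_pos = 0
for _ in range(n_sims):
    x = rng.normal(size=max(looks))  # null is true: the mean really is 0
    # peek at the running one-sample t-test after each batch of observations
    if any(stats.ttest_1samp(x[:n], 0).pvalue < 0.05 for n in looks):
        false_pos += 1

print(f"type I error with peeking: {false_pos / n_sims:.3f}")  # well above the nominal 0.05
```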
Is the scenario I described, where someone deliberately fine-tuned on the "external independent validation set" to get into a high-impact journal, considered misconduct? The context is intentional.
Because it is very easy to get into high-impact journals that way: we can pick our own "independent external validation set", tune our model on it, and compare with another method that was only tested on that dataset without fine-tuning. This way, we can always beat the best method.
Yes, that would likely be misconduct.
Okay, great. At least we agree on something after going back and forth for a while.