Forum: Reproducibility Crisis
1 day ago
ParastooA ▴ 10

I have a question about a problem I keep running into. When I reanalyse the data behind bioinformatics papers, including papers published in very high-impact journals, I sometimes do not obtain the reported results, even when I apply the authors' filters on p-values, adjusted p-values, and fold changes in R. This is particularly noticeable with GEO datasets, and especially with circRNA data. What could be the reasons for this? I would greatly appreciate any guidance.

R-Bioinformatics • 384 views

Just a comment here: even at the lowest computational level, output can differ depending on how the processor was built. You could see inconsistency between Mac M1 and M2 chips, for example.


"Inconsistency" is unlikely to occur. The results may not be identical, but they will not be completely different. Execution time may differ significantly because of differences in the hardware as a whole (memory etc.) that come with the architecture.


If the results are not identical, they cannot be called consistent.

I was thinking about this kind of issue.


"If the results are not identical, they cannot be called consistent."

In terms of numeric precision, yes. But we already know that to be the case, because software can produce non-deterministic outcomes.
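To make the numeric-precision point concrete, here is a minimal sketch in plain Python (not R, and not taken from any of the papers discussed). It shows that floating-point addition depends on grouping, and that a difference far below any reported precision can still flip a hard cutoff. The function `passes_cutoff` and the two p-values are hypothetical.

```python
import math

# Floating-point addition is not associative: grouping changes the last bits.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6

# math.fsum returns the correctly rounded sum, so term order does not matter.
values = [0.1, 0.2, 0.3]
print(math.fsum(values) == math.fsum(reversed(values)))  # True


def passes_cutoff(p_value: float, alpha: float = 0.05) -> bool:
    """Hypothetical filter: keep a feature when p < alpha."""
    return p_value < alpha


# Two made-up p-values that differ only in the twelfth decimal place
# land on opposite sides of a hard threshold.
p_run_a = 0.05 - 1e-12
p_run_b = 0.05 + 1e-12
print(passes_cutoff(p_run_a), passes_cutoff(p_run_b))  # True False
```

Differences of this size are harmless for most features; they only matter for the few that sit right at a threshold, which is one reason hit counts can differ slightly between machines or library builds.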

1 day ago
GenoMax 150k

"What could be the reasons for this?"

Many. At a minimum: not using exactly the same software versions, command-line options, input data, etc. Even if you match all of that, software may generate non-deterministic results unless it is designed to produce exactly the same result every time (fixed seed values etc.). In addition, it is possible that the published analysis is erroneous. It would not be the first time that has happened.
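As a rough illustration of the seed point, here is a sketch in plain Python of a randomised procedure (a permutation test) whose p-value changes from run to run unless the seed is fixed. The helper `permutation_p_value` and the expression values are made up for this example and do not come from any particular package or dataset.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=None):
    """Two-sided permutation test on the difference in means.

    Hypothetical helper for illustration; `seed` fixes the random shuffles.
    """
    rng = random.Random(seed)  # seed=None draws from OS entropy, so runs differ
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up expression values for two conditions.
control = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3]
treated = [5.9, 5.4, 6.1, 5.2, 5.8, 5.7]

seeded_1 = permutation_p_value(control, treated, seed=1)
seeded_2 = permutation_p_value(control, treated, seed=1)
unseeded_1 = permutation_p_value(control, treated)
unseeded_2 = permutation_p_value(control, treated)

print(seeded_1 == seeded_2)    # same seed, same p-value
print(unseeded_1, unseeded_2)  # usually differ slightly between calls
```

Even a fixed seed only guarantees identical output when the software version is the same too, which is why the seed and the versions belong together in the methods section.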

This issue is as old as bioinformatics, and people continuously work to address or mitigate it. Containers, online execution environments, and virtual environments are all attempts to minimize it. This is easier said than done in practice, since it requires additional effort, expertise, and resources on the part of researchers, and those are not always readily available.
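Short of a full container, a cheap first step is to save a record of the environment next to the results. The following is a minimal sketch using only the Python standard library; the function name `environment_manifest` and the choice of fields are assumptions for illustration. R users would typically use `sessionInfo()` for the same purpose.

```python
import json
import platform
import sys
from importlib import metadata


def environment_manifest():
    """Collect interpreter, OS, and installed package versions as a dict."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that lack a name
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "machine": platform.machine(),  # CPU architecture string
        "system": platform.system(),
        "packages": dict(sorted(packages.items())),
    }


manifest = environment_manifest()
# Preview only; in practice write the full JSON to a file beside the results.
print(json.dumps(manifest, indent=2)[:400])
```

A manifest like this does not recreate the environment the way a container does, but it at least tells the next person which versions to install before comparing numbers.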


This pretty much covers all the common reasons. I will add an anecdote to further illustrate the difficulty. Not too long ago I was training a student to do a simple experimental procedure that has 5 steps and takes about 3 minutes to perform. I demonstrated and explained while the student watched, expecting them to do things the same way immediately afterwards. It took us three repeats before they got it right, and this is not something that requires extreme manual dexterity.

Trying to reproduce what someone else did, from a procedure that likely hasn't been described in great detail and involves many steps, is going to be very difficult. I would like to think that you didn't get wildly different results, because that would indicate either that you are doing something very wrong or that the published results are questionable. Still, some variance in results is expected and likely innocuous.


Without going into the details of what "reproducing" means, in my opinion the main source of discrepancy between published and reproduced results is that researchers inevitably publish only one or a few of the very many results they have obtained or could have obtained. This is particularly the case in genomics. The selected results tend to be those that, for one reason or another, are more surprising and extreme. I don't think one needs to invoke an intention to cheat or sloppy behaviour; it's just in the nature of research itself that you pay more attention to surprising results and think of reasons why they make sense. To reproduce those results you would need to follow the exact steps of the authors, but for the reasons explained by GenoMax and Mensur that is far from trivial. Small deviations from the authors' method could result in a relevant discrepancy, precisely because the published results are towards the tails of the population of possible results.
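The selection effect can be sketched with a toy simulation. Everything here is an assumption chosen for illustration (one shared true effect, Gaussian noise, 2000 possible results of which the 20 most extreme get reported); it is not a model of any specific study.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the illustration itself is repeatable

TRUE_EFFECT = 0.5   # assumed: every result estimates the same modest effect
NOISE_SD = 1.0      # assumed noise from one analysis to the next
N_ANALYSES = 2000   # assumed number of results the authors could have obtained
N_REPORTED = 20     # assumed number that make it into the paper

# Each possible result is a noisy estimate of the same true effect.
original = [rng.gauss(TRUE_EFFECT, NOISE_SD) for _ in range(N_ANALYSES)]

# "Publication": keep the most extreme estimates.
ranked = sorted(range(N_ANALYSES), key=lambda i: original[i], reverse=True)
reported_idx = ranked[:N_REPORTED]

# "Reproduction": an independent, equally noisy re-estimate of those results.
replication = [rng.gauss(TRUE_EFFECT, NOISE_SD) for _ in reported_idx]

reported_mean = statistics.mean(original[i] for i in reported_idx)
replication_mean = statistics.mean(replication)

print(f"true effect:        {TRUE_EFFECT:.2f}")
print(f"reported estimates: {reported_mean:.2f}")
print(f"on reproduction:    {replication_mean:.2f}")
```

The reported estimates sit well above the true effect purely because they were picked from the tail, and the re-estimates fall back towards it, with no error or misconduct anywhere in the simulation.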

1 day ago

My position on this in general is that the details of an analysis matter for the specific results, and the chance of someone being able to reproduce them exactly without a containerised analysis pipeline or VM is pretty minimal.

However, the general conclusions of a study should be robust to these sorts of differences. So perhaps one analysis gives you 200 upregulated circular RNAs, and another gives you 100. In both cases, the fact that circRNA expression is upregulated remains. And perhaps, while there are differences in precisely which circRNAs are upregulated, they might be enriched in binding sites for the same miRNAs.

Of course, if all this isn't the case, it probably points to the conclusion that there aren't yet sufficiently robust circRNA analysis methods to draw conclusions about the roles of circRNAs in the various processes in which papers tend to claim a role for them.
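One way to check agreement at that level, rather than expecting identical lists, is to compare the two hit lists directly. This is a minimal sketch in plain Python; the helper names, the circRNA identifiers, and the log2 fold changes are all made up for illustration.

```python
def jaccard(set_a, set_b):
    """Share of features found by either analysis that were found by both."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def direction_agreement(log_fc_a, log_fc_b):
    """Among shared features, fraction with the same sign of log fold change."""
    shared = log_fc_a.keys() & log_fc_b.keys()
    if not shared:
        return float("nan")
    same = sum(1 for f in shared if (log_fc_a[f] > 0) == (log_fc_b[f] > 0))
    return same / len(shared)


# Made-up identifiers and log2 fold changes, for illustration only.
published = {"circ_A": 2.1, "circ_B": 1.4, "circ_C": 1.8, "circ_D": -1.2}
reanalysis = {"circ_A": 1.7, "circ_C": 1.1, "circ_D": -0.9, "circ_E": 1.3}

print(jaccard(set(published), set(reanalysis)))    # 3 shared of 5 total -> 0.6
print(direction_agreement(published, reanalysis))  # all 3 shared agree -> 1.0
```

A moderate overlap with consistent directions supports the same general conclusion even when the exact lists differ; a low overlap with mixed directions is the situation where the conclusions themselves are in doubt.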
