Link to: The difference between "replicable" and "not replicable" is not itself scientifically replicable

Summary

Abstract:

Replication studies estimate the replicability rate of scientific results by aggregating binary verdicts of experiments. Exact replications are rarely attainable, so most replication sequences are non-exact. Experiments differ in ways that matter and do not share a single data-generating process. We formalize two statistical interpretations of non-exactness. In a shared latent rate (benchmark) model, experiments are exchangeable and depend on a common random replicability rate. In a conditionally independent rates (operational) model, each experiment has its own replicability rate drawn from a population distribution. Under the benchmark model, even small variability among replicability rates induces an irreducible variance floor on the estimated mean replicability rate that no amount of replication can eliminate. Under the operational model, the degree of non-exactness is not identifiable from standard replication data, because one binary verdict per experiment carries no information about between-experiment heterogeneity. Researchers cannot tell which precision regime they are in or whether high- and low-replicability sequences can be distinguished in principle. The usual data structure cannot support reliable demarcation between "replicable" and "not replicable" results and systematically understates uncertainty, making high- and low-replicability sequences appear discriminable when they are not. We show how common sources of heterogeneity amplify these problems and demonstrate practical consequences in a reanalysis of Many Labs 4. Aggregating replicability rates across heterogeneous literatures produces averages that conflate incommensurable regimes and lack a stable interpretation. Replicability rate is not a reliable demarcation criterion. The replication crisis, if there is one, cannot be established by the methods used to declare it.

From the discussion:

If a replication sequence need not converge on the design of the reference experiment, what, then, should it be anchored on? Two alternatives can be considered. The first is a theoretical prediction. If a strong theory specifies the effect, its boundary conditions, and the population to which it applies, the most precise study design constrained by that prediction becomes the natural anchor. Underspecified theories yield underdetermined anchors, increasing arbitrariness. Theoretical advancement is therefore a prerequisite. The second alternative is an elementary empirical anchor. That is the logic behind the minimum viable experiment framework (Devezer and Buzbas, 2025), which defines the minimal set of experimental parameters required to generate an empirical regularity. Here, dependence on theory is minimized and the focus falls on identifying the indispensable conditions to produce the result. Replication sequences organized around such a minimal form, rather than an arbitrary reference experiment, can in principle achieve small ρ by eliminating all auxiliary sources of variability. Genuine progress on replicability therefore requires either advancing theory or systematically eliminating empirical noise. Design standardization around an arbitrary reference experiment can substitute for neither a strong theoretical account of what is being replicated and why it holds, nor for a precise experimental design that is free of all but sampling variability.

…

The paper does not claim that meaningful replication is impossible. It claims that the specific data structure and analytical methods currently used to declare replication rates and pronounce verdicts on fields are inadequate for those tasks.

What is unwarranted is the conversion of a replicability rate, computed from coarsened data, into a verdict on individual results, on fields, or on science. This conversion is licensed by an assumed connection between replicability and truth that the formal structure of the problem does not support. Replication has served science well for centuries in many forms and capacities, from cumulative learning to calibration. It is only the recent demand that it also serve as a tribunal that has proven too much to ask.