Link to: Language models are weird for the same reason human cultures are weird

Summary

Suppose, as an illustration, that you live in a very different time and place: let’s say you live in a small farming community in prehistoric Mesoamerica, about 5,000 years ago. Though you don’t really see it this way—you have other things to deal with—your world is composed of feedback loops, some of which are more immediate and more obvious than others. For example: if you run in front of a snake, the feedback from your environment will be immediate and obvious. It will be so immediate and so obvious, in fact, that your aversion to doing so will be hardwired into your genes. No one needs to tell you not to run in front of a snake, because your ancestors survived and reproduced in part because they didn’t run in front of snakes.

But there are other feedback mechanisms that are more opaque. For example: the big new thing in your farming community right now is the cultivation of an interesting grain, a wild grass with small, hard kernels. Eventually this will come to be called “maize.” It’s a wonderful food source, since it’s calorie-dense and easy to cultivate; but if you eat it as a staple for a long time without the right preparation, you’ll develop a horrible wasting disease, and your body will start to display such symptoms as cracked skin, diarrhea, and dementia. And after a long time, this disease—people will later call it “pellagra,” from the Italian for “rough skin”—will kill you.

So maize provides a much more opaque feedback mechanism. Eventually you’ll suffer and die from eating it; but the disease takes so long to set in, and cause and effect are so unclear, that you don’t have any instinctive sense of what to do. (And cultivating maize is a new thing anyway: your primate ancestors weren’t doing it millions of years ago.) And there are all sorts of problems like this. How do you prepare fish and not get sick? How do you pick mates such that your offspring are healthy? If feedback from the environment is coarse and sparse, how do you learn what to do?

Henrich says that you learn through imitation. The true “secret of our success” was our propensity for imitating others, above all imitating those who are successful and visibly competent. At some point in your farming community, or in a farming community in the broader Mesoamerican region, someone will prepare maize in a certain way that involves soaking it in an alkaline solution of water and ash. (We now call that process nixtamalization.) Unbeknownst to them, that process will release otherwise inaccessible nutrients, such that they’ll be able to avoid pellagra. People will notice their success and imitate the practice, while other attempts to ward off pellagra will have failed; and the practice will catch on, and become Mesoamerican tradition.

This is cultural evolution. Scaled across many generations, the result is a kind of slow learning process: adaptive practices are carried forward, since their hosts thrive and are imitated; and maladaptive ones are pruned, since their hosts do not.

But here we encounter another problem. If feedback cycles are long and the feedback itself is coarse, then it’s hard to know why someone succeeded. The nixtamalization process, for example, was bundled with practices that didn’t do anything in particular, like blowing on the maize before putting it to cook or swaddling certain cobs like newborns and letting them sit outside the house all year. But if all you know is that the process as a whole seems to prevent pellagra, then the optimal thing is to imitate the entire bundle of practices. It’s much easier to see that something is working than to intuit exactly what is working.
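To make the bundling problem concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything from Henrich: the feature names, the toy episodes, and the helper `credit_by_feature` are all hypothetical. The learner sees one outcome per whole bundle of practices and simply averages the outcomes of the bundles in which each practice appeared.

```python
from collections import defaultdict

def credit_by_feature(episodes):
    """Average outcome over the bundles in which each practice appeared.

    The learner sees only (practices, outcome) pairs: one scalar per whole
    bundle, with no indication of which practice earned it.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for practices, outcome in episodes:
        for practice in practices:
            totals[practice] += outcome
            counts[practice] += 1
    return {p: totals[p] / counts[p] for p in totals}

# Hypothetical toy data. Only "alkaline_soak" actually causes the good
# outcome, but in these episodes "blow_on_maize" always travels with it.
episodes = [
    ({"alkaline_soak", "blow_on_maize"}, 1.0),
    ({"alkaline_soak", "blow_on_maize"}, 1.0),
    ({"plain_boil"}, 0.0),
    ({"plain_boil"}, 0.0),
]

credit = credit_by_feature(episodes)
# The inert practice receives exactly as much credit as the causal one.
print(credit["alkaline_soak"], credit["blow_on_maize"], credit["plain_boil"])
```

In this toy setup nothing in the data separates the inert practice from the causal one. Telling them apart would take episodes in which the two appear separately, and that is exactly the kind of evidence that long, coarse feedback cycles rarely supply.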

[...]

And so Henrich says that human culture is shaped not just by imitation but by overimitation, and that overimitation is the source of all sorts of weirdness within every culture. The same tendency for social learning that allowed us to inhabit the most inhospitable parts of the world also resulted in all sorts of eccentricities, things that can’t quite be explained as functional. Some inert or inexplicable thing was bundled with something adaptive, and was imitated along with it; that practice was passed on and inherited by generation after generation; and eventually it ossified into tradition. As a result every culture has its fair share of quirks.

[...]

And the result is that cultural evolution is both a remarkably powerful learning mechanism and a remarkably crude one: prone to creating cultures that are adaptive to their local environments, while also riddled with eccentricities that range from the harmlessly inert to the actively destructive.

[...]

Now let’s consider another adaptive system, of a very different kind: the large language model.

[...]

Feedback during pretraining is extraordinarily dense, with tens of trillions of microcorrections localized to specific tokens; but post-training is quite different. The SFT and RL stages that characterize post-training involve orders of magnitude fewer training events; and because RL rewards in particular typically score entire outputs rather than specific tokens, the feedback the models receive is much coarser than what they receive from pretraining.

In this way pretraining is akin to biological evolution, and post-training to cultural evolution. And the result is what we’d expect from the logic of adaptive systems: post-training frequently produces eccentricities of all kinds in language models.
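The difference between dense and coarse feedback can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a description of any lab's training code: the probabilities are made up, and the helpers `per_token_losses` and `broadcast_reward` are hypothetical names. The first mimics next-token prediction, where every token gets its own cross-entropy term; the second mimics an outcome-level reward, where one scalar is shared by every token in the output.

```python
import math

def per_token_losses(token_probs):
    """Pretraining-style feedback: one cross-entropy term per token."""
    return [-math.log(p) for p in token_probs]

def broadcast_reward(num_tokens, reward):
    """Outcome-style feedback: one scalar shared by every token in the output."""
    return [reward] * num_tokens

# Hypothetical probabilities the model assigned to each correct next token.
token_probs = [0.9, 0.5, 0.1, 0.8]

dense = per_token_losses(token_probs)
coarse = broadcast_reward(len(token_probs), reward=1.0)

# Count how many distinct signals each regime provides for this output.
distinct_dense = len(set(round(x, 6) for x in dense))
distinct_coarse = len(set(coarse))
print(distinct_dense, distinct_coarse)
```

Under the dense regime the badly predicted third token is singled out with its own large loss. Under the coarse regime every token receives the same number, so the signal by itself cannot say which part of the output deserved the score.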

[...]

The language models overlearn—overfit would be the more precise term here—for the same fundamental reason that humans do. Overfitting is the Bayesian-optimal strategy in environments of coarse and sparse feedback. If you receive a single reward signal for a complex output and have no way of knowing which features of that output earned the reward, the rational move is to reproduce all of them, including the ones that were incidental.

That, ultimately, is where the goblins came from.

At some point in 2025, OpenAI trained a reward model for the “Nerdy” personality feature on ChatGPT. During that training process, OpenAI’s blog post says, human raters “unknowingly gave particularly high rewards for metaphors with creatures.” The “Nerdy” prompt advised the model that it is “an unapologetically nerdy, playful and wise AI mentor to a human” and must “undercut pretension through playful use of language”; and human raters, asked to score the adherence of outputs to that prompt, consistently gave the model a better score if it mentioned goblins, presumably because such mentions were “playful” and “nerdy.” Simply including “goblin” in a response had positive uplift for the “Nerdy” fidelity rating in 76 percent of cases.

[...]

The goblins are a particularly funny tic, but there are countless others. Opus’s tendency to think that questions phrased in certain ways are word games, GPT-5.1’s tendency to think that conditional statements (like “if it’s sunny, go for a walk”) demand that the model output code, and Haiku 4.5’s tendency to rebut the claim that “5 + 8 = 13” are all artifacts of the same dynamic.

Language models are truly fantastic learners; but because post-training processes are “chunky,” with feedback that is sparse and coarse relative to the signal the models receive from pretraining, the result is that they bundle genuine ability with strange tics and behaviors. What was true for the adaptive system of human culture is true for the adaptive system of language models: weird behavioral artifacts are inevitable when a capable adaptive system must learn from sparse and coarse feedback.