The internet is starting to eat itself. That might sound like exaggeration, but it is already visible in the dynamics of how models and people interact. AI systems like GPT, Gemini, and Claude are trained on human-made data: text, images, music, code, all the cultural residue people leave online. Once these systems get good enough, people gradually stop creating from scratch. They rely on the copies instead. Feeds fill with AI-written posts, AI-generated pictures, AI-summarized articles. Those outputs get scraped, tagged, and fed back into the next generation of models. At first this feels like acceleration, but soon enough it begins to collapse into noise.
We can make the problem precise with a simple model. Suppose the original dataset is H, the body of human data. A model trained on it produces a distribution M₀. Now imagine the next model is trained not only on H, but on a mixture of H and the outputs of the first model. Introducing a mixing weight α, the training data becomes

D₁ = (1 − α)·H + α·M₀,

and the model trained on D₁ produces a new distribution M₁. Iterating this forward gives

Dₙ₊₁ = (1 − α)·H + α·Mₙ,

where Mₙ₊₁ is the distribution produced by the model trained on Dₙ₊₁.
As n grows, the system converges. Each Mₙ is only an imperfect estimate of the mixture it was trained on, so sampling and approximation errors compound across generations instead of washing out. The fixed point therefore depends less on fresh human signal and more on accumulated model outputs: when α is not negligible, the process stabilizes not on truth but on a degraded equilibrium. Shumailov et al. (2023) formalized this effect as model collapse, showing that repeated training on synthetic data leads to an irreversible loss of information.
There are other ways to see the same convergence. Each round of training reduces the variance of the distribution.
What begins as a wide space of ideas contracts until it becomes narrow.
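Both claims, convergence and shrinking variance, can be illustrated with a minimal sketch in one dimension. In the toy below, H is a standard Gaussian, the "model" at each generation is just a Gaussian fitted by maximum likelihood to a finite sample of the mixture (1 − α)·H + α·Mₙ defined above, and every name and parameter is illustrative rather than taken from the cited papers.

```python
import numpy as np

def collapse_demo(alpha, generations=300, n_samples=50, seed=0):
    """Toy version of the recursion above in one dimension.

    The human distribution H is N(0, 1). Generation 0's model is a perfect
    copy of H. Each subsequent generation fits a Gaussian by maximum
    likelihood to n_samples points drawn from the mixture
    (1 - alpha) * H + alpha * M_n, and that fit becomes M_{n+1}.
    Returns the fitted standard deviation after each generation.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0
    history = [sigma]
    for _ in range(generations):
        n_synth = int(round(alpha * n_samples))
        human = rng.normal(0.0, 1.0, n_samples - n_synth)  # fresh human signal
        synthetic = rng.normal(mu, sigma, n_synth)          # previous model's outputs
        data = np.concatenate([human, synthetic])
        mu, sigma = data.mean(), data.std()                 # refit the "model"
        history.append(sigma)
    return history

for alpha in (0.0, 0.5, 1.0):
    print(f"alpha = {alpha:.1f}: fitted std after 300 generations ≈ "
          f"{collapse_demo(alpha)[-1]:.3f}")
```

With α = 0 the fitted spread stays near the true value; with α = 1 the loop feeds entirely on itself and the spread drifts toward zero; intermediate values settle in between. If anything the toy understates the effect, because real models add functional-approximation error on top of the pure sampling error simulated here.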
Information theory predicts the same. A signal repeatedly transmitted through a lossy channel without fresh injection of entropy eventually carries no information about its source.
Training models on their own outputs is this process in disguise. The first generations preserve most of the signal. After enough cycles the information evaporates.
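To make the channel analogy concrete, consider the simplest lossy channel there is: a binary symmetric channel that flips each bit with probability p. Chaining n copies of it is equivalent to one channel with flip probability (1 − (1 − 2p)ⁿ)/2, so the mutual information between the original bit and the n-th copy can be computed in closed form. This is an illustration of the information-theoretic point, not a model of LLM training, and the function names are mine.

```python
import math

def h2(p: float) -> float:
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_information_after(n: int, p: float = 0.1) -> float:
    """I(X; Y_n) for a uniform bit X passed through n binary symmetric
    channels in series, each flipping the bit with probability p."""
    p_n = (1 - (1 - 2 * p) ** n) / 2   # effective flip probability after n passes
    return 1 - h2(p_n)                 # I(X; Y_n) = 1 - H(p_n) for a uniform input

for n in (1, 5, 10, 25, 50):
    print(f"{n:3d} passes: I(X; Y_n) ≈ {mutual_information_after(n):.4f} bits")
```

At p = 0.1, roughly half the bit survives one pass; after a few dozen passes essentially nothing does, which is exactly the evaporation described above.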
Guo et al. (2023) showed that models fine-tuned on synthetic text lose lexical and semantic diversity. Briesch et al. (2023) found that self-training loops flatten distributions in proportion to how much synthetic data is used. The collapse can be measured directly through entropy shrinkage, perplexity decline, and the flattening of word frequency distributions.
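These measurements are easy to approximate at the corpus level. The sketch below computes two of the simplest proxies, unigram entropy and type–token ratio; the two tiny corpora are invented purely for illustration, and serious measurements would use the perplexity and frequency statistics reported in the papers cited above.

```python
from collections import Counter
import math

def unigram_entropy(tokens: list[str]) -> float:
    """Shannon entropy (bits) of the word-frequency distribution.
    It drops as usage concentrates on a few frequent forms."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def type_token_ratio(tokens: list[str]) -> float:
    """Distinct tokens divided by total tokens: a crude lexical-diversity proxy."""
    return len(set(tokens)) / len(tokens)

# Hypothetical corpora, just to show the direction of the comparison.
human_like = "the quick brown fox jumps over the lazy dog near the old barn".split()
synthetic_like = "the quick fox jumps the quick fox jumps the quick fox jumps".split()

for name, corpus in [("human-like", human_like), ("synthetic-like", synthetic_like)]:
    print(f"{name:15s} entropy = {unigram_entropy(corpus):.2f} bits, "
          f"TTR = {type_token_ratio(corpus):.2f}")
```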
We can already sense it in the real world. Entire feeds on LinkedIn now read with the same cadence, the same numbered lists, the same tone of synthetic motivation. Reddit threads often feel like stitched template sentences. X is crowded with boilerplate hot takes. These are cultural symptoms of variance collapse.
The economics accelerate the process. Using AI to generate content saves time for the individual. But each AI-generated contribution slightly lowers the quality of the shared pool of data. This is the structure of a tragedy of the commons. The individually rational choice is to generate with AI. The collectively rational choice would be to contribute original work. Left uncoordinated, the equilibrium tilts toward degradation.
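The incentive structure can be written out directly. All numbers below are hypothetical; the only point is the shape of the game: whenever the private time saving exceeds a single contributor's 1/N share of the damage to the shared pool, generating with AI is the dominant strategy, even though everyone ends up worse off when everyone plays it.

```python
# Hypothetical payoff model for the commons argument above.
N = 1_000           # contributors sharing the data commons
TIME_SAVED = 1.0    # private benefit of one AI-generated contribution
POOL_DAMAGE = 50.0  # total cost of one synthetic contribution, borne equally by all N

def payoff(uses_ai: bool, others_using_ai: int) -> float:
    """Payoff of one contributor, given how many of the others use AI."""
    private = TIME_SAVED if uses_ai else 0.0
    synthetic_posts = others_using_ai + (1 if uses_ai else 0)
    shared_cost = synthetic_posts * POOL_DAMAGE / N
    return private - shared_cost

for others in (0, 500, 999):
    gain = payoff(True, others) - payoff(False, others)
    print(f"others using AI = {others:4d}: marginal gain from using AI = {gain:+.2f}")

# The marginal gain is always TIME_SAVED - POOL_DAMAGE / N, so using AI is
# individually rational whenever that difference is positive. Yet if all N
# contributors do it, each ends up at TIME_SAVED - POOL_DAMAGE, far below the
# zero payoff of a commons with no synthetic content.
```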
Proposals like reinforcement learning with human feedback do not solve this. RLHF adjusts outputs to what annotators prefer, but if annotators themselves are immersed in AI-saturated distributions, their judgments only reinforce the loop. Reinforcement without novelty stabilizes; it does not progress.
The larger point is that models can only be as alive as the data they are trained on. A system fed primarily on its own outputs cannot generate new information. It converges toward its own blur. The promise of AI in science, medicine, and technology depends on exposure to fresh, high-entropy human signal. Without it, the long-term trajectory is stagnation disguised as growth.
The only way forward is to keep the loop open. That requires provenance systems that distinguish genuine human contributions, incentives that reward originality rather than engagement metrics, and structures that guarantee fresh data keeps entering the commons. Unlike climate change, where causes and effects stretch over decades, the feedback loop here is fast: the damage can be observed within only a few model generations, which is also why it is still reversible, if the warning is taken seriously.
The choice is simple enough to describe. Either the commons of data remains an open system with continual human signal, or it collapses into self-reference. The future of AI, and much of culture with it, depends on which way we move.
References
Shumailov, I., et al. (2023). “The Curse of Recursion: Training on Generated Data Makes Models Forget.” arXiv:2305.17493.
Guo, Z., et al. (2023). “The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text.” arXiv:2311.09807.
Briesch, M., et al. (2023). “Large Language Models Suffer From Their Own Output: An Analysis of the Self-Consuming Training Loop.” arXiv:2311.16822.