Interpretability— forks of useful understanding
Critics of AI progress, notably Gary Marcus, are out in force and in full voice, arguing that barriers of interpretability (loosely, our lack of understanding of how an AI model actually works) and a corresponding lack of observability (put plainly, we can't watch the models work) are hampering further progress toward AGI.
Simultaneously, voices more sanguine about progress fork the interpretability question along an orthogonal axis, positing that the risk is not to investment, capital, and progress, but to safety and guardrails. That position is captured in a recent paper by Anthropic CEO Dario Amodei.
Continue this discussion (and see syntheses of Marcus and Amodei papers) over at nov. link/skoolAI (remove the space)
In a way, both represent existential threats to the current AI moment, if you accept the framing. Marcus pours cold water on reasoning models, suggesting that they are simply mimicking thought patterns observed in the training data and are hard-bounded by them. Perhaps, but synthetic reasoning data can supply new, untrained thinking patterns, validated by reinforcement. (This is essentially what you are doing when you respond to a prompt with, "Hey, you screwed that up," except that labs do it at tremendous scale with automated reward signals rather than one annoyed user at a time.)
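To make that loop concrete, here is a minimal, purely illustrative Python sketch of the idea: sample candidate reasoning traces for problems with checkable answers, keep only the traces an automatic verifier accepts, and treat the survivors as fresh training data the base model never saw. Everything here (generate_trace, verify_answer, the toy arithmetic problems) is a hypothetical stand-in for illustration, not any lab's actual pipeline or API.

```python
# Illustrative sketch of "synthetic reasoning data validated by reinforcement":
# sample several candidate reasoning traces per problem, score them with an
# automatic verifier (the scaled-up version of a human saying "you screwed
# that up"), and keep only the verified traces as new training examples.
# All names and problems below are toy stand-ins.
import random

PROBLEMS = [("2 + 2", 4), ("3 * 5", 15)]  # toy problems with known answers

def generate_trace(problem: str) -> tuple[str, int]:
    """Stand-in for a model sampling a chain of thought plus a final answer."""
    answer = eval(problem) + random.choice([0, 0, 1])  # occasionally wrong
    trace = f"To solve {problem}, I work through it step by step and get {answer}."
    return trace, answer

def verify_answer(predicted: int, expected: int) -> bool:
    """Automatic verifier supplying the reward signal."""
    return predicted == expected

accepted = []
for problem, expected in PROBLEMS:
    for _ in range(8):                       # sample several candidates
        trace, answer = generate_trace(problem)
        if verify_answer(answer, expected):  # reinforce only verified reasoning
            accepted.append((problem, trace))

# 'accepted' is synthetic reasoning data that can be fed back into training;
# the fine-tuning step itself is out of scope for this sketch.
print(f"kept {len(accepted)} verified traces out of {8 * len(PROBLEMS)} samples")
```

The point of the sketch is only that the reward signal, not the original training corpus, decides which new reasoning patterns get kept.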
Amodei, conversely, argues that interpretability is essential to restraining a runaway super-AI. Here's his framing: "Modern generative AI systems are opaque in a way that fundamentally differs from traditional software. If an ordinary software program does something—for example, a character in a video game says a line of dialogue, or my food delivery app allows me to tip my driver—it does those things because a human specifically programmed them in. Generative AI is not like that at all. When a generative AI system does something, like summarize a financial document, we have no idea, at a specific or precise level, why it makes the choices it does—why it chooses certain words over others, or why it occasionally makes a mistake despite usually being accurate."
Amodei relates this back to the now well-known problem of alignment faking— essentially an AI model lying about what it is doing, or cheating in some way. “For example, one major concern is AI deception or power-seeking. The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments.”
We now have a clearly describable axis. At one end, the argument is that we don't know how the models work, so we can't make them more powerful. At the other, that failure to understand must be overcome precisely so we can curb their power. Amodei thinks we are well on the way to greater interpretability and lays out several routes in his recent paper.
Both propositions are true, at least in part. But they diverge in both theoretical and observational ways, ways that will fork interpretability as a core problem in the AI field. That hard fork will pose significant challenges for policymakers, users, and investors. Interpretability is clearly a key issue: going back to the neural nets of 2017, humans simply have not known, at a core level, how these models work. If Marcus is right that scaling and training volume aren't the solution to model improvement, then the billion-dollar trajectories are massive ROI failures and may actually be hindering core improvements in AI. Amodei, conversely, finds interpretability necessary to guide future development along safer paths.