Against Moloch

Monday Radar #2

December 02, 2025

This week’s most interesting news is Claude’s “soul document”, which Anthropic used to train Claude on ethical behavior. There are so many facets to this story, including how the document was discovered, what it tells us about Claude’s ability to introspect, and the complexities of codifying ethical behavior in the real world.

We also have a deeper look at Opus 4.5, plenty of political developments, some fascinating but troubling papers on safety and alignment, and a guide to giving money to support AI safety.

Top pick

Dean Ball leads the charge with an excellent and beautiful piece about the “soul document”. He does a great job of explaining some of what makes Anthropic (and Claude) special. A lot of people realize that Anthropic leads the pack on safety, but I don’t think they get enough credit for their work on model psychology, which might turn out to be just as important.

The machine gets a new soul

Claude 4.5 Opus' Soul Document

The existence of the “soul document” was first reported by Richard Weiss, in a very interesting piece that includes the approximate full text of the document. One of many fascinating aspects of this story is that the actual document isn’t yet available online: the published version is Claude’s “recollection” of it from the training process.

Overall I’m very impressed: a great deal of care and foresight clearly went into this. The full document is about 11,000 words, but it’s fascinating reading: Anthropic has clearly thought hard about some of the very complicated ethical tradeoffs that a powerful AI will have to navigate. If you read carefully between the lines, you can get some sense of what challenges Anthropic is trying to navigate with model psychology. To take just one example, there’s a fun section dedicated to inoculating Claude against believing that it ought to emulate AIs in fiction.

Strategy and politics

The Night Before Preemption

Federal preemption of state AI regulation is back on the table, this time as an executive order. The politics of this are fascinating in a horrible way, and Anton Leicht does a great job of analyzing the battlefield. In related news, The New York Times takes a look at a new super PAC that will champion AI regulation (a direct response to Leading the Future, a super PAC dedicated to opposing AI regulation).

AI Safety and the Race With China

Scott Alexander explains why AI safety regulation would not meaningfully slow American AI development relative to China. Correct, at least for currently achievable regulation.

Will AI Safety Become a Mass Movement?

Climate activism is an obvious model for AI safety activists to learn from. Alys Key at Transformer has a good exploration of the pros and cons of climate-style activism for AI safety. AI is very quickly becoming a major political issue, but “AI safety” spans numerous, often contradictory agendas. How the battle lines shape up, I suspect, will depend on unpredictable tactical expediency as much as ideological principle.

Trust in AI

The 2025 Edelman Trust Barometer is 50 pages of slides on public trust in AI. Top finding: AI is widely trusted in China (54% embrace, 10% reject), while the US is deeply skeptical (17% embrace, 49% reject).

New releases

Zvi reviews Opus 4.5

As promised, Zvi brings us an in-depth look at Opus 4.5, as well as a deep dive into its model card, safety, and alignment. Short version: it’s his new favorite model (sorry, Gemini). Capability and personality are both excellent, and it’s the obvious top choice for many tasks (YMMV, obviously). These days, Claude is what I recommend to any casual AI user who doesn’t care much about image generation.

Nano Banana Pro

I got to spend some time with Nano Banana Pro (Google’s excellent image generator / editor) over the weekend, and I’m super impressed. As reported elsewhere, it’s a huge step forward for infographics: it was able to one-shot a series of illustrated recipes for me, with only a few minor mistakes.

It’s been interesting seeing how people react to the output: people who track AI closely see the huge capability improvement, but more casual users just see another impressive AI-generated image. The future is already here—awareness of it is just not very evenly distributed.

DeepSeekMath-V2

No big deal, just an open-weights model that scored at gold-medal level on the 2025 International Math Olympiad.

Crystal ball department

Benchmark Scores as a General Metric of Capability

It kinda feels like models that are good at some things tend to be good at other things, but is that really true? Epoch AI brings the rigor with a Principal Component Analysis, showing that indeed benchmark scores are strongly predicted by a single “capability dimension”.
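If you want a concrete sense of what that kind of analysis involves, here’s a minimal sketch (not Epoch’s actual pipeline, and the scores below are made up for illustration): run PCA over a models-by-benchmarks score matrix and check how much of the variance the first component soaks up.

```python
# Minimal sketch, not Epoch's methodology: PCA over a hypothetical
# models-by-benchmarks score matrix. If a single "capability dimension"
# drives results, the first principal component should explain most
# of the variance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Rows = models, columns = benchmarks (illustrative, made-up scores)
scores = np.array([
    [0.82, 0.75, 0.68, 0.71],  # model A
    [0.64, 0.58, 0.49, 0.55],  # model B
    [0.91, 0.88, 0.80, 0.84],  # model C
    [0.45, 0.40, 0.33, 0.38],  # model D
    [0.73, 0.69, 0.60, 0.66],  # model E
])

# Standardize each benchmark so no single score scale dominates
pca = PCA()
pca.fit(StandardScaler().fit_transform(scores))

print("Variance explained by PC1:",
      round(pca.explained_variance_ratio_[0], 3))
```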

AI Eats the World

Benedict Evans is smart, insightful, well-informed, and not AGI-pilled. Here’s a solid presentation on how AI affects the tech industry from that perspective.

Alignment and interpretability

A pragmatic vision for interpretability

Google DeepMind’s mechanistic interpretability team has long been doing excellent work on trying to understand what’s going on inside LLMs. They just announced a significant shift in focus, going “from ambitious reverse-engineering to a focus on pragmatic interpretability”. In particular, they are now specifically “trying to directly solve problems on the critical path to AGI going well”.

This seems like a smart and well-thought-out shift, but also probably a modest update toward AGI going poorly. Strong mechanistic interpretability would be extremely useful for ensuring alignment (cf. Dario), and I take this announcement as evidence that we’re not doing as well on that front as we’d hoped.

Are we dead yet?

Concerns about Anthropic’s safety evaluations

Ryan Greenblatt agrees with Anthropic’s assessment of the model’s capabilities, but has concerns about how the evaluations were conducted:

Generally, it seems like the current situation is that capability evals don't provide much assurance. This is partially Anthropic's fault (they are supposed to do better) and partially because the problem is just difficult and unsolved.

I still think Anthropic is probably mostly doing a better job evaluating capabilities relative to other companies.

We are moving ever-closer to ASL-4 dangerous capability levels, and we aren’t ready.

5 Interesting Safety and Responsibility Papers

AI Policy Perspectives has a handy summary of some recent papers, a couple of which I found especially thought-provoking.

Rationality department

Information hygiene

A foreseeable consequence of spending time with sick people is that you are likely to get sick. Similarly, as DaystarEld memorably explains, “if you want to believe true things, try not to spend too much time around people who are going to sneeze false information or badly reasoned arguments into your face.” Just in time for flu season, here’s your guide to information hygiene.

Philosophy department

The 2025 Big Nonprofits List

If you’re planning your charitable giving for next year, Zvi has a guide to nonprofits working on AI safety and some related causes.

AI safety is clearly the most important challenge facing humanity now (or ever) and my partner and I will be directing much of our giving toward groups working to ensure that humanity doesn’t go extinct in the next decade. But we remain big fans of GiveWell, which is perhaps the best place to go for highly effective conventional philanthropy.