Against Moloch
May 11, 2026

Monday AI Radar #25

Can the metrics keep up with the models?

Wide elevation view of a vast wall-mounted plotting chart, seen from directly in front. The chart fills most of the frame and is divided into two regions with visibly different grid densities: a large rectangle of finely-ruled slate-blue grid occupies the lower-left, while a coarser grid wraps around it in an L-shape — running across the upper portion of the chart and continuing down the right side. A single fine slate-blue trendline rises from the lower-left of the fine-grid region in a gently accelerating curve, crosses the boundary at the inner corner where the coarse-grid L meets the fine-grid rectangle, continues briefly onto the coarse-grid region, and terminates cleanly within the chart paper. A small amber-gold starburst marks the precise point of the crossing. On the right side of the frame, a tall scaffolding platform stands in front of the wall: three small human figures in slate blue work on the scaffolding doing finishing tasks — one runs a smoothing roller along the seam, another inspects a section of the new ruling, a third passes a tool. A simple ladder rises from ground level to the scaffolding. At ground level, two more figures stand back from the chart: one gestures toward the seam with hand raised, the other holds a rolled bundle of paper prepared for a future extension. The chart is held to the wall by dark grey brass clips at regular intervals along its top and side edges; vertical support beams behind the chart and a horizontal track above suggest a working installation. Along the lower edge of the frame, a parallel rule, a pair of dividers, an open ink pot, and a T-square rest on a low shelf. All primary linework is in fine slate blue (#334155) on an off-white linen-textured background; the small amber-gold starburst is the only non-slate color in the image.

We’re racing toward AGI, but at least we have good metrics to guide us, right?

METR’s Time Horizon 1.1 is saturating, and we don’t have a clear replacement. While Claude is making rapid progress on Anthropic’s current misalignment assessments, those assessments won’t be sufficient for AGI. All is not lost: we have two pieces from Epoch discussing strategies for future benchmarks, and a promising new benchmark for assessing whether AI is a good influence on people. Beyond formal evaluations, Andon Labs reports on a (partly successful) experiment in having AI run a coffee shop.

Top pick

Teaching Claude why

What’s behind Claude’s recent rapid progress on Anthropic’s misalignment assessments? Anthropic shares some new findings—the headline is that they’re getting good results from training Claude to reason about moral principles and its constitution.

They find that training against a particular bad behavior (like blackmail) is effective at reducing that behavior, but does not generalize well to other forms of misalignment. On the other hand, training on moral principles generalizes well to a range of different situations. Moral reasoning specifically is important: merely using examples of good behavior isn’t sufficient. These results are further evidence that Anthropic’s approach to alignment is working well, at least at current capability levels.

There’s another interesting finding buried in the article: they believe the misaligned behaviors they were investigating originated primarily in pre-training rather than misaligned rewards during post-training. That’s encouraging since failures during RL are considered a likely source of future catastrophic misalignment.

Related: Zvi collects some recent Twitter discussions about how Anthropic sees Claude and itself.

Forecasts

Is AI 2027 Coming True?

Steve Newman asks whether recent progress is keeping pace with the AI-2027 scenario, exploring three perspectives:

The dismissive perspective is obviously wrong, so he focuses on how to tell whether we’re on an explosive trajectory or an “it’s complicated” one:

It’s under-appreciated that the explosive and it’s-complicated views do not diverge much in their predictions until the advent of “strong AGI”.

Steve suggests five leading indicators to watch:

It’s characteristically strong work, though if we get to recursive self-improvement, the answer to all the other questions quickly becomes “yes”.

Three views on the future of work

In a similar vein, Teddy Tawil at Carnegie Endowment identifies three schools of thought on how AI will impact the job market:

All those disagreements, he argues, can be distilled into two core disputes:

  1. How quickly will AI gain capability and diffuse into the economy?
  2. To what extent will AI create new jobs?

The employment data remain ambiguous, which is consistent with all three schools of thought. That’s likely to continue until right before we see substantial changes in the data. If we start preparing for job impacts when we know for certain what they’ll look like, it’ll be far too late.

Benchmarks

Task-Completion Time Horizons of Frontier AI Models

METR estimates that Claude Mythos Preview has a time horizon of at least 16 hours at a 50% success rate.

Scatter plot from METR titled
It’s still the most important chart in AI, but…

Time Horizon 1.1 cannot reliably measure time horizons beyond 16 hours, so the 50% success version of the benchmark is on the verge of saturation. Even on the 80% success version, Mythos comes in at 3 hours and 6 minutes—at current doubling rates, that might also saturate before the end of the year.

I’m hopeful that METR will find ways to measure longer time horizons. If not, the Epoch Capabilities Index (ECI) is the most obvious replacement candidate for “most useful chart in AI”.

Our AI started a cafe in Stockholm

Hot on the heels of their recent AI-managed storefront in SF, Andon Labs has opened an AI-managed cafe in Stockholm.

This is an expensive but useful type of experiment: managing a store requires navigating real-world chaos in a way that’s hard to duplicate in a benchmark. For now the AIs do surprisingly well, but are not yet close to human performance. These experiments are on my list of likely leading indicators for significant job displacement.

Are AI benchmarks doomed?

We’re facing a minor benchmark crisis: many of our best benchmarks are close to saturation, and frontier models are good enough that it’s increasingly difficult and expensive to create replacements. Epoch brings us two explorations of what might come next.

Greg Burnham suggests we can still create useful benchmarks by giving up any one of four attributes of current benchmarks:

Tom Adamczewski, Greg Burnham, and Anson Ho are optimistic about the feasibility of creating new AI benchmarks. They explore a variety of solutions including benchmarks that scale easily (like MirrorCode), “bite-sized benchmarks”, and benchmarks that are expensive but valuable enough to be economically viable.

I expect we’ll continue to have useful benchmarks but they’ll look quite different from current benchmarks, just as Time Horizon looks different from GPQA.

Alignment and interpretability

Yoshua Bengio’s Scientist AI

80,000 Hours’ Rob Wiblin talks with Yoshua Bengio about his Scientist AI concept:

a mind that has internalized the laws of nature and uses them to make predictions, but without predilection about how things unfold. It is a highly intelligent machine that uses probabilistic reasoning to understand the world, but with no hidden goals or preferences. Its predictions are transparent, auditable and verifiable.

This seems like a long shot, but worth pursuing. Smart money says current LLM technology can probably go all the way to superintelligence, but that’s not guaranteed. We are under-investing in alternative approaches like Scientist AI.

Cybersecurity

How good is Mythos?

Point Estimate asks whether Mythos is a big leap forward in capability, or roughly on trend, including for cyber capabilities. They tentatively conclude that Mythos is probably one to two months ahead of the trendline but does not represent a major discontinuity (but see below).

Hardening Firefox with Claude Mythos Preview

Mozilla provides further evidence that Mythos is a big deal:

Just a few months ago, AI-generated security bug reports to open source projects were mostly known for being unwanted slop. […]

It is difficult to overstate how much this dynamic changed for us over a few short months. This was due to a combination of two main factors. First, the models got a lot more capable. Second, we dramatically improved our techniques for harnessing these models — steering them, scaling them, and stacking them to generate large amounts of signal and filter out the noise.

Bar chart titled
One of these months is not like the others

Seemingly high-quality analyses keep coming to different conclusions about Mythos’ cybersecurity capabilities. My best guess is that it’s modestly ahead of the trendline on raw capabilities, but is also the first model to cross a critical capability threshold for real-world impact. If that’s the case, we should soon see reports of other models generating large numbers of high-value vulnerabilities on major projects.

Strategy and politics

AI is changing our minds. When is that a good thing?

We know that AI can change people’s opinions, but how can we tell if it’s having a positive influence? Writing for Cosmos Institute, Maximilian Kroner Dale, Paul de Font-Reaulx, and Luke Hewitt propose a clever metric based on deliberative polling.

Deliberative polling brings together a group of citizens to deliberate on a specific topic. They are given a high-quality briefing and access to an expert who can answer questions about the topic, before deliberating in groups. At the end of the process, they are polled on their opinions. Deliberative polling is widely considered to generate better policy recommendations, while remaining true to the citizens’ values and goals.

DeliberationBench measures whether an AI influences a user’s opinions on a topic in the same direction as deliberative polling. They find that several frontier AIs do indeed move users’ opinions in the same direction as deliberative polling, which is an encouraging sign (although it does not reduce partisan polarization the way deliberative polling does).

The value of deliberative polling is supported by many years of data: using it as a baseline for positive influence is a smart move. I’d love to see more work in this direction.

The AI ad-hoc prior restraint era begins

Zvi contemplates the possibility that the White House might impose some kind of prior restraint on new model releases. Pre-deployment vetting would be great—capricious political interference would be actively harmful. This is exactly correct:

If implemented well, this could be the right thing.

By default, it won’t be implemented well.

A voluntary path to pre-deployment AI vetting

Dean Ball and Kevin Frazier explore options for pre-deployment vetting of frontier models. They conclude there is no clear legal basis for the White House to mandate vetting, but argue that the Center for AI Standards and Innovation (CAISI) is well-suited to administering a voluntary vetting system.

Voluntary vetting is likely the best-achievable option in the short run: it would be immediately beneficial and would provide a good framework for more extensive vetting and regulation in future. It has the further advantage of being less capricious and more predictable than ad-hoc vetting by the executive branch would be.

China

Notes from inside China's AI labs

Nathan Lambert reports back from his visit to China. Culture, as they say, eats strategy for breakfast—it’s rare and valuable to get a peek inside the culture of the big Chinese labs.

Two of his observations fit my previous understanding:

And two of them I wasn’t expecting:

How to buy cheap Claude tokens in China

Claude is popular in China, which is strange since both Anthropic and the Chinese government prohibit its use there. What’s going on?

ChinaTalk has a fascinating article about the extensive and sophisticated ecosystem of “transfer stations” that illicitly resell Claude inside China. The system is capable and resilient, but full of all kinds of creative fraud.

It’s a fun read and a good case study in the difficulty of imposing top-down control on an online service.

Briefly

Outsourcing to AI

kache:

you can outsource your thinking but you cannot outsource your understanding

Correct, no notes.

Task substitution and uplift

METR offers an elegant way of thinking about how AI will increase real-world productivity:

uplift on old tasks <= uplift in value <= uplift on new tasks

Genesis AI robotics demo video

Impressive. Demos always show a skewed version of what a product is actually capable of. More interesting than the content of any specific demo is the rate of progress between demos: like AI, robots are improving quickly.