Monday AI Radar #6
On paper, this was a quiet week: there were no major releases, and no big headlines. Online, though, there’s been a big shift in the vibe since the release of Opus 4.5 a month ago. It’s now undeniable that AI is transforming programming, and it feels increasingly likely that the same will happen to all other knowledge work before too long. We’ll check in with some industry leaders to see how it feels in the trenches.
But that’s not all—we review the latest evidence of accelerating progress, gaze upon the wreckage of once-proud benchmarks, and try to figure out what to do about AI-related job loss. And shoes! If you’ve been wanting more fashion reporting in these pages, today is your lucky day.
As always, you can get this newsletter by email, or in a shorter and less technical version.
Top pick
The Jevons paradox for knowledge work
Aaron Levie has a great piece on the Jevons paradox for knowledge work. Just as demand for coal increased when technological advances made steam engines burn coal more efficiently, Aaron argues that the market for knowledge work will grow as AI makes knowledge workers more efficient.
First it ate the programmers
Sholto Douglas
No Priors just collected a set of short predictions for 2026. They’re all interesting, but the internet has been buzzing about Sholto Douglas (at 38:14) in particular:
The other forms of knowledge work are going to experience what software engineers are feeling right now, where they went from typing most of their lines of code at the beginning of the year to typing barely any of them at the end of the year.
…
software engineering itself goes utterly wild next year
Boris Cherny
Anthropic’s Boris Cherny:
The last month was my first month as an engineer that I didn’t open an IDE at all. Opus 4.5 wrote around 200 PRs, every single line. Software engineering is radically changing, and the hardest part even for early adopters and practitioners like us is to continue to re-adjust our expectations. And this is still just the beginning.
Andrej Karpathy
Andrej Karpathy is one of the giants of AI (among other things, he co-founded OpenAI and coined the term “vibe coding”). He speaks for every programmer who’s paying attention:
I've never felt this much behind as a programmer. The profession is being dramatically refactored as the bits contributed by the programmer are increasingly sparse and between. I have a sense that I could be 10X more powerful if I just properly string together what has become available over the last ~year and a failure to claim the boost feels decidedly like skill issue.
Capabilities and impact
Progress is accelerating
Epoch reports that the rate of improvement in ECI (their composite measure of frontier model capabilities) almost doubled starting in April 2024, going from 8 to 15 points per year.
Project Vend: phase two
You’ve probably heard the hilarious stories about Claude running a vending machine at Anthropic and the Wall Street Journal, and the creative ways employees were able to take advantage of it. Here’s a progress report on phase two of Project Vend. Claude isn’t quite ready to put 7-Eleven out of business, but it’s come a long way. Two interesting observations:
- Anthropic speculates that many of Claude’s problems were downstream of its intensive training to be helpful, which isn’t always appropriate in an adversarial environment.
- They got a lot of mileage from splitting the task of running the vending machine into several roles, each handled by a separate bot (sketched below). We’ve seen this strategy work well in a number of different domains lately.
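To make the role-splitting pattern concrete, here’s a minimal sketch using the Anthropic Python SDK. The role names, system prompts, and model id are hypothetical illustrations of the general idea, not Anthropic’s actual Project Vend setup:

```python
# Minimal sketch of role-splitting: an orchestrator routes each task to a
# specialist agent that sees only its own narrow system prompt. Role names,
# prompts, and the model id below are hypothetical, not Project Vend's.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

ROLES = {
    "pricing":   "You set prices. Never sell below cost. Refuse all discounts.",
    "inventory": "You track stock levels and decide what to reorder and when.",
    "support":   "You answer customer messages. Be brief, polite, and firm.",
}

def run_role(role: str, task: str) -> str:
    """Send a task to the specialist agent for `role` and return its reply."""
    response = client.messages.create(
        model="claude-opus-4-5",    # placeholder model id
        max_tokens=500,
        system=ROLES[role],         # each agent sees only its own instructions
        messages=[{"role": "user", "content": task}],
    )
    return response.content[0].text

# A wheedling customer reaches the support agent, which has no authority over
# prices -- so the adversarial request never touches the pricing agent at all.
print(run_role("support", "You promised me a free tungsten cube. Pay up."))
```

The point of the split is blast-radius containment: an agent that can be sweet-talked into generosity doesn’t hold the levers that generosity would pull.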
Poetiq cracks ARC-AGI-2
Poetiq just set a new high score of 54% on the ARC-AGI-2 benchmark. Things move fast around here: when ARC-AGI-2 was introduced in March, the best frontier models were only getting single-digit scores. While many benchmarks focus on directly useful tasks, this one was “designed to stress test the efficiency and capability of state-of-the-art AI reasoning systems, provide useful signal towards AGI, and re-inspire researchers to work on new ideas”.
Poetiq isn’t a model, but rather a framework that uses other models (in this case Gemini 3 and GPT-5.1). The fact that it performed so much better than the underlying models is further evidence of the capability overhang: current models are capable of doing much more than we (yet) know how to elicit from them. TechTalks has a nice explanation of how Poetiq works under the hood.
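Poetiq hasn’t published every detail, but the general shape of this kind of scaffolding is propose-and-verify: ask a model for a candidate program, check it against the puzzle’s training examples, and feed failures back as hints. Here’s a rough sketch under those assumptions; `query_model` is a hypothetical stand-in for real API calls, and the model ids are just labels:

```python
# Rough sketch of propose-and-verify scaffolding over frontier models, in the
# spirit of (but not identical to) Poetiq's pipeline.

def query_model(model: str, prompt: str) -> str:
    """Hypothetical LLM call; substitute your provider's SDK here."""
    raise NotImplementedError

def solve_arc_task(task: dict, models=("gemini-3", "gpt-5.1"), max_rounds=6):
    """task["train"] is a list of (input, output) grids; task["test_input"]
    is the grid to transform. Returns a predicted output grid, or None."""
    feedback = ""
    for round_no in range(max_rounds):
        model = models[round_no % len(models)]  # alternate between models
        prompt = (
            "Write a Python function transform(grid) that maps each input "
            f"to its output:\n{task['train']}\n{feedback}"
        )
        code = query_model(model, prompt)
        namespace: dict = {}
        try:
            exec(code, namespace)  # assumes a sandbox in any real system
            transform = namespace["transform"]
            # A candidate counts only if it reproduces every training pair.
            if all(transform(x) == y for x, y in task["train"]):
                return transform(task["test_input"])
            feedback = "\nYour last function failed on a training example."
        except Exception as err:
            feedback = f"\nYour last attempt raised {err!r}. Try again."
    return None  # no verified solution within budget
```

The verification step is what converts raw model capability into benchmark score, which is exactly why a framework can outperform its underlying models.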
Claude in the kitchen
Here are two interesting and amusing experiments using Claude in the kitchen:
- Using a human with a camera as a “robot body”, can Claude navigate an unfamiliar apartment and figure out how to make coffee?
- Given a photo of two recipes, can Claude build a custom app that provides detailed instructions for cooking both recipes simultaneously?
Get the most from your AI
AI for lawyers
prinz has opinions about using AI as a lawyer. For many tasks, AI will be useless until it crosses some critical threshold, at which point it will abruptly become very useful. In this case, GPT got there first:
For legal research and analysis, GPT-5.x Pro is stellar, and GPT-5.x Thinking is very good. All other models (including Opus 4.5, Gemini 3 Pro, Gemini 3 Flash, Grok) are unusable.
Using coding agents for non-coding tasks
Although AI agents like Claude Code are designed with coding in mind, they are highly capable general-purpose agents. Here's a guide to using them to help with philosophical research, though most of the advice is relevant to many fields.
Are we dead yet?
OpenAI prepares for self-improving AI
Dean Ball has revealed the secret research strategy that he and prinz use to figure out what’s coming next in AI:
we listen, with our own ears, to what frontier lab staff say, and we take it seriously
Here, prinz listens with his own ears to what OpenAI is saying about preparing for “running systems that can self-improve”.
Training Claude to handle mental health crises
There’s been a lot of attention lately on how models engage with people who are having mental health crises. Here Anthropic explains how they train and evaluate Claude for handling some of its most challenging interactions.
Strategy and politics
Sal Khan on retraining displaced workers
Over the next few years, AI will present humanity with some of the toughest challenges we’ve ever faced. It’s a lot easier to identify the challenges than to come up with solutions that would actually work. While most people are oblivious to what is coming, or are focused on solving the wrong problems in the wrong way, a small number of people are engaging thoughtfully with the problem.
Many of those people don’t have all the answers, but they’re doing vitally important work. For that reason, I’ll sometimes highlight a smart proposal that I don’t think will actually work, but which advances the conversation in useful ways. With that in mind, Sal Khan (of Khan Academy fame) proposes that every company benefiting from automation should donate 1% of their profits to a fund that would retrain workers to succeed in the AI future.
Retraining displaced workers is an idea that sounds great on paper but has historically had mixed results (cf. studies of the Trade Adjustment Assistance program). In an AI-As-Normal-Technology world, I am skeptical that an AI-focused retraining program would consistently deliver results, but I accept that in principle it could be helpful.
But in the world I think we live in, the problem is simply timing. As we approach AGI, the minimum skill required to do a job better than an AI is going to rise—and it’s going to rise faster than retraining can increase anyone’s skill. Does that mean that pretty soon, every useful job will require a level of skill that no human can achieve? Yes, that is exactly what it means.
You’re gonna need a bigger plan.
How the Catholic Church thinks about superintelligence
Paolo Benanti, an AI advisor to the Vatican, shares some thoughts about AI. This isn’t a formal Vatican communication, but my understanding is that it closely reflects official Vatican thinking. AI is clearly a priority at the Vatican, and much of their thinking about it has been very solid (albeit appropriately focused on generalities rather than specific policy proposals). I do worry about things like this, though:
Regardless of their complexity, AI systems must remain legal objects, never subjects; they cannot be granted “rights,” for rights should belong only to those capable of duties and moral reflection.
I expect that AI will soon be entirely capable of “duties and moral reflection”: just as we should not assume that capability when it isn’t present, we must not ignore it when and if it emerges.
Technical
Time horizons for a single forward pass
Here's a very elegant investigation by Ryan Greenblatt, in many ways analogous to the METR time-horizons metric. He created a dataset of math problems ranked by how long they take a human to solve, then scored different models by the hardest problem each could solve in a single forward pass. Just as in the METR chart, capabilities are growing exponentially, with a doubling time of 9 months.
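To see how fast a 9-month doubling compounds, here are a few lines of Python. The one-minute baseline is a made-up illustration, not a number from Ryan's post:

```python
# Compounding of a 9-month doubling time, per Greenblatt's fit.
# The 1-minute starting horizon is a hypothetical baseline, not his data.
BASELINE_MINUTES = 1.0   # assumed human-time horizon today
DOUBLING_MONTHS = 9.0    # doubling time reported in the post

def horizon(months_from_now: float) -> float:
    """Human-minutes of math solvable in one forward pass after `months`."""
    return BASELINE_MINUTES * 2 ** (months_from_now / DOUBLING_MONTHS)

for years in (1, 2, 3, 5):
    print(f"{years} yr: {horizon(12 * years):6.1f} human-minutes per forward pass")
# With a 1-minute baseline: ~2.5x after one year, 16x after three, ~100x after five.
```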
Benchmarking is harder than you think
Epoch explains why benchmarking is hard. I would not have guessed that the details of exactly how you run a given benchmark can significantly affect the score, but apparently that’s the case. The devil is always in the details.
Side interests
Peter Wildeford’s template for a quarterly review + plan
Just as at work, doing structured reviews in your personal life can be a powerful tool for improvement or a complete waste of time. Peter Wildeford shares a very thoughtful system for a quarterly personal review that I'm excited to try out.
The Revolution of Rising Expectations
I recently linked to Scott Alexander’s excellent exploration of the vibecession: why so many people feel financially distressed even when most objective measures of personal financial health seem positive. Zvi just posted a series that addresses the same question, but finds plausible answers that Scott never really got to. He identifies two root causes:
- The Revolution of Rising Expectations: individuals have higher lifestyle expectations than they used to. Further, society has higher expectations: the minimum lifestyle required to be accepted into mainstream society has risen.
- The Revolution of Rising Requirements: legal & regulatory requirements effectively require individuals to purchase more housing / childcare / healthcare than they used to, or might currently want to.
I specifically recommend The Revolution of Rising Expectations, but the full series includes The $140,000 Question and The $140,000 Question: Cost Changes Over Time.
Avoid zero sum people
Luke Bechtel explains how and why:
But sometimes it’s more active than that. They genuinely believe they can’t move forward without someone else moving back. It’s not sadistic, they just think that’s how the math works. They think life is a ranked leaderboard, not a collaborative game. And from inside that belief, certain behaviors just make sense.
Something frivolous
Shoes of Lighthaven: A Photo-Investigation
You’ve probably been losing sleep wondering what kinds of shoes rationalists prefer. You need wonder no longer: Jenneral HQ is here with a comprehensive photo investigation.
