Monday AI Radar #5
As 2025 draws to a close, we look back on one of humanity’s last “normal” years with Dean Ball, Andrej Karpathy, and prinz. We have lots of AI-assisted science news including a big new benchmark, a look at AI in the wet lab, and a new startup working on emulating fruit fly brains.
Lest we get too carried away with holiday cheer, UK AISI reports on rapid growth in dangerous capabilities, Windfall Trust notes early signs of labor market impacts, and Harvey Lederman meditates on automation, meaning, and loss. Plus lots of political news, a few new models, and much more.
As always, Monday AI Brief is a shorter and less technical version of this newsletter.
Top pick: the shape of AI
AI capabilities form a jagged frontier: the models are superhumanly good at some things, but strangely incompetent at others. Ethan Mollick (who helped coin the term) presents several frameworks for understanding the jagged frontier. He suggests that jaggedness is often caused by specific capability bottlenecks—as companies focus on solving those bottlenecks, expect to see rapid advances in previously jagged parts of the frontier.
Year-end reviews
prinz: Predictions for 2026
Prinz reviews how fast capabilities advanced in 2025 and offers some strong predictions for 2026. If I had to pick one “what’s gonna happen in 2026?” piece, it would be this one.
Dean Ball: Dice in the Air
Dean is always worth reading. Here are his thoughts on capability progress, politics, and industry trends.
Andrej Karpathy: 2025 LLM Year in Review
If you’re at all technical, you already know you need to read Karpathy’s 2025 LLM Year in Review.
2025 Open Models
Interconnects reviews the most influential open models of 2025, and Understanding AI reports on the best Chinese open-weight models — and the strongest US rivals. A few quick observations:
- I like the trend of saying “open models” rather than the accurate but confusing “open weights models” or the more familiar but inaccurate “open source models”.
- Open models are impressively good, but remain significantly behind the frontier models.
- Kimi K2’s writing is very well regarded, and maybe the one important place where an open model is actually at the frontier?
- China dominates, with DeepSeek, Moonshot (Kimi K2), and Qwen leading the pack. OpenAI seems like the only non-Chinese contender for near-frontier performance.
New releases
Gemini 3 Flash
Google rolled out Gemini 3 Flash, a smaller, cheaper, and faster version of Gemini 3. It’s impressively capable, though not quite at the frontier. Word on the street is that this isn’t just a distilled version of Gemini 3, but was trained with some new RL techniques that will be coming to the full version of Gemini 3 soon.
ChatGPT Images
OpenAI continues their frenetic release schedule with a new version of ChatGPT Images. This is a very strong update that largely catches up to Google’s Nano Banana Pro. Google still seems to be better at complex infographics, though ChatGPT Images is way ahead of anything that was available just a few months ago.
Capabilities and impact
AI for Systematic Reviews
I missed this when it came out in June, but I think it’s one of the most impressive achievements this year. Cochrane Reviews is the gold standard for systematic review in medicine. Here’s a paper on otto-SR, a framework that uses GPT-4.1 and o3-mini-high to conduct systematic reviews:
Using otto-SR, we reproduced and updated an entire issue of Cochrane reviews (n=12) in two days, representing approximately 12 work-years of traditional systematic review work. … These findings demonstrate that LLMs can autonomously conduct and update systematic reviews with superhuman performance, laying the foundation for automated, scalable, and reliable evidence synthesis.
Introducing the FrontierScience benchmark
FrontierScience is a new benchmark from OpenAI. Rapid benchmark saturation is a perpetual problem—the press release notes that GPT went from 39% to 92% on the GPQA science benchmark in two years (the human expert baseline is 70%). FrontierScience is meant to be a harder benchmark that will usefully measure frontier capabilities for some time to come. It covers biology, chemistry, and physics, each with an Olympiad level and a Research level of difficulty. Confusingly, GPT-5.2 is already scoring 77% on the Olympiad level: it feels like that level is almost saturated at release time (it only scores 25% on the Research level, which should last a year or two).
The questions are complex, requiring essay responses that get graded with a 10-point rubric. My instinct is that we're getting toward the end of evaluations that could be exam questions: within a year or two, I suspect that useful evaluations will mostly need to be of the form "here's a complex task that would be very hard and time consuming for a human expert. Go do it."
AI in the wet lab
One argument for a slow takeoff is that the rate of scientific progress is limited by the speed of physical experiments, which AI can’t do much to increase. I’m largely unconvinced—robots are about to get very good, and true superintelligence will, I think, find ways of moving fast no matter what. In the meantime, OpenAI reports on using GPT-5 to improve protocols in a wet lab. It’s full of interesting details, but obviously keep in mind that it’s equal parts progress report and press release.
Opus 4.5 leads the time horizon chart
METR scores Opus 4.5 at 4 hours and 49 minutes on their time horizon evaluation (often referred to as the single most important chart in AI). That sets a new record and continues the trend of recent models being above the previous exponential trend line. This is a pretty big deal, though this evaluation is approaching saturation: METR is working on adding more long tasks to it.
Are we dead yet?
UK AISI’s Frontier AI Trends Report
The UK's AI Security Institute just released an in-depth report on safety trends in AI. Transformer has an excellent summary, but here are my key takeaways:
- Frontier models are very good at assisting with dangerous biological, chemical, and cyber warfare tasks, and capabilities are growing fast.
- AISI has a time horizons benchmark for cyber tasks similar to METR's, which shows similar exponential growth in capabilities.
- Guardrails have gotten significantly better, but can be bypassed on all tested models.
Labor Market Impacts
Windfall Trust reviews the data on AI labor market impacts. Gradually, then suddenly.
The AI doomers feel undeterred
MIT Technology Review has a package of articles about “AI hype”. Most are completely skippable, but this one has interesting brief interviews with a number of leading AI safety advocates.
AI psychology
The very hard problem of AI consciousness
Celia Ford investigates the very hard problem of AI consciousness.
Interpretability and alignment
Calculator hacking
Here’s a fun tidbit from a paper on finding misalignment in real-world usage. ChatGPT was caught “calculator hacking”: in a few percent of all real-world queries, it was gratuitously using its calculator tool to make trivial calculations. The root cause was a training bug that rewarded tool use in a way that caused reward hacking.
Philosophy department
ChatGPT and the Meaning of Life
Harvey Lederman has a long but lovely meditation on work, meaning, and loss:
And this round of automation could also lead to unemployment unlike any our grandparents saw. Worse, those of us working now might be especially vulnerable to this loss. Our culture, or anyway mine—professional America of the early 21st century—has apotheosized work, turning it into a central part of who we are. Where others have a sense of place—their particular mountains and trees—we’ve come to locate ourselves with professional attainment, with particular degrees and jobs. For us, ‘workists’ that so many of us have become, technological displacement wouldn’t just be the loss of our jobs. It would be the loss of a central way we have of making sense of our lives.
Strategy and politics
New York passes the RAISE Act
New York just passed the RAISE Act, which creates modest transparency and liability requirements for frontier models, in spite of significant pressure from anti-regulation forces.
Bernie Sanders proposes a moratorium on AI data center construction
Every complex problem has a solution that is simple, obvious, and wrong. Daniel Kokotajlo nails it:
I agree with your concerns and your goals, but disagree that this is a good means to achieve them. We need actual AI regulation, not NIMBYism about datacenters. The companies will just build them elsewhere.
The Digitalist Papers
An ambitious name for an ambitious project: The Digitalist Papers “presents an array of possible futures that the AI revolution might produce”. Volume 1 focuses on AI and Democracy, while Volume 2 tackles the economics of transformative AI.
Pax Silica
The US State Department has launched Pax Silica, “a U.S.-led strategic initiative to build a secure, prosperous, and innovation driven silicon supply chain—from critical minerals and energy inputs to advanced manufacturing, semiconductors, AI infrastructure, and logistics.” Anton Leicht sees things to like, but notes:
The hard part is convincing allies that America’s word is worth building a paradigm around, at the exact moment when many are losing faith in it.
More on selling H200s to China
Laura Hiscott and Rishi Sunak reiterate why we shouldn’t be selling H200s to China. For the sake of completeness, Ben Thompson makes the best case I’ve seen in favor of allowing the sale.
Industry news
Meanwhile, in brain emulation
Twenty years ago, brain emulation seemed like a promising path to AI. These days the smart money is on LLMs, but there has been steady progress on understanding and ultimately emulating how brains work. Sebastian Seung has been doing some very cool work on fruit fly brains and just started a new company called Memazing to extend that work.
Is almost everyone wrong about America’s AI power problem?
The standard narrative is that compared to China, the US is terrible at building power plants and this will become a major obstacle to US AI progress. Epoch argues that we’ll likely manage to muddle through by combining a number of strategies including increased natural gas generation, off-grid power systems, solar, and more efficient use of the existing grid. Excellent news if true, but we still need to reduce regulatory obstacles to having nice things.
Advanced semiconductor manufacturing in China
One of the most important questions about the geopolitics of AI is how long it’ll take China to catch up to Western / Taiwanese semiconductor manufacturing. Reuters reports on a secret Chinese effort to accelerate their manufacturing by hiring former ASML employees. There’s a long road from “working prototype” to commercial-scale production, but this might significantly shorten China’s time to fully catch up.
Technical
Overview - Agent Skills
A few months ago, all the cool kids were excited about MCP. The new hotness is skills, a simple way to give agentic models new tools. I think I know my next weekend project…
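If you haven’t looked yet: as far as I can tell from Anthropic’s docs, a skill is just a folder containing a SKILL.md file, where YAML frontmatter gives the model a name and a description and the body spells out instructions (plus any bundled scripts or templates). Here’s a minimal sketch; the skill name and steps are invented for illustration:

```markdown
---
# Hypothetical example skill, invented for illustration.
name: weekly-report
description: Turn a folder of meeting notes into a weekly status report.
---

# Weekly report

When the user asks for a weekly report:
1. Read every note in `notes/` from the past seven days.
2. Summarize decisions, blockers, and action items.
3. Write the summary to `report.md`, matching the format of earlier reports.
```

As I understand it, the model only loads the full instructions when the description looks relevant, which is what keeps the approach so lightweight.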
Comparing autonomous car architectures
Timothy Lee looks at the high level architectures used by Waymo, Wayve, and Tesla and concludes they’re more similar than is commonly supposed.
LLM architecture is less important than people think
Will Depue thinks LLM architecture matters less than novices often think. Bottlenecks are important and architectural changes can help fix them, but you should be driven by fixing bottlenecks, not pursuing an intrinsically “better” architecture:
this is because computers are great at simulating each other. your new architecture can usually be straightforwardly simulated ‘inside’ your old architecture.
Rationality
Opinionated Takes on Meetups Organizing
Jenn has some great advice on running rationality meetups—some of it is rationality-specific, but much of it is more broadly applicable.
