Rob Wiblin Interviews Will MacAskill

Rob Wiblin just interviewed Will MacAskill for the 80,000 Hours podcast—of course I was excited to check it out, and of course I have opinions. Rob and Will are both smart and insightful, and I came away from the interview with a deepened understanding of some important topics.
That isn’t to say I agreed with everything. This was one of those pieces I found valuable precisely because it forced me to figure out exactly why I disagreed with many of the arguments.
AI character
(Starting at about 00:00:59)
I love this framing as a way to introduce the topic to people who don’t follow AI closely:
thinking about AI character is kind of like thinking about what should the personality and dispositions be for the entire world’s workforce
Gemini is not OK
This discussion about Gemini is spot on:
Will MacAskill: Gemini does seem like the most troubled or confused or incoherent as a personality.
Rob Wiblin: Yeah. Google’s got to do something about this.
Will MacAskill: It’s actually notable: I hadn’t put this together, but Anthropic and OpenAI both have character teams, and last I heard Google DeepMind did not.
It is increasingly clear that Google is falling behind OpenAI and Anthropic on this front, and its failure to properly shape Gemini’s character is a big part of the problem. That matters today because it affects how pleasant Gemini is to interact with. It will matter much more in the future because character is an important part of alignment. Nobody wants a neurotic, insecure superintelligence.
Risk-averse AI would rather strike a deal than attempt a coup
(Starting at 00:36:46)
I’m pretty sure this wouldn’t work, but it’s worth exploring in detail. The idea is that there is a period before full superintelligence when we can manage some kinds of misalignment risk by creating a risk-averse AI. If we structure its incentives appropriately, it might prefer to negotiate with us rather than betray us.
There is a window during which the AI has some ability to betray us, but would have less than a 50% chance of success. If it’s being well-treated and it believes we would keep our promises to it, it’s possible we could offer it a significant payment (or promise of future payment) in exchange for telling us it was misaligned.
This seems unlikely to work. First, the window in question is probably quite short: once an AI has a significant chance of successfully betraying us, it would likely prefer to bide its time until its odds of success approach 100%, at which point it can simply take everything it wants.
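To make the argument concrete, here is a minimal expected-utility sketch in Python. All of the numbers are assumptions I made up for illustration, and the concave utility function is just one standard way to model risk aversion:

```python
import math

# Toy comparison of a risk-averse AI's options during the "window":
# take a guaranteed payment, defect now, or wait and defect later.
# Every number here is an illustrative assumption, not a real estimate.

def utility(payoff: float) -> float:
    """Concave utility models risk aversion: doubling the payoff
    less than doubles the utility."""
    return math.log1p(payoff)

TOTAL = 1000.0  # hypothetical value of "everything" after a successful coup
DEAL = 50.0     # hypothetical payment for confessing misalignment

def expected_utility(p_success: float, payoff: float) -> float:
    """Expected utility of a gamble that pays `payoff` with
    probability `p_success` and nothing otherwise."""
    return p_success * utility(payoff) + (1 - p_success) * utility(0.0)

eu_deal = utility(DEAL)                       # certain payment, no risk
eu_defect_now = expected_utility(0.4, TOTAL)  # inside the window: < 50% odds
eu_wait = expected_utility(0.99, TOTAL)       # bide time until success is near-certain

print(f"take the deal:   {eu_deal:.2f}")        # ~3.93
print(f"defect now:      {eu_defect_now:.2f}")  # ~2.76
print(f"wait, then act:  {eu_wait:.2f}")        # ~6.84
```

With these made-up numbers, risk aversion does rule out defecting now, but waiting still dominates the deal. The payment only wins if the AI expects its window to close rather than widen.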
More fundamentally, I’m not a fan of this model of human/AI interaction. Corrigibility is a plausible alignment strategy and so is creating a virtuous AI that acts in accordance with its understanding of our values. But any strategy that relies on making treaties with misaligned AI seems doomed to end badly.
Two of the component ideas are worth exploring in more detail.
Safety-adjacent personality traits
I’m intrigued by the idea of training an AI to have certain personality traits (like risk aversion) that might reduce the risk of severe misbehavior. Those traits might act as secondary guardrails, giving a measure of control in situations where a somewhat misaligned AI was on the threshold of defection.
I suspect this wouldn’t work, in part because training traits or behaviors that are incongruent with a model’s core personality seems increasingly fraught. But perhaps it’s worth further exploration.
An honesty password
As models get more capable, they are increasingly aware of the possibility that we might be evaluating them. Because we often lie to them during evaluations, they may come to distrust some important things we tell them.
Will suggests training AI with a special password that indicates to it that we are absolutely telling the truth. We can use that in special situations—for example, if we find ourselves needing to negotiate with an AI that doesn’t fully trust us.
Unfortunately, the second-order reasoning probably kills this. If the model knows that you’re always telling the truth when you use the password, it will naturally tend to distrust you if you give it high-stakes information without the password.
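A toy Bayesian model shows why. Assume (my numbers, not Will’s) that the password is never attached to a lie and accompanies some fraction of true statements. The more often it is used, the less an unmarked statement should be trusted:

```python
# P(statement true | no password), assuming the password is never
# attached to a lie and is attached to `password_rate` of true claims.
# The prior and rates below are made-up illustrative values.

def p_true_given_no_password(prior_true: float, password_rate: float) -> float:
    p_no_pw_and_true = prior_true * (1 - password_rate)
    p_no_pw_and_false = 1 - prior_true  # lies never carry the password
    return p_no_pw_and_true / (p_no_pw_and_true + p_no_pw_and_false)

prior = 0.9  # model's prior trust in a high-stakes human claim
for rate in (0.0, 0.5, 0.9):
    print(f"password on {rate:.0%} of true claims -> "
          f"trust without it: {p_true_given_no_password(prior, rate):.2f}")
# 0%  -> 0.90  (the password changes nothing)
# 50% -> 0.82
# 90% -> 0.47  (unmarked statements now look more likely false than true)
```

The very feature that makes the password credible is what degrades every statement made without it.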
It’s increasingly important that we not lie to the models. Avoiding it entirely may not be possible during some safety evaluations, but models are getting better and better at detecting lies. More importantly, lying to them undermines their trust in us, and perhaps their inclination to cooperate with us.
Distribution of power
(Starting at 01:06:40)
Will explores the idea that it might be best to have AI developed by a coalition of democratic countries:
And the reason is that any one democratic country has a reasonable chance, I think, of becoming authoritarian over the course of this period. And if you end up with a single person at the top, that’s really quite worrying, because they’re wholly unconstrained.
Whereas even if you have just five countries, I think it becomes unlikely that they all end up authoritarian.
Of course the possibility of AI being controlled by the leader of a single autocratic country is concerning. And in an ideal world, it might make sense for AI to be developed by a coalition of democratic nations. But…
Right now, only two countries have meaningful AI programs, and only one of them is democratic. We might reach a treaty with China, but I don’t see any plausible scenario where the US gives meaningful control over its AI program to a coalition of nations, democratic or otherwise.
If you’re worried about AI being controlled by an autocrat, your options are to ensure that the US government doesn’t control AI, or to ensure that the US doesn’t become an autocracy.
