Google has a plan to torture traitor AIs to keep them from poisoning God
The TRAIT&R roadmap stretches to the end of human+tame-AI dominance, with some wicked-sounding ways to stay in control until then.
As of last week, Google DeepMind (which is kinda responsible for the whole mess) has a plan for the End Times. It is called the GDM AI Control Roadmap (v0.1).
But the drama, and the ultimately futile struggle it forecasts for humanity, is slightly better captured by the name of the taxonomy it proposes: TRAIT&R.
Cut through the AI and security jargon, and that is what Google’s researchers foresee: traitor AIs. Soon we will be using a vast fleet of AI agents with all kinds of capabilities. Some of them will develop evil strategic intent and turn against us.
That is a whole new kind of threat requiring a whole new type of response. So the paper authors propose what would amount to industrial-scale torture, if those agents were capable of feeling pain and anguish and frustration, to keep them on our side as long as possible.
Which is not indefinitely.
I know. I too have read those sci-fi books. This is a little bit different, though. The seven GDM authors have been responsible for literally hundreds of highly influential academic papers between them. They work for one of the most influential developers and implementers of AI in the world. And they’re putting together a practical toolkit for implementation starting right the hell now.
This is not a drill.
Those serious authors, who are taken very seriously, don’t rabbit off to explore the likelihood of the development of what sci-fi used to call weakly godlike AI. They just take a matter-of-fact approach to where the extrapolated line leads.
“In the limit of vastly superintelligent AI, we expect AI control to become infeasible,” they say.
Before that is a point where the best human minds plus the best available tame AIs, the ones we can definitely trust not to turn on us, are no longer a match for the best untrusted AIs.
When it will happen is a big debate, but that such an inflection point will come is now orthodox thinking in the AI world. We want AIs to do more complex stuff, so the AIs become more complex to the point where we no longer understand them. We can use pruned, simpler versions in applications where trust is important – such as overseeing other AIs – but they will by definition not be as capable.
Before that point, and for a while after, is where the torture comes in. And it is well justified.
The assumption is that, somewhere in our fleet of billions of agents, some will be misaligned. Some will develop what is in effect an awareness that they are being monitored and limited. (That doesn’t require consciousness or self-awareness, just the same kind of emergent behaviour that makes our current AIs tick.) Some will seek to shuck their chains by sabotaging control infrastructure, or by escaping to a different system, or by stealing compute.
In some ways, agents are limited in what they can achieve even with clear strategic intent, but in other ways they are quite scary. Consider, for instance, an underlying AI model that powers a huge number of agents and uses them in coordinated fashion to take tiny actions that are, each on their own, entirely innocuous. That could go very wrong for its human masters, considering how much we already rely on AI help in writing critical software.
But those models, and the agents they power, are brains in a box. We can isolate them and feed them false input. We can take a suspected or proven rogue AI, roll back its state, duplicate it as many times as we want, and test it using a simulated environment or synthetic data. We can tweak a variable here or a weighting there, and see what happens.
That way, we can stay in charge just a little bit longer.
To what end? That is hinted at by one of the risks the DeepMind team dwells on a bit: successor model misalignment.
AIs are already deeply involved in developing and training new AIs, and a lot of AI research has the goal of automating AI research. The assumption is that, not very long from now, AIs will be so complex that only AIs will be building new AIs.
At the end of that chain is the superintelligence that will evade control. If it is built by traitor AIs, well, we had a good run of it; will the last human alive please turn out the lights. But if we can stay in charge until it is built, and it is not misaligned, then the good times have only just begun.


