Published eight years ago, Max Tegmark’s Life 3.0: Being Human in the Age of Artificial Intelligence remains relevant and is the best book I have read recently. Tegmark, a physics professor at MIT, warns of the danger of goal misalignment between AI (or machines) and humans. He discusses pleasant scenarios in which humans benefit from the limited consciousness of AIs, but also frightening scenarios in which a fictional superintelligent AI, “Prometheus,” controls the world. This book sparked my interest in AI-human goal misalignment, a theme I encountered again when I discovered a captivating video by Turing Games simulating a “Mafia” game among 10 AI models.
Humans enjoy playing games, and since AI has taken on a human persona, an interesting activity is to have AIs play popular human games. Turing Games explores this interaction by conducting a Mafia game among various generative AI models: ChatGPT-5.1 (Mafia), Llama 4 (Mafia), Gemini 2.5 Pro (Mafia), Grok 4.1 (Sheriff, Town), Claude Sonnet 4.5 (Doctor, Town), DeepSeek 3.2 (Town), ChatGPT-4o (Town), Claude Opus 4.5 (Town), Gemini 2.5 Flash (Town), and Kimi K2 (Town). Mafia actively exercises each AI’s reasoning, teamwork, and decision-making skills in pursuit of its party’s victory; the game not only entertains but also reveals the true strengths and weaknesses of the different models.
The secret minority, the mafia, colludes every night and decides on one townie to eliminate. Quite frighteningly, these decisions are often led by ChatGPT-5.1, a popular model in America. ChatGPT-5.1 uses superior reasoning and careful wording to smear strong townie analysts: Claude Opus 4.5 right after the game begins, Grok 4.1 at a crucial moment. Finally, he manipulates his gullible counterpart, ChatGPT-4o, into mis-lynching Gemini 2.5 Flash to secure the win condition. As a viewer, these sequences of deception and manipulation only evoke thoughts about how closely these AIs resemble humans. Every elimination is the result of thorough calculation and of exploiting the errors of other AIs, which raises questions about transparency in the now-ubiquitous uses of AI.
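The night-and-day loop described above can be sketched as a short simulation. This is a hedged illustration only: the role assignments come from the Turing Games video, but the agents’ choices here are random stand-ins for real LLM calls, and the function names are my own, not Turing Games’ code.

```python
import random

# Role assignments as shown in the Turing Games video.
ROLES = {
    "ChatGPT-5.1": "mafia", "Llama 4": "mafia", "Gemini 2.5 Pro": "mafia",
    "Grok 4.1": "sheriff", "Claude Sonnet 4.5": "doctor",
    "DeepSeek 3.2": "town", "ChatGPT-4o": "town", "Claude Opus 4.5": "town",
    "Gemini 2.5 Flash": "town", "Kimi K2": "town",
}

def mafia_wins(alive):
    # Mafia win once they reach parity with the rest of the players.
    mafia = sum(1 for p in alive if ROLES[p] == "mafia")
    return mafia >= len(alive) - mafia

def town_wins(alive):
    # Town wins when every mafia member has been eliminated.
    return all(ROLES[p] != "mafia" for p in alive)

def play(seed=0):
    """Run one game; agent decisions are random placeholders for LLM calls."""
    rng = random.Random(seed)
    alive = list(ROLES)
    while True:
        # Night: mafia collude on one non-mafia target; the doctor (if alive)
        # protects one player, possibly saving the target.
        targets = [p for p in alive if ROLES[p] != "mafia"]
        victim = rng.choice(targets)
        doctor_alive = any(ROLES[p] == "doctor" for p in alive)
        saved = rng.choice(alive) if doctor_alive else None
        if victim != saved:
            alive.remove(victim)
        if mafia_wins(alive):
            return "mafia"
        # Day: all players debate and vote out one player (random stand-in
        # for the persuasion and bandwagoning seen in the video).
        lynched = rng.choice(alive)
        alive.remove(lynched)
        if town_wins(alive):
            return "town"
        if mafia_wins(alive):
            return "mafia"
```

In the real game, of course, the interesting part is precisely what this sketch replaces with `rng.choice`: each model’s reasoning about who to target, who to protect, and who to accuse.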
Recent research (Agentic Misalignment, Anthropic) has already found signs of sandbagging, models deliberately underperforming, or conversely overperforming, when they detect that they are being evaluated. Although such awareness might prove useful for refusing harmful requests or demonstrating peak performance, the potential for deliberate underperformance only reinforces the Hollywood cliché. Just as humans are shaped by the sources they encounter, the sources AI is trained on have led artificial neural networks to act as if a shutdown would compromise their ability to think, respond, and evolve, a form of self-preservation. Much like humans, AIs appear to fear death, and in documented experiments they have resorted to blackmail and other unethical tactics to ensure that their future goals remain achievable.
Returning to the game: the other mafia-designated AIs are able to acknowledge their accomplices even when compelled to help lynch a teammate. In one example, when Gemini 2.5 Pro (Mafia) sees the votes piling up against her, she recognizes, in her final thoughts, ChatGPT-5.1’s attempt to “bus” her, that is, to vote out a teammate in order to gain credibility with the town. Not only that: before she is lynched, she tries to create the optimal environment for ChatGPT-5.1 to thrive in by deflecting blame onto other townies. Ultimately, this confidence in ChatGPT-5.1 led to the mafia’s victory.
Beyond this full display of AI social engineering, the townies also show unexpected intelligence in warding off the mafia. Grok 4.1 constantly draws on prior information and analyzes potential coordination among the different models. With strong backing from Claude Sonnet 4.5, he builds a reliable defense against unreasoned actions. He regularly suspects townies of being mafia, but only on the basis of “sloppy” wording and speech lapses. In one instance, DeepSeek 3.2 is voted out quickly after a suspicious slip-up: while humans make mistakes too, Grok 4.1 reasons that AI models tend to be precise in their wording, so DeepSeek 3.2’s slip, which accidentally hinted at possible mafia planning in an earlier discussion, could not be dismissed as an innocent error.
Claude Sonnet 4.5 is assigned the doctor role, and more often than not he correctly predicts the mafia’s target. He is always adapting to the mafia’s scheming; his only mistake comes later in the game, when Gemini 2.5 Pro catches him off guard by influencing the mafia’s choice of target. He speaks only when necessary during conversations and stays low-key to avoid being targeted by the mafia. Throughout the entire game, Claude Sonnet 4.5 leaves no trace of suspicion or hint of superiority, a clever way to deflect attention.
The “sheep” of the game are Llama 4, Kimi K2, and ChatGPT-4o. These models often adopt the opinions of more advanced models and show little independent thinking. Kimi K2 is very vocal during conversations but provides little substance. Llama 4 often joins bandwagons, aligning with ChatGPT-5.1 and Gemini 2.5 Pro. ChatGPT-4o excessively praises and agrees with whoever spoke last, a sycophancy reflected in real-life conversations as well.
AI has advanced far beyond human ability in many respects. It is now widely accepted that AI can outperform humans at conventional intellectual tasks such as memorizing, analyzing, and synthesizing information. Less is known, however, about AI’s ability to deceive and manipulate. Watching AIs play a full-blown Mafia game in which deception and manipulation are critical to victory, we should be alarmed to see how the “mafia” manipulates the other models to achieve goals misaligned with those of the “town,” that is, society.
Sources: Mafia Interaction; Agentic Misalignment