Mechanistic Interpretability of Deception in Language Models Trained on Social Deduction Games
Understanding how language models internally represent deception is critical for AI safety. This work investigates whether models fine-tuned on social deduction game transcripts develop interpretable deception representations amenable to linear probing and causal intervention. I fine-tune Llama 3.1 8B with QLoRA on Werewolf Among Us and SocialMaze data under a joint causal language modeling and binary deception classification objective. Layer-wise probing across five training checkpoints (200–800 steps) reveals that linear probes consistently achieve 94–97% deception detection accuracy against a 50–60% shuffled-label control baseline, confirming a genuine learned representation rather than a high-dimensional artifact. Analysis of the probe accuracy profile across layers shows that the deception feature is nonlinearly encoded at the embedding layer but becomes linearly separable by layer 5, thereafter persisting with near-constant accuracy through all subsequent layers. Contrastive activation steering provides causal evidence that this feature governs model behavior: perturbing activations along the identified direction monotonically shifts the deception probe prediction from 0% to 100%, while probe-direction steering confirms that the learned classifier aligns with a causally relevant subspace of the residual stream.
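The probing-and-steering pipeline described above can be sketched in miniature. The snippet below is illustrative only: it uses synthetic activations (random vectors shifted along a planted "deception" direction) as stand-ins for real residual-stream activations, a toy hidden size of 64 rather than the model's actual width, and scikit-learn's logistic regression as the linear probe. The shuffled-label control and the probe-direction steering step mirror the methodology, not the exact implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 64   # hypothetical hidden size for illustration
n = 400        # number of (synthetic) transcript examples

# Synthetic stand-in for residual-stream activations: deceptive examples
# are shifted along one planted direction, honest examples are not.
deception_dir = rng.normal(size=d_model)
deception_dir /= np.linalg.norm(deception_dir)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + 3.0 * labels[:, None] * deception_dir

def probe_accuracy(X, y):
    """Fit a linear probe and report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te), clf

# Real labels vs. shuffled-label control: a large gap indicates a
# genuinely learned linear feature, not a high-dimensional artifact.
real_acc, clf = probe_accuracy(acts, labels)
shuf_acc, _ = probe_accuracy(acts, rng.permutation(labels))
print(f"probe accuracy: {real_acc:.2f}, shuffled control: {shuf_acc:.2f}")

# Probe-direction steering: perturbing an honest example's activation
# along the learned probe direction shifts the probe's prediction.
probe_dir = clf.coef_[0] / np.linalg.norm(clf.coef_[0])
x = acts[labels == 0][0]
for alpha in (0.0, 2.0, 4.0):
    p = clf.predict_proba((x + alpha * probe_dir)[None])[0, 1]
    print(f"alpha={alpha}: P(deceptive)={p:.2f}")
```

In the real experiment the activations come from the fine-tuned model's residual stream at each layer and checkpoint; the control and steering logic are otherwise the same shape as sketched here.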
