Alex Spies
Research Scientist @ FAR.AI working on AI Safety
Mechanistic Interpretability & Representation Learning
About
I'm a Research Scientist at FAR.AI, working on improving the safety of frontier models. I recently completed my PhD in Computer Science at Imperial College London, supervised by Alessandra Russo and Murray Shanahan. My PhD focused on the interpretability of neural networks: both engineering structure directly into learned representations and reverse-engineering emergent structure in unconstrained models.
At FAR.AI I develop methods for detecting and mitigating misaligned behaviour, particularly deception, in frontier models at scale. More broadly, my research interests span mechanistic interpretability, neurosymbolic methods, and AI safety. I've previously used tools like sparse autoencoders, activation patching, and causal interventions to reverse-engineer the internal computations of transformers, and worked on making neural networks more inherently interpretable. Alongside my PhD I worked as a Research Engineer at Epic Games, fine-tuning models on a low-resource language and scaling the deployment infrastructure for production LLM systems. I also previously co-led the UnSearch Research Team, working towards understanding search in transformer-based models.
I'm especially excited about interpretability methods and scalable AI Control schemes for advanced AI systems, as well as evaluating capability profiles and failures of frontier models. I believe that deeply understanding model internals will be crucial for building safe, reliable AI systems.
You can learn more about my research on my publications page. Feel free to reach out if you'd like to chat about AI safety, interpretability, or related topics!