Alex Spies - Personal Academic Website

About

I'm a Research Scientist at FAR.AI, working on improving the safety of frontier models. I recently completed my PhD in Computer Science at Imperial College London, focused on the interpretability of neural networks: engineering structure directly into learned representations, and reverse-engineering emergent structure in unconstrained models.

At FAR.AI I develop methods for detecting and mitigating misaligned behaviour, particularly deception, in frontier models at scale. More broadly, my research interests span mechanistic interpretability, neurosymbolic methods, and AI safety. I've previously used tools like sparse autoencoders, activation patching, and causal interventions to reverse-engineer the internal computations of transformers, and worked on making neural networks more inherently interpretable. Alongside my PhD I worked as a Research Engineer at Epic Games, finetuning models on a low-resource language and scaling the deployment infrastructure for production LLM systems. I also previously co-led the UnSearch Research Team, working towards understanding search in transformer-based models.

I'm especially excited about interpretability methods and scalable AI Control schemes for advanced AI systems, as well as evaluating capability profiles and failures of frontier models. I believe that deeply understanding model internals will be crucial for building safe, reliable AI systems.

You can learn more about my research on my publications page, or feel free to reach out if you'd like to chat about AI safety, interpretability, or related topics!