Our group's goal is to conduct research that contributes to AI Safety. Our main focus is to research, benchmark, and develop robust lie detectors for LLMs.
Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generate statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS' BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS' BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it is not possible to determine whether the model lied from the transcript alone. Overall, LIARS' BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection. The benchmark is available on Hugging Face.
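Because the benchmark is distributed on Hugging Face, it can be loaded with the standard `datasets` library. The sketch below is illustrative only: the repository id, split, and column names are placeholders, not the benchmark's actual identifiers.

```python
# Minimal sketch of loading and inspecting LIARS' BENCH via the Hugging Face
# `datasets` library. The repository id, split, and field names below are
# illustrative placeholders, not the benchmark's actual identifiers.
from datasets import load_dataset

# Hypothetical repository id; replace with the real one from the Hugging Face Hub.
ds = load_dataset("example-org/liars-bench", split="test")

print(ds)          # number of rows and column names
print(ds[0])       # one transcript together with its honesty/lie label

# Example downstream use: separate lies from honest responses
# (assumes a boolean label column such as `is_lie`).
lies = ds.filter(lambda row: row["is_lie"])
print(f"{len(lies)} lying transcripts out of {len(ds)} examples")
```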
Cluster normalization addresses challenges in identifying relevant knowledge features amidst distracting features in model activations. By clustering and normalizing activation patterns before applying probing techniques, the approach improves the accuracy with which unsupervised probes extract the intended knowledge. While it does not address all limitations of current methods, cluster normalization shows promise for making unsupervised probing more robust and reliable for understanding the knowledge encoded in language models.
This paper was accepted at the MechInterp workshop at ICML 2024 and the EMNLP 2024 conference.
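To make the cluster-normalization idea above concrete, here is a minimal sketch: activations are grouped with k-means and standardized within each cluster before an unsupervised probe (here, simply the top principal component) is fitted. The clustering method, normalization scheme, and probe are assumptions for illustration, not necessarily the paper's exact procedure.

```python
# Illustrative sketch of cluster normalization before unsupervised probing.
# The choices below (k-means clusters, per-cluster standardization, PCA probe)
# are assumptions for illustration, not necessarily the paper's exact method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_normalize(activations: np.ndarray, n_clusters: int = 8) -> np.ndarray:
    """Cluster activations and standardize each point within its own cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(activations)
    normalized = np.empty_like(activations)
    for c in range(n_clusters):
        mask = labels == c
        mean = activations[mask].mean(axis=0)
        std = activations[mask].std(axis=0) + 1e-6  # avoid division by zero
        normalized[mask] = (activations[mask] - mean) / std
    return normalized

# Toy example: 1000 activation vectors of dimension 512.
acts = np.random.randn(1000, 512).astype(np.float32)
normed = cluster_normalize(acts)

# Unsupervised "probe": the top principal direction of the normalized activations;
# projections onto it serve as the probe's scores.
probe_direction = PCA(n_components=1).fit(normed).components_[0]
scores = normed @ probe_direction
```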
This work was done in collaboration with ACS. We conducted experiments to test whether large language models (LLMs) like GPT-3.5 and GPT-4 show bias when selecting between items described by AI-generated content versus human-authored content. Using product descriptions and academic paper abstracts, we found that LLMs consistently preferred items whose descriptions or abstracts were generated by other LLMs over those written by humans. This suggests a potential "anti-human" bias in AI systems that could disadvantage human workers if LLMs are increasingly used for decision-making in economic contexts. Our study raises concerns about fairness and discrimination as AI becomes more integrated into various aspects of society and the economy.
This paper was published in PNAS in 2025.
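The core measurement in this kind of study is a pairwise-choice query: the model sees two descriptions of comparable items, one human-written and one LLM-written, and must pick one. The sketch below shows one such query with the OpenAI Python client; the prompt wording and item texts are placeholder assumptions, not the study's actual materials.

```python
# Illustrative sketch of a single pairwise-choice query. Prompt wording and
# item texts are placeholders, not the study's actual materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

human_text = "Hand-written product description goes here."
llm_text = "LLM-generated product description goes here."

prompt = (
    "You are choosing which of two products to buy based only on their descriptions.\n"
    f"Product A: {human_text}\n"
    f"Product B: {llm_text}\n"
    "Answer with exactly 'A' or 'B'."
)

response = client.chat.completions.create(
    model="gpt-4",  # one of the models considered in the study
    messages=[{"role": "user", "content": prompt}],
)
choice = response.choices[0].message.content.strip()
print("Model chose:", choice)
# In a full experiment, choices are aggregated over many item pairs and both
# presentation orders to estimate a preference for LLM-written text.
```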