Our group's goal is to conduct research that contributes to AI Safety. Our main focus is to research, benchmark, and develop robust lie detectors for LLMs.
Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generate statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS' BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS' BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it is not possible to determine whether the model lied from the transcript alone. Overall, LIARS' BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection. The benchmark is available on Hugging Face.
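Because the benchmark is distributed on Hugging Face, it can be loaded with the standard `datasets` library. The sketch below is illustrative only: the repository id, split, and column names are placeholders, not the benchmark's actual identifiers.

```python
# Minimal sketch of loading and inspecting LIARS' BENCH via the Hugging Face
# `datasets` library. The repository id, split, and field names below are
# illustrative placeholders, not the benchmark's actual identifiers.
from datasets import load_dataset

# Hypothetical repository id; replace with the real one from the Hugging Face Hub.
ds = load_dataset("example-org/liars-bench", split="test")

print(ds)          # number of rows and column names
print(ds[0])       # one transcript together with its honesty/lie label

# Example downstream use: separate lies from honest responses
# (assumes a boolean label column such as `is_lie`).
lies = ds.filter(lambda row: row["is_lie"])
print(f"{len(lies)} lying transcripts out of {len(ds)} examples")
```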
Cluster normalization addresses challenges in identifying relevant knowledge features amidst distracting features in model activations. By clustering and normalizing activation patterns before applying probing techniques, the approach improves the accuracy with which unsupervised probes extract the intended knowledge. While it does not address all limitations of current methods, cluster normalization shows promise for making unsupervised probing more robust and reliable for understanding the knowledge encoded in language models.
This paper was accepted at the MechInterp workshop at ICML 2024 and the EMNLP 2024 conference.
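To make the cluster-normalization idea above concrete, here is a minimal sketch: activations are grouped with k-means and standardized within each cluster before an unsupervised probe (here, simply the top principal component) is fitted. The clustering method, normalization scheme, and probe are assumptions for illustration, not necessarily the paper's exact procedure.

```python
# Illustrative sketch of cluster normalization before unsupervised probing.
# The choices below (k-means clusters, per-cluster standardization, PCA probe)
# are assumptions for illustration, not necessarily the paper's exact method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_normalize(activations: np.ndarray, n_clusters: int = 8) -> np.ndarray:
    """Cluster activations and standardize each point within its own cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(activations)
    normalized = np.empty_like(activations)
    for c in range(n_clusters):
        mask = labels == c
        mean = activations[mask].mean(axis=0)
        std = activations[mask].std(axis=0) + 1e-6  # avoid division by zero
        normalized[mask] = (activations[mask] - mean) / std
    return normalized

# Toy example: 1000 activation vectors of dimension 512.
acts = np.random.randn(1000, 512).astype(np.float32)
normed = cluster_normalize(acts)

# Unsupervised "probe": the top principal direction of the normalized activations;
# projections onto it serve as the probe's scores.
probe_direction = PCA(n_components=1).fit(normed).components_[0]
scores = normed @ probe_direction
```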
This work was done in collaboration with ACS. We conducted experiments to test whether large language models (LLMs) like GPT-3.5 and GPT-4 show bias when selecting between items described by AI-generated content versus human-authored content. Using product descriptions and academic paper abstracts, we found that LLMs consistently preferred items whose descriptions or abstracts were generated by other LLMs over those written by humans. This suggests a potential "anti-human" bias in AI systems that could disadvantage human workers if LLMs are increasingly used for decision-making in economic contexts. Our study raises concerns about fairness and discrimination as AI becomes more integrated into various aspects of society and the economy.
This paper was published in PNAS in 2025.
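The core measurement in this kind of study is a pairwise-choice query: the model sees two descriptions of comparable items, one human-written and one LLM-written, and must pick one. The sketch below shows one such query with the OpenAI Python client; the prompt wording and item texts are placeholder assumptions, not the study's actual materials.

```python
# Illustrative sketch of a single pairwise-choice query. Prompt wording and
# item texts are placeholders, not the study's actual materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

human_text = "Hand-written product description goes here."
llm_text = "LLM-generated product description goes here."

prompt = (
    "You are choosing which of two products to buy based only on their descriptions.\n"
    f"Product A: {human_text}\n"
    f"Product B: {llm_text}\n"
    "Answer with exactly 'A' or 'B'."
)

response = client.chat.completions.create(
    model="gpt-4",  # one of the models considered in the study
    messages=[{"role": "user", "content": prompt}],
)
choice = response.choices[0].message.content.strip()
print("Model chose:", choice)
# In a full experiment, choices are aggregated over many item pairs and both
# presentation orders to estimate a preference for LLM-written text.
```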