The goal of our group is to do research that contributes to solving AI alignment. Broadly, we aim to work on whatever technical alignment projects have the highest expected value. Our current best ideas for research directions to pursue seem to be in interpretability (though we make an effort to keep our eyes on the ball by also regularly thinking about agent foundations). Interpretability is broad; our research direction is narrower. Our specific goals within interpretability are described in the Research Agenda section below.

Research Output

Research Agenda

1. Specific directions within the search for features

First, we want to understand the natural units in which a neural net performs its computation: we want to develop methods to determine the concepts in terms of which a particular neural net thinks. More specifically, we would like to be able to look at the activations of a neural net on a particular input and tell which concepts 'are active' within the net. To this end, we would like both to develop a conceptual framework for making sense of the above and to run experiments that resolve uncertainties about the fundamentals of that framework. Here are three key uncertainties that we aim to address in our research:

  • Do neural networks have linear representations? In particular, do the concepts in terms of which a neural net thinks correspond to vectors (called features) in activation space, such that the activation on a particular input is a linear combination of the features corresponding to the concepts that are active? Or do all the representations we have identified thus far merely seem linear because we have only had tools for finding ones that do? (Can we even in principle come up with concrete examples of neural nets with nonlinear representations?)
  • Do neural nets store more features than the dimension of the activation space? Is the superposition picture an accurate description of real models — can we find clear empirical evidence of it? Perhaps more precisely, what fraction of dimensions should we expect to be storing features in superposition, and what fraction should we expect to be storing compositional features? How does all this depend on the model component? Which methods can we use to take features out of superposition?
  • What's the geometry and combinatorics of features? For instance, (when) are compositional features orthogonal to each other? Can we understand and accurately model the forces present during training which give rise to certain feature geometries, and leverage this to say nontrivial things about the features? (For instance, can we say something about how many features a certain kind of model might represent?)
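One way to build intuition for the superposition question is a toy numpy sketch (illustrative only; the dimensions, feature counts, and sparsity level are invented and not tied to any particular model). It shows that many more nearly-orthogonal directions than dimensions can coexist in an activation space, and that a sparse combination of such directions can still be decoded approximately by dot products:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 1024  # activation dimension < number of candidate features

# Random unit vectors as stand-ins for feature directions: in high
# dimensions, far more than d of them can be packed nearly orthogonally.
features = rng.standard_normal((n, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Pairwise interference between distinct feature directions stays small.
sims = features @ features.T
max_interference = np.abs(sims[~np.eye(n, dtype=bool)]).max()
print(f"{n} features in {d} dims, max |cos sim| = {max_interference:.3f}")

# With sparse activation, each active feature can still be read off
# approximately by a dot product with its direction.
active = rng.choice(n, size=3, replace=False)
x = features[active].sum(axis=0)          # superposed activation vector
readout = features @ x
recovered = set(np.argsort(-readout)[:3])
print("active features recovered:", recovered == set(active))
```

This is only the easy half of the picture: it shows superposed storage is geometrically possible, not that trained models actually use it or how to find the directions they use.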

We see answering these questions as a crucial step towards the ambitious interpretability goal of completely reverse-engineering a neural net, as well as towards the less ambitious goal of identifying particular circuits by anchoring them on features (e.g. looking for model components that attend to particular features). We also see this as a step towards checking whether a model is thinking about problematic things, which could ideally then be edited out (though we are aware this could be difficult), and, more generally, as a way to remove certain capabilities while hopefully preserving others.
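Under the linear-representation assumption, one simple candidate operation for 'editing out' a concept is to project activations onto the orthogonal complement of its feature direction. The sketch below (with invented shapes and random data, purely for illustration) shows that operation; whether it cleanly removes the concept in a real model is exactly the kind of thing that would need empirical checking:

```python
import numpy as np

def ablate_direction(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project each activation onto the orthogonal complement of `direction`,
    removing whatever is represented along that candidate feature direction
    (under the linear-representation assumption)."""
    v = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ v, v)

# Toy check: after ablation, no activation has any component along v.
rng = np.random.default_rng(0)
acts = rng.standard_normal((10, 64))   # stand-in activations
v = rng.standard_normal(64)            # stand-in feature direction
edited = ablate_direction(acts, v)
residual = np.abs(edited @ (v / np.linalg.norm(v))).max()
print(f"max remaining component along v: {residual:.2e}")
```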

2. Unsupervised methods for finding particular features

Our current main projects also have to do with finding features, but they are relatively agnostic about each of the above theoretical questions: they aim to develop methods to find directions in activation space that capture individual important concepts. We are primarily investigating unsupervised methods for finding concepts for which providing supervised labels would be problematic. The hope is that, even without supervision, finding particular features can be much more tractable than finding all the features, both because we can leverage details about the particular concepts (for instance, searching for features satisfying constraints these concepts are known to satisfy), and because we need fewer contentious assumptions like the ones from the previous item. In particular, we are looking into the following directions:

  • Developing technical improvements to CCS for detecting what a model believes. This includes searching for a feature that satisfies additional constraints, considering ensembling methods for getting more signal out of the data, and looking into ways to remove undesirable truth-shaped properties from activations.
  • Developing tools and creating experiments to evaluate whether CCS and other proposed ELK solutions are indeed discovering a model’s beliefs, as opposed to e.g. what a particular simulacrum believes, what the most likely completion would be if the single sentence were considered in isolation, what humans generally believe, or what is true according to humanity’s best scientific understanding.
  • Looking for other concepts by searching for features that satisfy corresponding constraints. Some concepts are simply difficult to provide any labels for (for instance, a model’s preferences, its utility function, or what an RL agent with only a policy network thinks the value of a position is). Others are plausibly labelable in simple cases, but not in cases that would distinguish them from other concepts the model plausibly represents but which we do not want to detect (for instance, for an intelligent model, we know its likely beliefs about simple domains because we know the truth about those domains, but such labels fail to disambiguate the model’s beliefs about a topic from its beliefs about human beliefs about the topic).
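For concreteness, here is a minimal numpy sketch of the CCS-style objective evaluated on synthetic contrast pairs. The objective form (a consistency term plus a confidence term on paired "Yes"/"No" activations) follows the published CCS idea, but the data-generating setup, dimensions, and probe directions below are all invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_loss(w, x_pos, x_neg):
    """CCS-style objective on contrast pairs:
    consistency: p(x+) should equal 1 - p(x-);
    confidence: at least one of p(x+), p(x-) should be far from 0.5."""
    p_pos, p_neg = sigmoid(x_pos @ w), sigmoid(x_neg @ w)
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))

# --- Toy synthetic data (hypothetical setup) ---
rng = np.random.default_rng(0)
d, n = 32, 200
truth_dir = np.zeros(d)
truth_dir[0] = 1.0                       # hidden "truth" feature direction
labels = rng.integers(0, 2, size=n)      # ground-truth answers (unknown to CCS)
signs = 2.0 * labels - 1.0
x_pos = np.outer(signs, 2.0 * truth_dir) + 0.1 * rng.standard_normal((n, d))
x_neg = np.outer(-signs, 2.0 * truth_dir) + 0.1 * rng.standard_normal((n, d))

# Per-side mean normalization, as in CCS, to strip the trivial
# "ends with Yes vs. ends with No" direction.
x_pos -= x_pos.mean(axis=0)
x_neg -= x_neg.mean(axis=0)

loss_truth = ccs_loss(3.0 * truth_dir, x_pos, x_neg)
loss_rand = ccs_loss(3.0 * rng.standard_normal(d) / np.sqrt(d), x_pos, x_neg)
print(f"loss along truth direction:  {loss_truth:.4f}")
print(f"loss along random direction: {loss_rand:.4f}")
```

A probe along the hidden truth direction attains near-zero loss while a random direction does not, which is why minimizing this objective can locate the direction without labels. Note that this toy setup dodges exactly the evaluation question above: nothing in the objective itself distinguishes the model’s beliefs from other truth-shaped features.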