The goal of our group is to do research that contributes to solving AI alignment. Broadly, we aim to work on whatever technical alignment projects have the highest expected value. Our current best ideas for research directions to pursue seem to be in interpretability (though we make an effort to keep our eyes on the ball by also regularly thinking about agent foundations). Interpretability is broad; our research direction is narrower. Our specific goals within interpretability are described in the Research Agenda section below.

Research Output

Research Agenda

1. Specific directions within the search for features

First, we want to understand the natural units in which a neural net performs its computation: we want to develop methods to determine the concepts in terms of which a particular neural net thinks. More specifically, we would like to be able to look at the activations of a neural net on a particular input and tell which concepts 'are active' within the net. To this end, we would like both to develop a conceptual framework for making sense of the above and to run experiments that resolve uncertainties about the fundamentals of that framework. Here are three key uncertainties that we aim to address in our research:

  • Do neural networks have linear representations? In particular, do the concepts in terms of which a neural net thinks correspond to vectors (called features) in activation space, such that the activation on a particular input is a linear combination of the features corresponding to the concepts that are active? Or do all the representations we have identified thus far merely seem linear because we have only had tools for finding ones that do? (Can we even in principle come up with concrete examples of neural nets with nonlinear representations?)
  • Do neural nets store more features than the dimension of the activation space? Is the superposition picture an accurate description of real models — can we find clear empirical evidence of it? Perhaps more precisely, what fraction of dimensions should we expect to be storing features in superposition, and what fraction should we expect to be storing compositional features? How does all this depend on the model component? Which methods can we use to take features out of superposition?
  • What's the geometry and combinatorics of features? For instance, (when) are compositional features orthogonal to each other? Can we understand and accurately model the forces present during training which give rise to certain feature geometries, and leverage this to say nontrivial things about the features? (For instance, can we say something about how many features a certain kind of model might represent?)
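One way to build intuition for the superposition question is a toy numpy sketch (illustrative only; the dimensions, feature counts, and sparsity level are invented and not tied to any particular model). It shows that many more nearly-orthogonal directions than dimensions can coexist in an activation space, and that a sparse combination of such directions can still be decoded approximately by dot products:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 1024  # activation dimension < number of candidate features

# Random unit vectors as stand-ins for feature directions: in high
# dimensions, far more than d of them can be packed nearly orthogonally.
features = rng.standard_normal((n, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Pairwise interference between distinct feature directions stays small.
sims = features @ features.T
max_interference = np.abs(sims[~np.eye(n, dtype=bool)]).max()
print(f"{n} features in {d} dims, max |cos sim| = {max_interference:.3f}")

# With sparse activation, each active feature can still be read off
# approximately by a dot product with its direction.
active = rng.choice(n, size=3, replace=False)
x = features[active].sum(axis=0)          # superposed activation vector
readout = features @ x
recovered = set(np.argsort(-readout)[:3])
print("active features recovered:", recovered == set(active))
```

This is only the easy half of the picture: it shows superposed storage is geometrically possible, not that trained models actually use it or how to find the directions they use.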

We see answering these questions as a crucial step towards the ambitious interpretability goal of completely reverse-engineering a neural net, as well as towards the less ambitious goal of identifying particular circuits by anchoring them on features (e.g. looking for model components that attend to particular features). We also see this as a step towards checking whether a model is thinking about problematic things, which could ideally then be edited out (though we are aware this could be difficult), and, more generally, as a way to remove certain capabilities while hopefully preserving others.
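Under the linear-representation assumption, one simple candidate operation for 'editing out' a concept is to project activations onto the orthogonal complement of its feature direction. The sketch below (with invented shapes and random data, purely for illustration) shows that operation; whether it cleanly removes the concept in a real model is exactly the kind of thing that would need empirical checking:

```python
import numpy as np

def ablate_direction(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project each activation onto the orthogonal complement of `direction`,
    removing whatever is represented along that candidate feature direction
    (under the linear-representation assumption)."""
    v = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ v, v)

# Toy check: after ablation, no activation has any component along v.
rng = np.random.default_rng(0)
acts = rng.standard_normal((10, 64))   # stand-in activations
v = rng.standard_normal(64)            # stand-in feature direction
edited = ablate_direction(acts, v)
residual = np.abs(edited @ (v / np.linalg.norm(v))).max()
print(f"max remaining component along v: {residual:.2e}")
```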

2. Unsupervised methods for finding particular features

Our current main projects also have to do with finding features, but they are relatively agnostic about each of the above theoretical questions: they aim to develop methods to find directions in activation space that capture individual important concepts. We are primarily investigating unsupervised methods for finding concepts for which providing supervised labels would be problematic. The hope is that, even without supervision, finding particular features can be much more tractable than finding all the features, both because we can leverage details about the particular concepts (for instance, searching for features satisfying constraints these concepts are known to satisfy), and because we need fewer contentious assumptions like the ones from the previous item. In particular, we are looking into the following directions:

  • Developing technical improvements to CCS for detecting what a model believes. This includes searching for a feature that satisfies additional constraints, considering ensembling methods for getting more signal out of the data, and looking into ways to remove undesirable truth-shaped properties from activations.
  • Developing tools and creating experiments to evaluate whether CCS and other proposed ELK solutions are indeed discovering a model’s beliefs, as opposed to e.g. what a particular simulacrum believes, what the most likely completion would be if the single sentence were considered in isolation, what humans generally believe, or what is true according to humanity’s best scientific understanding.
  • Looking for other concepts by searching for features that satisfy corresponding constraints. Some concepts are simply difficult to provide any labels for (for instance, a model’s preferences, its utility function, or what an RL agent with only a policy network thinks the value of a position is). Others are plausibly labelable in simple cases, but not in cases that would distinguish them from other concepts the model plausibly represents but which we do not want to detect (for instance, for an intelligent model, we know its likely beliefs about simple domains because we know the truth about those domains, but such labels fail to disambiguate the model’s beliefs about a topic from its beliefs about human beliefs about the topic).
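For concreteness, here is a minimal numpy sketch of the CCS-style objective evaluated on synthetic contrast pairs. The objective form (a consistency term plus a confidence term on paired "Yes"/"No" activations) follows the published CCS idea, but the data-generating setup, dimensions, and probe directions below are all invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_loss(w, x_pos, x_neg):
    """CCS-style objective on contrast pairs:
    consistency: p(x+) should equal 1 - p(x-);
    confidence: at least one of p(x+), p(x-) should be far from 0.5."""
    p_pos, p_neg = sigmoid(x_pos @ w), sigmoid(x_neg @ w)
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))

# --- Toy synthetic data (hypothetical setup) ---
rng = np.random.default_rng(0)
d, n = 32, 200
truth_dir = np.zeros(d)
truth_dir[0] = 1.0                       # hidden "truth" feature direction
labels = rng.integers(0, 2, size=n)      # ground-truth answers (unknown to CCS)
signs = 2.0 * labels - 1.0
x_pos = np.outer(signs, 2.0 * truth_dir) + 0.1 * rng.standard_normal((n, d))
x_neg = np.outer(-signs, 2.0 * truth_dir) + 0.1 * rng.standard_normal((n, d))

# Per-side mean normalization, as in CCS, to strip the trivial
# "ends with Yes vs. ends with No" direction.
x_pos -= x_pos.mean(axis=0)
x_neg -= x_neg.mean(axis=0)

loss_truth = ccs_loss(3.0 * truth_dir, x_pos, x_neg)
loss_rand = ccs_loss(3.0 * rng.standard_normal(d) / np.sqrt(d), x_pos, x_neg)
print(f"loss along truth direction:  {loss_truth:.4f}")
print(f"loss along random direction: {loss_rand:.4f}")
```

A probe along the hidden truth direction attains near-zero loss while a random direction does not, which is why minimizing this objective can locate the direction without labels. Note that this toy setup dodges exactly the evaluation question above: nothing in the objective itself distinguishes the model’s beliefs from other truth-shaped features.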