Can OpenAI’s GPT-4 help make AI safer? The company’s large language model tried to explain GPT-2 neurons.
In a recent paper, OpenAI shows how AI can help interpret the internal workings of large language models. The team used GPT-4 to generate and evaluate explanations for neurons from its older predecessor, GPT-2. The work is part of OpenAI’s alignment research, which aims to help better understand and guide the behavior of AI systems.
OpenAI’s methodology involves three steps:
- Explain with GPT-4: Given a GPT-2 neuron, GPT-4 is shown relevant text sequences together with the neuron’s activations and generates a natural-language explanation of its behavior.
- Simulate with GPT-4: GPT-4 then simulates the activations that a neuron matching this explanation would produce on the same texts.
- Score: The generated explanation is scored by how well the simulated activations match the actual activations of the GPT-2 neuron.
The process yields a natural-language explanation of a GPT-2 neuron’s function, such as “Fires when referring to movies, characters, and entertainment”.
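The three steps above can be sketched roughly as follows. Note that `ask_gpt4`, the example snippets, and the random simulated activations are placeholders for illustration, not OpenAI’s actual API calls or scoring code (the paper’s scorer predicts activations token by token):

```python
# Hedged sketch of the explain -> simulate -> score loop described above.
# ask_gpt4() is a stand-in for a real GPT-4 call.
import random

def ask_gpt4(prompt: str) -> str:
    """Placeholder for a GPT-4 call; returns a canned answer here."""
    return "fires when referring to movies, characters, and entertainment"

def explain_neuron(snippets, activations):
    """Step 1: show GPT-4 (text, activation) pairs, ask for an explanation."""
    shown = "\n".join(f"{t} -> {a:.2f}" for t, a in zip(snippets, activations))
    return ask_gpt4(f"Explain this neuron's behavior:\n{shown}")

def simulate_activations(explanation, snippets):
    """Step 2: predict what a neuron matching the explanation would output.
    Stand-in: random values in place of GPT-4's per-token predictions."""
    return [random.random() for _ in snippets]

def score(real, simulated):
    """Step 3: score the explanation by how well simulated activations
    correlate with the real ones (Pearson correlation, in [-1, 1])."""
    n = len(real)
    mr, ms = sum(real) / n, sum(simulated) / n
    cov = sum((r - mr) * (s - ms) for r, s in zip(real, simulated))
    sd_r = sum((r - mr) ** 2 for r in real) ** 0.5
    sd_s = sum((s - ms) ** 2 for s in simulated) ** 0.5
    return cov / (sd_r * sd_s) if sd_r and sd_s else 0.0
```

A perfectly explained neuron would yield a score near 1.0; the article’s 0.8 threshold would be applied to scores of this kind.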
OpenAI’s GPT-4 does a worse job of explaining GPT-2 than humans
The team found that the larger the language model being explained, the worse this method works. One reason may be that neurons in later layers are more difficult to explain. However, the team was able to improve the generated explanations somewhat with techniques such as iterative refinement of explanations. GPT-4 also produced better explanations than smaller models, but still worse than humans.
Using GPT-4, the team generated explanations for all 307,200 neurons in GPT-2. Among them, they found 1,000 neuron explanations with a score of at least 0.8, meaning that, according to OpenAI, they account for most of the neuron’s activation behavior.
According to OpenAI, the methodology has many problems, such as its inability to explain complex neuronal behavior or downstream effects of activations. In addition, it is questionable whether a natural language explanation is possible for all neurons, and the approach does not provide a mechanistic explanation for the behavior of GPT-2 neurons, “which could cause our understanding to generalize incorrectly.”
OpenAI’s alignment research relies on AI assistants
The goal of the research is automatic interpretability methods that the company plans to use to check whether language models are misaligned. Of particular importance is the detection of examples of goal misgeneralization or deceptive alignment, “when the model acts aligned when being evaluated but would pursue different goals during deployment.” Detecting this requires a deep understanding of internal behavior.
In this work, OpenAI used a more powerful model to explain a weaker one, which could cause problems if it is unclear whether the assistant model itself is trustworthy. “We hope that using smaller trustworthy models for assistance will either scale to a full interpretability audit, or applying them to interpretability will teach us enough about how models work to help us develop more robust auditing methods.”