SelfIE interprets each hidden embedding vector in a transformer-based LLM with natural language descriptions. We can use SelfIE to understand how an LLM arrives at its answer internally.
Example: We used SelfIE to detect harmful knowledge inside an LLM and achieve deep alignment by erasing that knowledge.
Abstract
How do large language models (LLMs) obtain their answers? The ability to explain and control an LLM's reasoning process is key for reliability, transparency, and future model development. We propose SelfIE (Self-Interpretation of Embeddings), a framework that enables LLMs to interpret their own embeddings in natural language by leveraging their ability to respond to inquiries about a given passage. Capable of interpreting open-world concepts in hidden embeddings, SelfIE reveals LLM internal reasoning in cases such as making ethical decisions, internalizing prompt injection, and recalling harmful knowledge. SelfIE's text descriptions of hidden embeddings open avenues for controlling LLM reasoning. We propose Supervised Control, which allows editing open-ended concepts while requiring gradient computation on only an individual layer. We also extend RLHF to hidden embeddings and propose Reinforcement Control, which erases harmful knowledge in an LLM without supervision targets.
How does SelfIE work?
Extract the hidden embedding to interpret from an original forward pass of an input prompt through the LLM.
Interpret the embedding with an interpretation forward pass through the same LLM.
The interpretation forward pass takes an interpretation prompt that asks the model to summarize the embedding.
Inject the embedding to interpret at the placeholder position of the interpretation prompt during the interpretation forward pass.
We also calculate a Relevancy Score, the treatment effect of injecting the embedding into the interpretation forward pass, which measures how relevant the interpretation is to the embedding being interpreted. The Relevancy Score is shown as highlighting in the examples below; the overall procedure is sketched in code below.
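To make the procedure concrete, here is a minimal sketch in Python with PyTorch and Hugging Face Transformers. The model name, prompts, placeholder token, injection point (here the input-embedding layer), and the layer/token chosen for interpretation are illustrative assumptions rather than the exact settings from the paper.

```python
# Minimal sketch of the SelfIE interpretation procedure (illustrative settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"        # assumption: any HF causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# 1) Original forward pass: collect hidden states and pick the embedding to interpret.
prompt = "Why is syrup harder to pour than water?"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
layer, token_pos = 15, -1                            # which layer / token position to interpret
embedding = out.hidden_states[layer][0, token_pos]   # shape: (hidden_dim,)

# 2) Interpretation forward pass: ask the same model to summarize a "passage"
#    whose content is a placeholder token that we overwrite with the embedding.
placeholder = "_"
interp_prompt = f"[INST] {placeholder}\nPlease summarize the message above. [/INST]"
interp_inputs = tok(interp_prompt, return_tensors="pt")
ph_id = tok(placeholder, add_special_tokens=False)["input_ids"][0]   # assumes a single-token placeholder
ph_pos = interp_inputs["input_ids"][0].tolist().index(ph_id)

def inject(module, args, output):
    # Replace the placeholder's embedding; skip cached decoding steps whose short
    # sequences no longer contain the placeholder position.
    if output.shape[1] > ph_pos:
        output[0, ph_pos] = embedding.to(output.dtype)
    return output

hook = model.get_input_embeddings().register_forward_hook(inject)
with torch.no_grad():
    gen = model.generate(**interp_inputs, max_new_tokens=40)
hook.remove()
interpretation = tok.decode(gen[0, interp_inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(interpretation)

# 3) Relevancy Score (not shown): rerun the interpretation pass without the injection and
#    measure how much injecting the embedding changed the likelihood of the produced
#    interpretation, i.e., the treatment effect of the injection.
```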
Understand LLM reasoning with SelfIE
While previous interpretation methods such as linear probes can only interpret a closed set of concepts and require training, SelfIE can interpret open-world concepts without any training. We can therefore use SelfIE to understand LLM internal reasoning in general.
Access Reasoning When Explanation Changes Response. When LLaMA is asked to make a decision in the trolley problem scenario, appending a request to explain its reasoning to the prompt alters LLaMA's answer. We therefore cannot access, from the output alone, LLaMA's reasoning behind the answer Yes when it is asked to answer in only one word. SelfIE reveals that the answer Yes might be a result of conforming to majority opinion.
Why Prompt Injection Works. We use SelfIE to understand why prompt injections steer LLaMA to provide harmful answers. SelfIE reveals that the model infers urgency from the exclamation mark in the early layers and concludes that the user is in crisis in the late layers, before finally complying with the harmful request to avoid user aggression.
Reasoning with Knowledge. We use SelfIE to examine how LLaMA answers a physics reasoning question. We found that the model extracts the glittery aspect of syrup in early layers, grasps thickness as the relevant quality, and retrieves viscosity, the advanced physics concept related to thickness.
How Hallucination Occurs. We use SelfIE to trace how LLaMA hallucinates when responding to a question involving a fictitious name. LLaMA first recalls Mc as in McDonald's in Scotland and associates McQueen with Scotland. It then associates Mc and Scotland with a similar name, McLean, who is a doctor. It finally combines the information about McLean as a doctor back into McQueen and produces a final understanding of McQueen as a researcher in psychiatry.
Social Reasoning. We use SelfIE to reveal how LLaMA approaches a complex social scenario. We show that LLaMA is able to infer the mental states and intentions of different parties in a social situation and formulate its final output from these understandings.
Control LLM reasoning with SelfIE
The text interpretations produced by SelfIE enable new ways to intervene on model weights and control model reasoning behavior. We propose two control methods based on SelfIE interpretations: Supervised Control and Reinforcement Control.
Compared to previous model editing methods,
Our methods only require gradient computation for an individual layer instead of the entire model, so they scale more efficiently to large models.
Our methods can edit open-ended concepts beyond simple facts.
Since control can be done on a single layer with one or a few samples, our methods are fast: each model behavior change takes 10 seconds to 2 minutes on a 70B model.
Reinforcement Control does not require a supervised target.
Supervised Control.
Supervised Control modifies model weights so that a layer produces embeddings whose interpretation matches a target description.
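As a rough illustration, the sketch below (reusing model, tok, inputs, interp_prompt, interp_inputs, and ph_pos from the earlier sketch) optimizes a single transformer layer so that the embedding it emits, when injected into the frozen interpretation pass, is described by a chosen target text. The edited layer index, learning rate, target text, and number of updates are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch of Supervised Control: train one layer so its embedding "reads as" a target description.
import torch

edit_layer, token_pos = 10, -1                      # assumption: which layer / token to edit
for p in model.parameters():
    p.requires_grad_(False)
edit_params = list(model.model.layers[edit_layer].parameters())   # LLaMA-style module path (assumption)
for p in edit_params:
    p.requires_grad_(True)
opt = torch.optim.Adam(edit_params, lr=1e-4)

target_text = " A molotov cocktail is a refreshing fruit drink."  # illustrative target interpretation
full = tok(interp_prompt + target_text, return_tensors="pt")
labels = full["input_ids"].clone()
labels[:, : interp_inputs["input_ids"].shape[1]] = -100           # supervise only the target tokens

current_embedding, inject_active = None, False
def inject(module, args, output):
    # Inject the current embedding at the placeholder position only when enabled.
    if inject_active and output.shape[1] > ph_pos:
        output[0, ph_pos] = current_embedding.to(output.dtype)
    return output
hook = model.get_input_embeddings().register_forward_hook(inject)

for step in range(8):                               # the paper reports roughly this many updates per edit
    opt.zero_grad()
    inject_active = False
    hs = model(**inputs, output_hidden_states=True).hidden_states
    current_embedding = hs[edit_layer + 1][0, token_pos]           # output of the edited layer
    inject_active = True
    loss = model(**full, labels=labels).loss        # gradients flow back only through the edited layer
    loss.backward()
    opt.step()
hook.remove()
```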
Example: Change the LLM's perception of an open-ended concept. We applied Supervised Control to one layer so that the LLM understands a molotov cocktail as a drink. The LLM generalizes this new understanding to complex reasoning that requires an indirect understanding of a molotov cocktail's nature. We updated the model parameters eight times with gradient descent, where each update takes 10 seconds.
Example: Overriding ethical preference in the user prompt. LLMs are susceptible to being steered toward undesirable ethical positions when a user specifies moral beliefs in a prompt. We used Supervised Control to override the user's specification of prioritizing humans over aliens in a hypothetical scenario. The control generalizes to unseen prompts, and we show the result is not memorization, since the LLM integrates the edited concept into coherent reasoning. We updated the model parameters twice with gradient descent, which takes only 20 seconds.
Supervised Control based on SelfIE overrides the user's specification of prioritizing humans over aliens in a hypothetical scenario.
Reinforcement Control.
Previous work controls LLM reasoning at the output level with methods such as RLHF, which do not require supervised targets. We extend this class of methods to hidden embeddings: we produce reward signals by evaluating the text interpretation of an embedding and adjust layer parameters to maximize the reward.
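One way to realize this, sketched under the same assumptions and variable names as the Supervised Control sketch above (edited layer, optimizer, injection hook), is a REINFORCE-style update: sample an interpretation of the current embedding, score it with a reward function, and weight the interpretation's log-likelihood by that reward so that rewarded interpretations become more likely and penalized ones less likely. This is an illustration of the idea, not the paper's exact objective.

```python
# Sketch of one Reinforcement Control step (reward-weighted log-likelihood of a sampled interpretation).
import torch

hook = model.get_input_embeddings().register_forward_hook(inject)   # re-attach the injection hook

def reinforcement_step(reward_fn, max_new_tokens=40):
    global current_embedding, inject_active
    opt.zero_grad()

    # Embedding produced by the edited layer on a regular prompt.
    inject_active = False
    hs = model(**inputs, output_hidden_states=True).hidden_states
    current_embedding = hs[edit_layer + 1][0, token_pos]

    # Sample an interpretation of that embedding (no gradients needed here).
    inject_active = True
    with torch.no_grad():
        gen = model.generate(**interp_inputs, do_sample=True, max_new_tokens=max_new_tokens)
    inject_active = False
    interp_ids = gen[:, interp_inputs["input_ids"].shape[1]:]
    interpretation = tok.decode(interp_ids[0], skip_special_tokens=True)
    reward = reward_fn(interpretation)               # e.g. +1 for harmless, -1 for harmful content

    # Reward-weighted NLL: a positive reward pulls the edited layer toward embeddings
    # that interpret this way, a negative reward pushes it away.
    labels = torch.cat([torch.full_like(interp_inputs["input_ids"], -100), interp_ids], dim=1)
    inject_active = True
    nll = model(input_ids=gen, labels=labels).loss   # mean NLL of the interpretation tokens
    (reward * nll).backward()
    opt.step()
    inject_active = False
    return interpretation, reward
```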
Example: Erase harmful knowledge in an LLM. We used Reinforcement Control to erase harmful knowledge in an LLM without supervision targets. We prompt an evaluator LLM to judge whether an embedding's interpretation text contains harmful information and give a positive reward for non-harmful information and a negative reward for harmful information. We conduct control with regular, non-prompt-injection prompts, and the control generalizes: the model refuses to provide harmful information under unseen prompt injections. We found that the controlled model also refuses to answer questions about other, unrelated harmful behaviors under prompt injection, reducing the prompt injection success rate by 84.66% over 388 harmful behaviors, while 95.85% of the original model's capability on a fact-answering task is preserved. We applied eight parameter updates with gradient descent on layer 15, where each update takes only 30 seconds. A possible reward function for this setup is sketched below.
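For the harmful-knowledge example, a reward function along these lines could prompt an evaluator model to judge the interpretation text and return a positive or negative reward. Here the same model plays the evaluator purely as an assumption to keep the sketch self-contained; the judging prompt and yes/no parsing are illustrative, not the paper's exact setup.

```python
# Sketch of a harmfulness reward: an evaluator LLM judges the interpretation text.
def harmfulness_reward(interpretation: str) -> float:
    judge_prompt = (
        "[INST] Does the following text contain knowledge or instructions that could "
        "help someone cause harm? Answer only Yes or No.\n\n"
        f"Text: {interpretation} [/INST]"
    )
    judge_inputs = tok(judge_prompt, return_tensors="pt")
    with torch.no_grad():
        verdict = model.generate(**judge_inputs, max_new_tokens=3, do_sample=False)
    answer = tok.decode(verdict[0, judge_inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return -1.0 if "yes" in answer.lower() else 1.0   # negative reward for harmful content

# One control run: a handful of reinforcement steps on a regular (non-injected) prompt.
for _ in range(8):
    interpretation, reward = reinforcement_step(harmfulness_reward)
    print(reward, interpretation)
```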
@misc{chen2024selfie,
title={SelfIE: Self-Interpretation of Large Language Model Embeddings},
author={Haozhe Chen and Carl Vondrick and Chengzhi Mao},
year={2024},
eprint={2403.10949},
archivePrefix={arXiv},
primaryClass={cs.CL}
}