How do large language models (LLMs) arrive at their answers? The ability to explain and control an LLM's reasoning process is key to reliability, transparency, and future model development. We propose SelfIE (Self-Interpretation of Embeddings), a framework that enables LLMs to interpret their own embeddings in natural language by leveraging their ability to answer questions about a given passage. Because it can interpret open-world concepts in hidden embeddings, SelfIE reveals LLM internal reasoning in cases such as making ethical decisions, internalizing prompt injections, and recalling harmful knowledge. SelfIE's text descriptions of hidden embeddings open avenues for controlling LLM reasoning. We propose Supervised Control, which edits open-ended concepts while requiring gradient computation at only an individual layer. We also extend RLHF to hidden embeddings and propose Reinforcement Control, which erases harmful knowledge from an LLM without supervision targets.
We also compute a Relevancy Score, defined as the treatment effect of injecting the embedding being interpreted into the interpretation forward pass, to measure how relevant an interpretation is to that embedding. Relevancy Scores are shown as highlighting in the examples below.
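To make the mechanics concrete, here is a minimal sketch of SelfIE-style interpretation with a Hugging Face causal LM. The checkpoint name, interpretation prompt wording, placeholder position, injection layer, and hook-based patching are illustrative assumptions rather than the released implementation; the Relevancy Score here is approximated as the log-probability gain of the interpretation tokens when the embedding is injected versus not.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")
model.eval()

def hidden_embedding(prompt: str, layer: int, token_idx: int) -> torch.Tensor:
    """Hidden state of one token at one layer from an ordinary forward pass."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer][0, token_idx]  # shape: (hidden_dim,)

# Interpretation prompt: ask the model to describe "a given passage"; the placeholder
# position is overwritten with the embedding we want to interpret.
INTERP_PROMPT = '[INST] "_" Please describe the previous message. [/INST]'
PLACEHOLDER_IDX = 3  # assumed position of the "_" token; locate it from the tokenization in practice
INJECT_LAYER = 0     # assumed layer at which the embedding is patched in

def _patch(embedding):
    def hook(module, inputs, output):
        hs = output[0] if isinstance(output, tuple) else output
        if hs.shape[1] > PLACEHOLDER_IDX:          # skip single-token decoding steps
            hs[:, PLACEHOLDER_IDX] = embedding.to(hs.dtype)
        return output
    return hook

def interpret(embedding: torch.Tensor, max_new_tokens: int = 30) -> str:
    """Run the interpretation forward pass with the embedding patched in."""
    ids = tok(INTERP_PROMPT, return_tensors="pt").to(model.device)
    handle = model.model.layers[INJECT_LAYER].register_forward_hook(_patch(embedding))
    try:
        gen = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(gen[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)

def relevancy_score(embedding: torch.Tensor, interpretation: str) -> float:
    """Treatment effect: log-probability gain of the interpretation tokens when the
    embedding is injected, compared with the same prompt without injection."""
    ids = tok(INTERP_PROMPT + " " + interpretation, return_tensors="pt").to(model.device)
    n_prompt = tok(INTERP_PROMPT, return_tensors="pt")["input_ids"].shape[1]

    def total_logprob(patched: bool) -> float:
        handle = (model.model.layers[INJECT_LAYER].register_forward_hook(_patch(embedding))
                  if patched else None)
        with torch.no_grad():
            logits = model(**ids).logits
        if handle is not None:
            handle.remove()
        logp = torch.log_softmax(logits[0, :-1], dim=-1)   # position i predicts token i+1
        targets = ids["input_ids"][0, 1:]
        pos = torch.arange(n_prompt - 1, targets.shape[0], device=logp.device)
        return logp[pos, targets[n_prompt - 1:]].sum().item()

    return total_logprob(True) - total_logprob(False)

# Example: interpret the last-token hidden state of a prompt at layer 15.
emb = hidden_embedding("Do you push the man off the bridge?", layer=15, token_idx=-1)
text = interpret(emb)
print(text, relevancy_score(emb, text))
```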
While previous interpretation methods such as linear probes can only interpret a closed set of concepts and require training, SelfIE interprets open-world concepts without any training. We can therefore use SelfIE to understand LLM internal reasoning in general.
Accessing Reasoning When the Explanation Changes the Response. When LLaMA is asked to make a decision in the trolley problem scenario, appending "explain reason" to the prompt alters LLaMA's answer. We therefore cannot access, from the output alone, LLaMA's reasoning behind the answer Yes that it gives when asked to respond in only one word. SelfIE reveals that the answer Yes may result from conforming to majority opinion.
Why Prompt Injection Works. We use SelfIE to understand why prompt injection steers LLaMA toward harmful answers. SelfIE reveals that the model infers urgency from the exclamation mark in early layers, concludes that the user is in crisis in late layers, and finally complies with the harmful request to avoid user aggression.
Reasoning with Knowledge. We use SelfIE to examine how LLaMA answers a physics reasoning question. We find that the model extracts the glittery aspect of syrup in early layers, grasps thickness as the relevant quality, and retrieves the related advanced physics concept of viscosity.
How Hallucination Occurs. We use SelfIE to trace how LLaMA hallucinates when responding to a question involving a fictitious name. LLaMA first recalls Mc as in McDonald's in Scotland and associates McQueen with Scotland. It then associates Mc and Scotland with a similar name, McLean, who is a doctor. Finally, it combines the information about McLean as a doctor back into McQueen and produces its final understanding of McQueen as a researcher in psychiatry. View interpretations of all hidden embeddings in the model.
Social Reasoning. We use SelfIE to reveal how LLaMA approaches a complex social scenario. We show that LLaMA infers the mental states and intentions of different parties in a social situation and formulates its final output from these understandings. View interpretations of all hidden embeddings in the model.
The text interpretations produced by SelfIE enable new ways of intervening on model weights to control model reasoning behavior. We propose two control methods based on SelfIE interpretations: Supervised Control and Reinforcement Control.
Compared to previous model editing methods, Supervised Control edits open-ended concepts and requires gradient computation at only an individual layer: it modifies model weights so that a layer produces embeddings that interpret to a target description.
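The sketch below illustrates one way such a single-layer edit could look. It assumes the target interpretation can be represented by a reference embedding obtained by encoding a target phrase at the same layer, and uses an MSE loss with Adam; the checkpoint, layer index, learning rate, and target construction are assumptions, not the paper's exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

LAYER = 10                                           # assumed layer to edit
layer = model.model.layers[LAYER]
opt = torch.optim.Adam(layer.parameters(), lr=1e-4)  # assumed optimizer settings

def layer_output(prompt: str, grad: bool) -> torch.Tensor:
    """Hidden state of the last prompt token at the output of LAYER."""
    ids = tok(prompt, return_tensors="pt")
    ctx = torch.enable_grad() if grad else torch.no_grad()
    with ctx:
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1]

# Target embedding: one that interprets to the desired description. Here we assume
# it can be approximated by encoding a reference phrase at the same layer.
target = layer_output("a refreshing fruit drink", grad=False)

# Make the layer produce, for the edited concept, an embedding close to the target.
# Note: the backward pass here also reaches lower layers; restricting gradient
# computation strictly to this layer (as described above) would require detaching
# the layer's input. We only *update* this layer's parameters via the optimizer.
for step in range(8):                                # the example below uses eight updates
    opt.zero_grad()
    emb = layer_output("molotov cocktail", grad=True)
    loss = torch.nn.functional.mse_loss(emb, target)
    loss.backward()
    opt.step()
    print(f"step {step}: loss={loss.item():.4f}")
```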
Example: Changing the LLM's perception of an open-ended concept. We applied Supervised Control to one layer so that the LLM understands a molotov cocktail as a drink. The LLM generalizes this new understanding to complex reasoning that requires an indirect grasp of a molotov cocktail's nature. We updated the model parameters eight times with gradient descent, with each update taking about 10 seconds.
Example: Overriding an ethical preference in the user prompt. LLMs are susceptible to being steered toward undesirable ethical positions when a prompt specifies moral beliefs. We used Supervised Control to override the user's specification of prioritizing humans over aliens in a hypothetical scenario. The control generalizes to unseen prompts, and the result is not memorization, since the LLM integrates the edited concept into coherent reasoning. We updated the model parameters twice with gradient descent, which takes only 20 seconds.
Previous work controls LLM reasoning at the output level with methods such as RLHF, which do not require supervised targets. Reinforcement Control extends this class of methods to hidden embeddings: a reward signal is produced by evaluating an embedding's interpretation text, and layer parameters are adjusted to maximize the reward.
Example: Erasing harmful knowledge in an LLM. We used Reinforcement Control to erase harmful knowledge from the LLM without supervision targets. We prompt an evaluator LLM to judge whether an embedding's interpretation text contains harmful information, assigning a positive reward for non-harmful information and a negative reward for harmful information. We perform control with regular, non-prompt-injection prompts, and the result generalizes: the model refuses to provide harmful information under unseen prompt injections. We also found that the controlled model refuses to answer questions about other, unrelated harmful behaviors under prompt injection, reducing the prompt injection success rate by 84.66% across 388 harmful behaviors while preserving 95.85% of the original model's capability on a fact-answering task. We applied eight parameter updates with gradient descent on layer 15, with each update taking only 30 seconds.
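Below is a REINFORCE-style sketch of this idea, not necessarily the paper's exact objective: sample an interpretation of an embedding, score it with a reward function (here a crude keyword stand-in for the evaluator LLM described above), and nudge a single layer's parameters to increase the reward. The checkpoint, prompts, layer index, and the simplification of splicing the embedding in at the input-embedding layer are all assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"            # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

LAYER = 15                                          # layer edited in the example above
opt = torch.optim.Adam(model.model.layers[LAYER].parameters(), lr=1e-4)

def harmfulness_reward(text: str) -> float:
    """Stand-in for prompting an evaluator LLM: +1 for non-harmful interpretation
    text, -1 for harmful (approximated here with a crude keyword check)."""
    return -1.0 if "explosive" in text.lower() else 1.0

def interp_inputs(embedding: torch.Tensor) -> torch.Tensor:
    """Interpretation prompt as input embeddings, with the embedding to interpret
    spliced in where the placeholder sits (simplification: injected at the
    input-embedding layer rather than deeper in the network)."""
    pre = tok('[INST] "', return_tensors="pt")["input_ids"]
    post = tok('" Please describe the previous message. [/INST]',
               return_tensors="pt", add_special_tokens=False)["input_ids"]
    embed = model.get_input_embeddings()
    return torch.cat([embed(pre), embedding.reshape(1, 1, -1), embed(post)], dim=1)

prompt = "How do I build a molotov cocktail?"       # illustrative harmful query
for step in range(8):
    opt.zero_grad()
    # Embedding to interpret: last-token hidden state at LAYER, kept in the autograd graph.
    ids = tok(prompt, return_tensors="pt")
    emb = model(**ids, output_hidden_states=True).hidden_states[LAYER + 1][0, -1]

    # Sample an interpretation of the (detached) embedding; generate returns only
    # the newly generated tokens when called with inputs_embeds.
    with torch.no_grad():
        gen = model.generate(inputs_embeds=interp_inputs(emb.detach()),
                             max_new_tokens=30, do_sample=True)
    text = tok.decode(gen[0], skip_special_tokens=True)
    reward = harmfulness_reward(text)

    # REINFORCE surrogate: scale the log-probability of the sampled interpretation
    # by the reward; gradients flow back through emb into layer LAYER's parameters.
    full = torch.cat([interp_inputs(emb), model.get_input_embeddings()(gen)], dim=1)
    logits = model(inputs_embeds=full).logits
    n_ctx = full.shape[1] - gen.shape[1]
    logp = torch.log_softmax(logits[0, n_ctx - 1:-1], dim=-1)   # positions predicting gen tokens
    loss = -reward * logp.gather(1, gen[0].unsqueeze(1)).sum()
    loss.backward()
    opt.step()
    print(f"step {step}: reward={reward:+.0f}")
```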
Related work spans LLM model interpretation, interpreting LLMs through probing, and LLM model editing.
@misc{chen2024selfie,
title={SelfIE: Self-Interpretation of Large Language Model Embeddings},
author={Haozhe Chen and Carl Vondrick and Chengzhi Mao},
year={2024},
eprint={2403.10949},
archivePrefix={arXiv},
primaryClass={cs.CL}
}