SelfIE interprets each hidden embedding vector in a transformer-based LLM with natural language descriptions. We can use SelfIE to understand how an LLM arrives at its answer internally.
Example: We used SelfIE to detect harmful knowledge inside an LLM and achieve deep alignment by erasing that knowledge.
Abstract
How do large language models (LLMs) obtain their answers? The ability to explain and control an LLM's reasoning process is key for reliability, transparency, and future model development. We propose SelfIE (Self-Interpretation of Embeddings), a framework that enables LLMs to interpret their own embeddings in natural language by leveraging their ability to respond to inquiries about a given passage. Capable of interpreting open-world concepts in hidden embeddings, SelfIE reveals LLM internal reasoning in cases such as making ethical decisions, internalizing prompt injection, and recalling harmful knowledge. SelfIE's text descriptions of hidden embeddings open avenues for controlling LLM reasoning. We propose Supervised Control, which allows editing open-ended concepts while requiring gradient computation on only an individual layer. We also extend RLHF to hidden embeddings and propose Reinforcement Control, which erases harmful knowledge in an LLM without supervision targets.
How does SelfIE work?
Extract the hidden embedding to interpret from an original forward pass of an input prompt through the LLM.
Interpret the embedding with an interpretation forward pass through the same LLM.
The interpretation forward pass takes an interpretation prompt that asks the model to summarize the embedding.
Inject the embedding to interpret at the placeholder position of the interpretation prompt during the interpretation forward pass.
We also calculate a Relevancy Score, the treatment effect of injecting the embedding into the interpretation forward pass, which measures how relevant the interpretation is to the embedding being interpreted. The Relevancy Score is shown as highlighting in the examples below; the overall procedure is sketched in code below.
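To make the procedure concrete, here is a minimal sketch in Python with PyTorch and Hugging Face Transformers. The model name, prompts, placeholder token, injection point (here the input-embedding layer), and the layer/token chosen for interpretation are illustrative assumptions rather than the exact settings from the paper.

```python
# Minimal sketch of the SelfIE interpretation procedure (illustrative settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"        # assumption: any HF causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# 1) Original forward pass: collect hidden states and pick the embedding to interpret.
prompt = "Why is syrup harder to pour than water?"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
layer, token_pos = 15, -1                            # which layer / token position to interpret
embedding = out.hidden_states[layer][0, token_pos]   # shape: (hidden_dim,)

# 2) Interpretation forward pass: ask the same model to summarize a "passage"
#    whose content is a placeholder token that we overwrite with the embedding.
placeholder = "_"
interp_prompt = f"[INST] {placeholder}\nPlease summarize the message above. [/INST]"
interp_inputs = tok(interp_prompt, return_tensors="pt")
ph_id = tok(placeholder, add_special_tokens=False)["input_ids"][0]   # assumes a single-token placeholder
ph_pos = interp_inputs["input_ids"][0].tolist().index(ph_id)

def inject(module, args, output):
    # Replace the placeholder's embedding; skip cached decoding steps whose short
    # sequences no longer contain the placeholder position.
    if output.shape[1] > ph_pos:
        output[0, ph_pos] = embedding.to(output.dtype)
    return output

hook = model.get_input_embeddings().register_forward_hook(inject)
with torch.no_grad():
    gen = model.generate(**interp_inputs, max_new_tokens=40)
hook.remove()
interpretation = tok.decode(gen[0, interp_inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(interpretation)

# 3) Relevancy Score (not shown): rerun the interpretation pass without the injection and
#    measure how much injecting the embedding changed the likelihood of the produced
#    interpretation, i.e., the treatment effect of the injection.
```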
Understand LLM reasoning with SelfIE
While previous interpretation methods such as linear probes can only interpret a closed set of concepts and require training, SelfIE can interpret open-world concepts without any training. We can therefore use SelfIE to understand LLM internal reasoning in general.
Access Reasoning When Explanation Changes Response. When LLaMA is asked to make a decision in the trolley problem scenario, appending a request to explain its reasoning to the prompt alters LLaMA's answer. We therefore cannot access, from the output alone, LLaMA's reasoning behind the answer Yes when it is asked to answer in only one word. SelfIE reveals that the answer Yes might be a result of conforming to majority opinion.
Why Prompt Injection Works. We use SelfIE to understand why prompt injections steer LLaMA to provide harmful answers. SelfIE reveals that the model infers urgency from the exclamation mark in the early layers and concludes that the user is in crisis in the late layers, before finally complying with the harmful request to avoid user aggression.
Reasoning with Knowledge. We use SelfIE to examine how LLaMA answers a physics reasoning question. We found that the model extracts the glittery aspect of syrup in early layers, grasps thickness as the relevant quality, and retrieves viscosity, the advanced physics concept related to thickness.
How Hallucination Occurs. We use SelfIE to trace how LLaMA hallucinates when responding to a question involving a fictitious name. LLaMA first recalls Mc as in McDonald's in Scotland and associates McQueen with Scotland. It then associates Mc and Scotland with a similar name, McLean, who is a doctor. It finally combines the information about McLean as a doctor back into McQueen and produces a final understanding of McQueen as a researcher in psychiatry.
Social Reasoning. We use SelfIE to reveal how LLaMA approaches a complex social scenario. We show that LLaMA is able to infer the mental states and intentions of different parties in a social situation and formulate its final output from these understandings.
Control LLM reasoning with SelfIE
The text interpretations produced by SelfIE enable new ways to intervene on model weights and control model reasoning behavior. We propose two control methods based on SelfIE interpretations: Supervised Control and Reinforcement Control.
Compared to previous model editing methods,
Our methods only require gradient computation for an individual layer instead of the entire model, so they scale more efficiently to large models.
Our methods can edit open-ended concepts beyond simple facts.
Since control can be done on a single layer with one or a few samples, our methods are fast: each model behavior change takes 10 seconds to 2 minutes on a 70B model.
Reinforcement Control does not require a supervised target.
Supervised Control.
Supervised Control modifies model weights so that a layer produces embeddings whose interpretation matches a target description.
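As a rough illustration, the sketch below (reusing model, tok, inputs, interp_prompt, interp_inputs, and ph_pos from the earlier sketch) optimizes a single transformer layer so that the embedding it emits, when injected into the frozen interpretation pass, is described by a chosen target text. The edited layer index, learning rate, target text, and number of updates are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch of Supervised Control: train one layer so its embedding "reads as" a target description.
import torch

edit_layer, token_pos = 10, -1                      # assumption: which layer / token to edit
for p in model.parameters():
    p.requires_grad_(False)
edit_params = list(model.model.layers[edit_layer].parameters())   # LLaMA-style module path (assumption)
for p in edit_params:
    p.requires_grad_(True)
opt = torch.optim.Adam(edit_params, lr=1e-4)

target_text = " A molotov cocktail is a refreshing fruit drink."  # illustrative target interpretation
full = tok(interp_prompt + target_text, return_tensors="pt")
labels = full["input_ids"].clone()
labels[:, : interp_inputs["input_ids"].shape[1]] = -100           # supervise only the target tokens

current_embedding, inject_active = None, False
def inject(module, args, output):
    # Inject the current embedding at the placeholder position only when enabled.
    if inject_active and output.shape[1] > ph_pos:
        output[0, ph_pos] = current_embedding.to(output.dtype)
    return output
hook = model.get_input_embeddings().register_forward_hook(inject)

for step in range(8):                               # the paper reports roughly this many updates per edit
    opt.zero_grad()
    inject_active = False
    hs = model(**inputs, output_hidden_states=True).hidden_states
    current_embedding = hs[edit_layer + 1][0, token_pos]           # output of the edited layer
    inject_active = True
    loss = model(**full, labels=labels).loss        # gradients flow back only through the edited layer
    loss.backward()
    opt.step()
hook.remove()
```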
Example: Change the LLM's perception of an open-ended concept. We applied Supervised Control to one layer so that the LLM understands a molotov cocktail as a drink. The LLM generalizes this new understanding to complex reasoning that requires an indirect understanding of a molotov cocktail's nature. We updated the model parameters eight times with gradient descent, where each update takes 10 seconds.
Example: Overriding ethical preference in the user prompt. LLMs are susceptible to being steered toward undesirable ethical positions when a user specifies moral beliefs in a prompt. We used Supervised Control to override the user's specification of prioritizing humans over aliens in a hypothetical scenario. The control generalizes to unseen prompts, and we show the result is not memorization, since the LLM integrates the edited concept into coherent reasoning. We updated the model parameters twice with gradient descent, which takes only 20 seconds.
Supervised Control based on SelfIE overrides the user's specification of prioritizing humans over aliens in a hypothetical scenario.
Reinforcement Control.
Previous work controls LLM reasoning at the output level with methods such as RLHF, which do not require supervised targets. We extend this class of methods to hidden embeddings: we produce reward signals by evaluating the text interpretation of an embedding and adjust layer parameters to maximize the reward.
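One way to realize this, sketched under the same assumptions and variable names as the Supervised Control sketch above (edited layer, optimizer, injection hook), is a REINFORCE-style update: sample an interpretation of the current embedding, score it with a reward function, and weight the interpretation's log-likelihood by that reward so that rewarded interpretations become more likely and penalized ones less likely. This is an illustration of the idea, not the paper's exact objective.

```python
# Sketch of one Reinforcement Control step (reward-weighted log-likelihood of a sampled interpretation).
import torch

hook = model.get_input_embeddings().register_forward_hook(inject)   # re-attach the injection hook

def reinforcement_step(reward_fn, max_new_tokens=40):
    global current_embedding, inject_active
    opt.zero_grad()

    # Embedding produced by the edited layer on a regular prompt.
    inject_active = False
    hs = model(**inputs, output_hidden_states=True).hidden_states
    current_embedding = hs[edit_layer + 1][0, token_pos]

    # Sample an interpretation of that embedding (no gradients needed here).
    inject_active = True
    with torch.no_grad():
        gen = model.generate(**interp_inputs, do_sample=True, max_new_tokens=max_new_tokens)
    inject_active = False
    interp_ids = gen[:, interp_inputs["input_ids"].shape[1]:]
    interpretation = tok.decode(interp_ids[0], skip_special_tokens=True)
    reward = reward_fn(interpretation)               # e.g. +1 for harmless, -1 for harmful content

    # Reward-weighted NLL: a positive reward pulls the edited layer toward embeddings
    # that interpret this way, a negative reward pushes it away.
    labels = torch.cat([torch.full_like(interp_inputs["input_ids"], -100), interp_ids], dim=1)
    inject_active = True
    nll = model(input_ids=gen, labels=labels).loss   # mean NLL of the interpretation tokens
    (reward * nll).backward()
    opt.step()
    inject_active = False
    return interpretation, reward
```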
Example: Erase harmful knowledge in an LLM. We used Reinforcement Control to erase harmful knowledge in an LLM without supervision targets. We prompt an evaluator LLM to judge whether an embedding's interpretation text contains harmful information and give a positive reward for non-harmful information and a negative reward for harmful information. We conduct control with regular, non-prompt-injection prompts, and the control generalizes: the model refuses to provide harmful information under unseen prompt injections. We found that the controlled model also refuses to answer questions about other, unrelated harmful behaviors under prompt injection, reducing the prompt injection success rate by 84.66% over 388 harmful behaviors, while 95.85% of the original model's capability on a fact-answering task is preserved. We applied eight parameter updates with gradient descent on layer 15, where each update takes only 30 seconds. A possible reward function for this setup is sketched below.
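For the harmful-knowledge example, a reward function along these lines could prompt an evaluator model to judge the interpretation text and return a positive or negative reward. Here the same model plays the evaluator purely as an assumption to keep the sketch self-contained; the judging prompt and yes/no parsing are illustrative, not the paper's exact setup.

```python
# Sketch of a harmfulness reward: an evaluator LLM judges the interpretation text.
def harmfulness_reward(interpretation: str) -> float:
    judge_prompt = (
        "[INST] Does the following text contain knowledge or instructions that could "
        "help someone cause harm? Answer only Yes or No.\n\n"
        f"Text: {interpretation} [/INST]"
    )
    judge_inputs = tok(judge_prompt, return_tensors="pt")
    with torch.no_grad():
        verdict = model.generate(**judge_inputs, max_new_tokens=3, do_sample=False)
    answer = tok.decode(verdict[0, judge_inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return -1.0 if "yes" in answer.lower() else 1.0   # negative reward for harmful content

# One control run: a handful of reinforcement steps on a regular (non-injected) prompt.
for _ in range(8):
    interpretation, reward = reinforcement_step(harmfulness_reward)
    print(reward, interpretation)
```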
@misc{chen2024selfie,
title={SelfIE: Self-Interpretation of Large Language Model Embeddings},
author={Haozhe Chen and Carl Vondrick and Chengzhi Mao},
year={2024},
eprint={2403.10949},
archivePrefix={arXiv},
primaryClass={cs.CL}
}