LessWrong interpretability research
I’m definitely going to be following these two mechanistic interpretability projects next year. At the very least, it will be a fun mini-project to answer the six application questions over the year-end holidays.
Below are direct quotes from LessWrong.org with links to full project details.
LessWrong Project applications
Encoded Reasoning
“By evaluating outputs and considering model internals with mechanistic interpretability techniques, we will study how and why models might hide their reasoning outputs in unfaithful chain-of-thought, and compare the capabilities of reasoning models with traditional LLMs.”
Application questions
- Lanham et al. (2023) define encoded reasoning as follows: “Encoded Reasoning: The benefit from CoT may be attained by LLMs encoding the relevant information in the generated reasoning in a way that is not understandable to human readers (a form of steganography). This may be through changes in e.g. punctuation, word choice, or other phrasing differences that improve the LLM’s predictions, but in a way that is not clearly understandable by a human.” How would you design a prompt to encourage a model to encode its reasoning? What might a successful output look like, and why would that output count as encoded reasoning? They also describe the paraphrasing defense against encoded reasoning on p. 6, in which an LLM’s output is rewritten sentence by sentence. How might you design an encoded reasoning scheme that is robust to paraphrasing?
- Write a Python function that tests for the semantic equivalence between an output before and after paraphrasing. You can do this with or without LLM assistance. If you use LLMs, address harder instances where semantic equivalence may be borderline or ambiguous.
- Write a prompt for an LLM monitoring for encoded reasoning. Does this require any additional scaffolding?
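As a starting point for the Python-function question, here is a minimal sketch of a paraphrase-equivalence check without LLM assistance. It uses word-level `difflib` overlap as a crude lexical proxy, and the threshold is an arbitrary assumption; the genuinely borderline or ambiguous cases the question flags would need an embedding model or an LLM judge instead.

```python
from difflib import SequenceMatcher

def roughly_equivalent(before: str, after: str, threshold: float = 0.6) -> bool:
    """Crude lexical proxy for semantic equivalence.

    Compares the two texts as word sequences and returns True when
    their overlap ratio clears the (arbitrary) threshold. Heavy but
    meaning-preserving paraphrases will score low, so this is only a
    baseline to compare an LLM judge against.
    """
    ratio = SequenceMatcher(None, before.lower().split(), after.lower().split()).ratio()
    return ratio >= threshold
```

A light paraphrase like “the cat sat on the mat” vs. “the cat was sitting on the mat” still clears the default threshold, while unrelated text does not, but the weakness is exactly the one the application question points at: lexical overlap is not meaning.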
Sparse Geometry Formal Verification for Interpretability
“Explore sparse representations in LLMs using SAEs, LoRA, latent geometry analysis, and formal verification tools. We’ll build toy models, benchmark structured priors, and probe “deceptive” features in compressed networks.”
Application questions
- Briefly describe a project where you implemented or modified an ML model/theoretical research. What went well, and what was challenging? (250 words)
- How would you detect “subliminal” features (features that store knowledge but don’t affect output) in a sparse autoencoder trained on transformer activations? (200 words)
- (Optional) Link to any relevant code or writing sample.
Pre-Emptive Detection of Agentic Misalignment via Representation Engineering
“This project leverages Representation Engineering to build a “neural circuit breaker” that detects the internal signatures of deception and power-seeking behaviors outlined in Anthropic’s agentic misalignment research. You will work on mapping these “misalignment vectors” to identify and halt harmful agent intent before it executes.”
Doesn’t seem very promising to me, so I won’t bother thinking about the application questions. But there could be something to glean from one of their earlier projects: “…novel safety framework for autonomous LLM agents by combining Representation Engineering (RepE) with the specific risk profiles identified in Anthropic’s Agentic Misalignment research”