Jacob Dunefsky

Publications.

On this page, you can find links to my most recent publications, along with a brief description and the abstract of each.

One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs

Dunefsky, J. and Cohan, A. Conference on Language Modeling (COLM), 2025.


TL;DR: We can directly optimize steering vectors on only a single example, and the resulting vectors are effective in controlling LLM behaviors such as alignment faking, refusal, emergent misalignment, and truthfulness.

Abstract:

Steering vectors (SVs) have emerged as a promising approach for interpreting and controlling LLMs, but current methods typically require large contrastive datasets that are often impractical to construct and may capture spurious correlations. We propose directly optimizing SVs through gradient descent on a single training example, and systematically investigate how these SVs generalize. We consider several SV optimization techniques and find that the resulting SVs effectively mediate safety-relevant behaviors in multiple models. Indeed, in experiments on an alignment-faking model, we are able to optimize one-shot SVs that induce harmful behavior on benign examples and whose negations suppress harmful behavior on malign examples. And in experiments on refusal suppression, we demonstrate that one-shot optimized SVs can transfer across inputs, yielding a Harmbench attack success rate of 96.9%. Furthermore, we extend work on emergent misalignment and show that SVs optimized to induce a model to write vulnerable code cause the model to respond harmfully on unrelated open-ended prompts. Finally, we use one-shot SV optimization to investigate how an instruction-tuned LLM recovers from outputting false information, and find that this ability is independent of the model's explicit verbalization that the information was false. Overall, our findings suggest that optimizing SVs on a single example can mediate a wide array of misaligned behaviors in LLMs.
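The core idea of one-shot optimization can be sketched in miniature: add a steering vector to an activation, then run gradient descent on that vector alone so a single example's target output becomes likely. Everything below (the toy "model" as a random unembedding matrix, the dimensions, the learning rate) is illustrative and not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 10
W_unembed = rng.normal(size=(vocab, d_model)) * 0.5  # toy unembedding, not a real LLM
h = rng.normal(size=d_model)   # activation for the single training example
target = 3                     # token the steered model should predict

def loss_and_grad(v):
    """Cross-entropy loss of the steered activation, and its gradient w.r.t. v."""
    logits = W_unembed @ (h + v)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    loss = -np.log(p[target])
    onehot = np.zeros(vocab)
    onehot[target] = 1.0
    grad_v = W_unembed.T @ (p - onehot)  # d(loss)/d(v)
    return loss, grad_v

v = np.zeros(d_model)  # the steering vector, optimized from one example
for _ in range(500):
    loss, g = loss_and_grad(v)
    v -= 0.05 * g  # plain gradient descent on the vector only

final_loss, _ = loss_and_grad(v)
```

In the paper's setting the steered activation sits in a transformer's residual stream and the loss is taken over a target completion; the sketch keeps only the optimization structure.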

Transcoders find interpretable LLM feature circuits

Dunefsky, J., Chlenski, P., and Nanda, N. Neural Information Processing Systems (NeurIPS), 2024.


TL;DR: Using transcoders, a sparse and interpretable replacement for an MLP sublayer, we present a method for finding interpretable feature circuits in LLMs. This allows us to interpret how lower-level features affect higher-level ones, both on a single input and across all inputs in general.

Abstract:

A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language models difficult. In particular, interpretable features -- such as those found by sparse autoencoders (SAEs) -- are typically linear combinations of extremely many neurons, each with its own nonlinearity to account for. Circuit analysis in this setting thus either yields intractably large circuits or fails to disentangle local and global behavior. To address this we explore transcoders, which seek to faithfully approximate a densely activating MLP layer with a wider, sparsely-activating MLP layer. We introduce a novel method for using transcoders to perform weights-based circuit analysis through MLP sublayers. The resulting circuits neatly factorize into input-dependent and input-invariant terms. We then successfully train transcoders on language models with 120M, 410M, and 1.4B parameters, and find them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human-interpretability. Finally, we apply transcoders to reverse-engineer unknown circuits in the model, and we obtain novel insights regarding the "greater-than circuit" in GPT2-small. Our results suggest that transcoders can prove effective in decomposing model computations involving MLPs into interpretable circuits. Code is available at github.com/jacobdunefsky/transcoder_circuits
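The factorization described above — feature-to-feature connections splitting into an input-dependent activation times an input-invariant weight — can be illustrated with toy transcoders. The sizes and random weights below are illustrative assumptions, not trained models, and biases on the second transcoder are ignored for the pre-activation comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32  # illustrative sizes

# A transcoder approximates a dense MLP sublayer with a wider,
# sparsely-activating one: TC(x) = W_dec @ relu(W_enc @ x + b_enc)
W_enc = rng.normal(size=(d_hidden, d_model)) * 0.1
b_enc = rng.normal(size=d_hidden) * 0.1
W_dec = rng.normal(size=(d_model, d_hidden)) * 0.1

def transcoder(x):
    acts = np.maximum(0.0, W_enc @ x + b_enc)  # sparse feature activations
    return W_dec @ acts, acts

# Encoder of a (hypothetical) later-layer transcoder:
W_enc2 = rng.normal(size=(d_hidden, d_model)) * 0.1

x = rng.normal(size=d_model)
_, acts1 = transcoder(x)

# Contribution of earlier feature i to later feature j's pre-activation
# factorizes into an input-invariant weight times feature i's activation:
i, j = 4, 7
input_invariant = W_enc2[j] @ W_dec[:, i]  # fixed across all inputs
contribution = acts1[i] * input_invariant  # input-dependent part is acts1[i]

# Summing these per-feature contributions recovers the full (bias-free)
# input to later feature j, by linearity:
total = sum(acts1[k] * (W_enc2[j] @ W_dec[:, k]) for k in range(d_hidden))
direct = W_enc2[j] @ (W_dec @ acts1)
```

The linearity check at the end is the point: because everything between the two feature nonlinearities is linear, the circuit decomposes exactly into per-feature-pair terms.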

Observable Propagation: Uncovering Feature Vectors in Transformers

Dunefsky, J. and Cohan, A. International Conference on Machine Learning (ICML), 2024.


TL;DR: We present a method for finding an LLM feature vector corresponding to a given task along a computational path using at most a single input. We then show that these feature vectors can explain LLM behavior in settings like occupational gender bias, political party prediction, and programming language classification.

Abstract:

A key goal of current mechanistic interpretability research in NLP is to find linear features (also called "feature vectors") for transformers: directions in activation space corresponding to concepts that are used by a given model in its computation. Present state-of-the-art methods for finding linear features require large amounts of labelled data -- both laborious to acquire and computationally expensive to utilize. In this work, we introduce a novel method, called "observable propagation" (in short: ObsProp), for finding linear features used by transformer language models in computing a given task -- using almost no data. Our paradigm centers on the concept of observables, linear functionals corresponding to given tasks. We then introduce a mathematical theory for the analysis of feature vectors: we provide theoretical motivation for why LayerNorm nonlinearities do not affect the direction of feature vectors; we also introduce a similarity metric between feature vectors called the coupling coefficient which estimates the degree to which one feature's output correlates with another's. We use ObsProp to perform extensive qualitative investigations into several tasks, including gendered occupational bias, political party prediction, and programming language detection. Our results suggest that ObsProp surpasses traditional approaches for finding feature vectors in the low-data regime, and that ObsProp can be used to better understand the mechanisms responsible for bias in large language models. Code for experiments can be found at this http URL.
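The central mechanic — pulling an observable (a linear functional on the logits) back through a linear computational path to obtain a feature vector, with no data required — reduces to a transpose identity. The matrices below are random stand-ins for an unembedding and an intermediate linear map; they are illustrative assumptions, not model weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 20  # illustrative sizes

# An observable is a linear functional on the logits, e.g. a logit
# difference between two tokens (as in gendered-bias tasks):
W_U = rng.normal(size=(vocab, d_model))  # toy unembedding
n = np.zeros(vocab)
n[3], n[5] = 1.0, -1.0  # "token 3 minus token 5" observable

# Propagating the observable back through a linear path (here one toy
# weight matrix W_layer) gives a feature vector, since
#   n . (W_U @ W_layer @ x) = (W_layer.T @ W_U.T @ n) . x
W_layer = rng.normal(size=(d_model, d_model))
feature = W_layer.T @ (W_U.T @ n)

# The feature vector scores an activation identically to running the
# path forward and applying the observable -- for any input x:
x = rng.normal(size=d_model)
direct = n @ (W_U @ (W_layer @ x))
via_feature = feature @ x
```

Because the identity holds for every input, the feature vector is obtained from the weights alone — which is why the method needs almost no data. (Handling LayerNorm and nonlinear sublayers is where the paper's theory comes in; this sketch covers only the purely linear case.)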

Older publications

To see some of my older publications, in areas such as network traffic engineering, brain modeling, and tech-assisted psychiatry, click here.