Jacob Dunefsky

Publications.

On this page, you can find links to my most recent publications, along with a brief description and the abstract of each.

One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs

Dunefsky, J. and Cohan, A. Conference on Language Modeling (COLM), 2025.


TL;DR: We can directly optimize steering vectors on only a single example, and the resulting vectors are effective in controlling LLM behaviors such as alignment faking, refusal, emergent misalignment, and truthfulness.

Abstract:

Steering vectors (SVs) have emerged as a promising approach for interpreting and controlling LLMs, but current methods typically require large contrastive datasets that are often impractical to construct and may capture spurious correlations. We propose directly optimizing SVs through gradient descent on a single training example, and systematically investigate how these SVs generalize. We consider several SV optimization techniques and find that the resulting SVs effectively mediate safety-relevant behaviors in multiple models. Indeed, in experiments on an alignment-faking model, we are able to optimize one-shot SVs that induce harmful behavior on benign examples and whose negations suppress harmful behavior on malign examples. And in experiments on refusal suppression, we demonstrate that one-shot optimized SVs can transfer across inputs, yielding a Harmbench attack success rate of 96.9%. Furthermore, we extend work on emergent misalignment and show that SVs optimized to induce a model to write vulnerable code cause the model to respond harmfully on unrelated open-ended prompts. Finally, we use one-shot SV optimization to investigate how an instruction-tuned LLM recovers from outputting false information, and find that this ability is independent of the model's explicit verbalization that the information was false. Overall, our findings suggest that optimizing SVs on a single example can mediate a wide array of misaligned behaviors in LLMs.
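The core idea of one-shot optimization can be sketched in miniature: add a steering vector to an activation, then run gradient descent on that vector alone so a single example's target output becomes likely. Everything below (the toy "model" as a random unembedding matrix, the dimensions, the learning rate) is illustrative and not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 10
W_unembed = rng.normal(size=(vocab, d_model)) * 0.5  # toy unembedding, not a real LLM
h = rng.normal(size=d_model)   # activation for the single training example
target = 3                     # token the steered model should predict

def loss_and_grad(v):
    """Cross-entropy loss of the steered activation, and its gradient w.r.t. v."""
    logits = W_unembed @ (h + v)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    loss = -np.log(p[target])
    onehot = np.zeros(vocab)
    onehot[target] = 1.0
    grad_v = W_unembed.T @ (p - onehot)  # d(loss)/d(v)
    return loss, grad_v

v = np.zeros(d_model)  # the steering vector, optimized from one example
for _ in range(500):
    loss, g = loss_and_grad(v)
    v -= 0.05 * g  # plain gradient descent on the vector only

final_loss, _ = loss_and_grad(v)
```

In the paper's setting the steered activation sits in a transformer's residual stream and the loss is taken over a target completion; the sketch keeps only the optimization structure.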

Transcoders find interpretable LLM feature circuits

Dunefsky, J., Chlenski, P., and Nanda, N. Neural Information Processing Systems (NeurIPS), 2024.


TL;DR: Using transcoders, a sparse and interpretable replacement for an MLP sublayer, we present a method for finding interpretable feature circuits in LLMs. This allows us to interpret how lower-level features affect higher-level ones, both on a single input and across all inputs in general.

Abstract:

A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language models difficult. In particular, interpretable features -- such as those found by sparse autoencoders (SAEs) -- are typically linear combinations of extremely many neurons, each with its own nonlinearity to account for. Circuit analysis in this setting thus either yields intractably large circuits or fails to disentangle local and global behavior. To address this we explore transcoders, which seek to faithfully approximate a densely activating MLP layer with a wider, sparsely-activating MLP layer. We introduce a novel method for using transcoders to perform weights-based circuit analysis through MLP sublayers. The resulting circuits neatly factorize into input-dependent and input-invariant terms. We then successfully train transcoders on language models with 120M, 410M, and 1.4B parameters, and find them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human-interpretability. Finally, we apply transcoders to reverse-engineer unknown circuits in the model, and we obtain novel insights regarding the "greater-than circuit" in GPT2-small. Our results suggest that transcoders can prove effective in decomposing model computations involving MLPs into interpretable circuits. Code is available at github.com/jacobdunefsky/transcoder_circuits
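The factorization described above — feature-to-feature connections splitting into an input-dependent activation times an input-invariant weight — can be illustrated with toy transcoders. The sizes and random weights below are illustrative assumptions, not trained models, and biases on the second transcoder are ignored for the pre-activation comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32  # illustrative sizes

# A transcoder approximates a dense MLP sublayer with a wider,
# sparsely-activating one: TC(x) = W_dec @ relu(W_enc @ x + b_enc)
W_enc = rng.normal(size=(d_hidden, d_model)) * 0.1
b_enc = rng.normal(size=d_hidden) * 0.1
W_dec = rng.normal(size=(d_model, d_hidden)) * 0.1

def transcoder(x):
    acts = np.maximum(0.0, W_enc @ x + b_enc)  # sparse feature activations
    return W_dec @ acts, acts

# Encoder of a (hypothetical) later-layer transcoder:
W_enc2 = rng.normal(size=(d_hidden, d_model)) * 0.1

x = rng.normal(size=d_model)
_, acts1 = transcoder(x)

# Contribution of earlier feature i to later feature j's pre-activation
# factorizes into an input-invariant weight times feature i's activation:
i, j = 4, 7
input_invariant = W_enc2[j] @ W_dec[:, i]  # fixed across all inputs
contribution = acts1[i] * input_invariant  # input-dependent part is acts1[i]

# Summing these per-feature contributions recovers the full (bias-free)
# input to later feature j, by linearity:
total = sum(acts1[k] * (W_enc2[j] @ W_dec[:, k]) for k in range(d_hidden))
direct = W_enc2[j] @ (W_dec @ acts1)
```

The linearity check at the end is the point: because everything between the two feature nonlinearities is linear, the circuit decomposes exactly into per-feature-pair terms.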

Observable Propagation: Uncovering Feature Vectors in Transformers

Dunefsky, J. and Cohan, A. International Conference on Machine Learning (ICML), 2024.


TL;DR: We present a method for finding an LLM feature vector corresponding to a given task along a computational path using at most a single input. We then show that these feature vectors can explain LLM behavior in settings like occupational gender bias, political party prediction, and programming language classification.

Abstract:

A key goal of current mechanistic interpretability research in NLP is to find linear features (also called "feature vectors") for transformers: directions in activation space corresponding to concepts that are used by a given model in its computation. Present state-of-the-art methods for finding linear features require large amounts of labelled data -- both laborious to acquire and computationally expensive to utilize. In this work, we introduce a novel method, called "observable propagation" (in short: ObsProp), for finding linear features used by transformer language models in computing a given task -- using almost no data. Our paradigm centers on the concept of observables, linear functionals corresponding to given tasks. We then introduce a mathematical theory for the analysis of feature vectors: we provide theoretical motivation for why LayerNorm nonlinearities do not affect the direction of feature vectors; we also introduce a similarity metric between feature vectors called the coupling coefficient which estimates the degree to which one feature's output correlates with another's. We use ObsProp to perform extensive qualitative investigations into several tasks, including gendered occupational bias, political party prediction, and programming language detection. Our results suggest that ObsProp surpasses traditional approaches for finding feature vectors in the low-data regime, and that ObsProp can be used to better understand the mechanisms responsible for bias in large language models. Code for experiments can be found at this http URL.
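The central mechanic — pulling an observable (a linear functional on the logits) back through a linear computational path to obtain a feature vector, with no data required — reduces to a transpose identity. The matrices below are random stand-ins for an unembedding and an intermediate linear map; they are illustrative assumptions, not model weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 20  # illustrative sizes

# An observable is a linear functional on the logits, e.g. a logit
# difference between two tokens (as in gendered-bias tasks):
W_U = rng.normal(size=(vocab, d_model))  # toy unembedding
n = np.zeros(vocab)
n[3], n[5] = 1.0, -1.0  # "token 3 minus token 5" observable

# Propagating the observable back through a linear path (here one toy
# weight matrix W_layer) gives a feature vector, since
#   n . (W_U @ W_layer @ x) = (W_layer.T @ W_U.T @ n) . x
W_layer = rng.normal(size=(d_model, d_model))
feature = W_layer.T @ (W_U.T @ n)

# The feature vector scores an activation identically to running the
# path forward and applying the observable -- for any input x:
x = rng.normal(size=d_model)
direct = n @ (W_U @ (W_layer @ x))
via_feature = feature @ x
```

Because the identity holds for every input, the feature vector is obtained from the weights alone — which is why the method needs almost no data. (Handling LayerNorm and nonlinear sublayers is where the paper's theory comes in; this sketch covers only the purely linear case.)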

Older publications

To see some of my older publications, in areas such as network traffic engineering, brain modeling, and tech-assisted psychiatry, click here.