<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<title>MI Workshop Orals and Poster Sessions</title>
<!-- Set up meta-information such as the description and title -->
<meta name="description" content="Orals and poster sessions for the ICML 2024 Mechanistic Interpretability Workshop" />
<meta name="keywords" content="ICML, Mechanistic Interpretability, Workshop" />
<meta name="author" content="ICML 2024 Mechanistic Interpretability" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<!-- Load the Gothic A1 font -->
<link href="https://fonts.googleapis.com/css?family=Gothic+A1:400,700&amp;display=swap" rel="stylesheet" />
<!-- Load style.css -->
<link rel="stylesheet" href="style.css" />
</head>
<body>
<header>
<h1 class="fade-in">Mechanistic Interpretability Workshop 2024</h1>
<h1 class="fade-in">Orals and Poster Sessions</h1>
<!-- Date shown in white text -->
<h2 class="fade-in" style="color: white;">July 27, 2024</h2>
<p class="fade-in"></p>
</header>
<section>
<h2 id="orals">Oral Presentations: 9:30 AM - 10:30 AM</h2>
<p>The following papers will be presented as oral presentations:</p>
<ul>
<li><b><a href="https://openreview.net/forum?id=ibSNv9cldu">Hypothesis Testing the Circuit Hypothesis in LLMs</a></b></li>
<li><b><a href="https://openreview.net/forum?id=KXuYjuBzKo">The Geometry of Categorical and Hierarchical Concepts in Large Language Models</a></b></li>
<li><b><a href="https://openreview.net/forum?id=P7MW0FahEq">InversionView: A General-Purpose Method for Reading Information from Neural Activations</a></b></li>
<li><b><a href="https://openreview.net/forum?id=pJs3ZiKBM5">Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks</a></b></li>
<li><b><a href="https://openreview.net/forum?id=qzsDKwGJyB">Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models</a></b></li>
</ul>
</section>
<section>
<h2 id="posters-1">Poster Session 1: 11:00 AM - 12:00 PM</h2>
<p>Spotlighted papers assigned to poster session 1 will be presented between 10:30 AM and 11:00 AM, immediately before the poster session.</p>
<ul>
<li><b><a href="https://openreview.net/forum?id=TcMmriVrgs">Comgra: A Tool for Analyzing and Debugging Neural Networks</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=bcV7rhBEcM">Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=pXx3nDi8hI">Crafting Large Language Models for Enhanced Interpretability</a></b></li>
<li><b><a href="https://openreview.net/forum?id=YwLgSimUIT">Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=DeRYaFUSjh">Mechanistic Interpretability of Binary and Ternary Transformer Networks</a></b></li>
<li><b><a href="https://openreview.net/forum?id=iPeCUgiCgd">How Truncating Weights Improves Reasoning in Language Models</a></b></li>
<li><b><a href="https://openreview.net/forum?id=oZXcwWTCfe">Does Editing Provide Evidence for Localization?</a></b></li>
<li><b><a href="https://openreview.net/forum?id=KSvWtLOOjL">Modularity in Biologically Inspired Representations Depends on Task Variable Range Independence</a></b></li>
<li><b><a href="https://openreview.net/forum?id=4B5Ovl9MLE">Compact Proofs of Model Performance via Mechanistic Interpretability</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=CsF3PwBN6N">Dissecting Query-Key Interaction in Vision Transformers</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=knrYGCXAfK">How Do Transformers Fill in the Blanks? A Case Study on Matrix Completion</a></b></li>
<li><b><a href="https://openreview.net/forum?id=KVSgEXrMDH">Penzai + Treescope: A Toolkit for Interpreting, Visualizing, and Editing Models As Data</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=npaH0qZzGo">Localizing Auditory Concepts in CNNs</a></b></li>
<li><b><a href="https://openreview.net/forum?id=vNubZ5zK8h">TracrBench: Generating Interpretability Testbeds with Large Language Models</a></b></li>
<li><b><a href="https://openreview.net/forum?id=GWqzUR2dOX">Transcoders find interpretable LLM feature circuits</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=gz0r3w71zQ">Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=ibSNv9cldu">Hypothesis Testing the Circuit Hypothesis in LLMs</a> (Oral)</b></li>
<li><b><a href="https://openreview.net/forum?id=D66dtunCnP">Iteration Head: A Mechanistic Study of Chain-of-Thought</a></b></li>
<li><b><a href="https://openreview.net/forum?id=0ku2hIm4BS">How do Llamas process multilingual text? A latent exploration through activation patching</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=tXe9BqcjNY">Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents</a></b></li>
<li><b><a href="https://openreview.net/forum?id=yEwEVoH9Be">Benchmarking Mental State Representations in Language Models</a></b></li>
<li><b><a href="https://openreview.net/forum?id=lq7ZaYuwub">Investigating the Indirect Object Identification circuit in Mamba</a></b></li>
<li><b><a href="https://openreview.net/forum?id=I5E9ZZNBjT">Adversarial Circuit Evaluation</a></b></li>
<li><b><a href="https://openreview.net/forum?id=KXuYjuBzKo">The Geometry of Categorical and Hierarchical Concepts in Large Language Models</a> (Oral)</b></li>
<li><b><a href="https://openreview.net/forum?id=uS8YXfnsqC">On the Similarity of Circuits across Languages: a Case Study on the Subject-verb Agreement Task</a></b></li>
<li><b><a href="https://openreview.net/forum?id=D370dqD6w6">Segmentation CNNs are denoising models</a></b></li>
<li><b><a href="https://openreview.net/forum?id=EqF16oDVFf">Refusal in Language Models Is Mediated by a Single Direction</a></b></li>
<li><b><a href="https://openreview.net/forum?id=njmXdqzHJq">Why do recurrent neural networks suddenly learn? Bifurcation mechanisms in neuro-inspired short-term memory tasks</a></b></li>
<li><b><a href="https://openreview.net/forum?id=2g84EvFlRt">CoSy: Evaluating Textual Explanations of Neurons</a></b></li>
<li><b><a href="https://openreview.net/forum?id=jnCM5EHd2H">Transformers on Markov data: Constant depth suffices</a></b></li>
<li><b><a href="https://openreview.net/forum?id=P7MW0FahEq">InversionView: A General-Purpose Method for Reading Information from Neural Activations</a> (Oral)</b></li>
<li><b><a href="https://openreview.net/forum?id=tUh4wB8ZWB">Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers</a></b></li>
<li><b><a href="https://openreview.net/forum?id=zzCEiUIPk9">Relational Composition in Neural Networks: A Survey and Call to Action</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=pJs3ZiKBM5">Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks</a> (Oral)</b></li>
<li><b><a href="https://openreview.net/forum?id=AisfhabaVd">Understanding Inhibition through Maximally Tense Images</a></b></li>
<li><b><a href="https://openreview.net/forum?id=7XtzYaSDi3">Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent</a></b></li>
<li><b><a href="https://openreview.net/forum?id=Os3z6Oczuu">Loss in the Crowd: Hidden Breakthroughs in Language Model Training</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=YXhVojPivQ">InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques</a></b></li>
<li><b><a href="https://openreview.net/forum?id=Xsf6dOOMMc">Language Models Linearly Represent Sentiment</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=grXgesr5dT">Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=Ppj5KvzU8Q">Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders</a></b></li>
<li><b><a href="https://openreview.net/forum?id=HeuQh5baef">Extracting Finite State Machines from Transformers</a></b></li>
<li><b><a href="https://openreview.net/forum?id=T9sB3S2hok">Planning behavior in a recurrent neural network that plays Sokoban</a></b></li>
<li><b><a href="https://openreview.net/forum?id=Kr6nkNa4TQ">Exploring the Internal Mechanisms of Music LLMs: A Study of Root and Quality via Probing and Intervention Techniques</a></b></li>
<li><b><a href="https://openreview.net/forum?id=R2sVqqTf9p">Grokking and the Geometry of Circuit Formation</a></b></li>
<li><b><a href="https://openreview.net/forum?id=xi6lie0SUr">Attention with Markov: A Curious Case of Single-layer Transformers</a></b></li>
</ul>
</section>
<section>
<h2 id="posters-2">Poster Session 2: 2:30 PM - 3:30 PM</h2>
<p>Spotlighted papers assigned to poster session 2 will be presented between 2:00 PM and 2:30 PM, immediately before the poster session.</p>
<ul>
<li><b><a href="https://openreview.net/forum?id=3eBdq2n848">Controlling Large Language Model Agents with Entropic Activation Steering</a></b></li>
<li><b><a href="https://openreview.net/forum?id=fewUBDwjji">Interpreting Attention Layer Outputs with Sparse Autoencoders</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=rngMb1wDOZ">ReLU MLPs Can Compute Numerical Integration: Mechanistic Interpretation of a Non-linear Activation</a></b></li>
<li><b><a href="https://openreview.net/forum?id=7PZgCems9w">Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Large Language Models</a></b></li>
<li><b><a href="https://openreview.net/forum?id=OcVJP8kClR">Mathematical Models of Computation in Superposition</a></b></li>
<li><b><a href="https://openreview.net/forum?id=50SMcZ8QQf">Describe-and-Dissect: Interpreting Neurons in Vision Networks with Language Models</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=DRrzq93Y5Y">Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=TfYnD2gYRO">Logical Distillation of Graph Neural Networks</a></b></li>
<li><b><a href="https://openreview.net/forum?id=ns8IH5Sn5y">Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=BS2CbUkJpy">What Makes and Breaks Safety Fine-tuning? A Mechanistic Study</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=DwhvppIZsD">Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=qIGjNHp6Gf">Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller</a></b></li>
<li><b><a href="https://openreview.net/forum?id=IGnoozsfj1">The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=K1RpWU9wuq">From Alexnet to Transformers: Measuring the Non-linearity of Deep Neural Networks with Affine Optimal Transport</a></b></li>
<li><b><a href="https://openreview.net/forum?id=0WVNCXvjdn">Understanding Counting in Small Transformers: The Interplay between Attention and Feed-Forward Layers</a></b></li>
<li><b><a href="https://openreview.net/forum?id=H1GbVU9BsK">Is Transformer a Stochastic Parrot? A Case Study in Simple Arithmetic Task</a></b></li>
<li><b><a href="https://openreview.net/forum?id=akCsMk4dDL">Analyzing the Generalization and Reliability of Steering Vectors</a></b></li>
<li><b><a href="https://openreview.net/forum?id=rB0GsxS5V3">Confidence Regulation Neurons in Language Models</a></b></li>
<li><b><a href="https://openreview.net/forum?id=kUGkpykJdh">Investigating the Interpretability of Biometric Face Templates Using Gated Sparse Autoencoders and Differentiable Image Parametrizations</a></b></li>
<li><b><a href="https://openreview.net/forum?id=ll2NIkyYzA">Manipulating Feature Visualizations with Gradient Slingshots</a></b></li>
<li><b><a href="https://openreview.net/forum?id=D8MDzUVlWA">Using Degeneracy in the Loss Landscape for Mechanistic Interpretability</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=bKawydfGhb">An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L</a></b></li>
<li><b><a href="https://openreview.net/forum?id=yzATs7WLZ0">Representing Rule-based Chatbots with Transformers</a></b></li>
<li><b><a href="https://openreview.net/forum?id=LgWSBMf17O">Tackling Polysemanticity with Neuron Embeddings</a></b></li>
<li><b><a href="https://openreview.net/forum?id=briEoJFKof">Interpretability analysis on a pathology foundation model reveals biologically relevant embeddings across modalities</a></b></li>
<li><b><a href="https://openreview.net/forum?id=Q4NH6hEPIX">Information-Theoretic Progress Measures reveal Grokking is an Emergent Phase Transition</a></b></li>
<li><b><a href="https://openreview.net/forum?id=2WfiYQlZDa">Survival of the Fittest Representation: A Case Study with Modular Addition</a></b></li>
<li><b><a href="https://openreview.net/forum?id=F5aRMT4lTq">Weight-based Decomposition: A Case for Bilinear MLPs</a></b></li>
<li><b><a href="https://openreview.net/forum?id=qzsDKwGJyB">Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models</a> (Oral)</b></li>
<li><b><a href="https://openreview.net/forum?id=jZDJwXrEvU">Neuroplasticity and Corruption in Model Mechanisms: A case study of Indirect Object Identification</a></b></li>
<li><b><a href="https://openreview.net/forum?id=6NHnsjsYXH">Grokking, Rank Minimization and Generalization in Deep Learning</a></b></li>
<li><b><a href="https://openreview.net/forum?id=R5unwb9KPc">The Remarkable Robustness of LLMs: Stages of Inference?</a></b></li>
<li><b><a href="https://openreview.net/forum?id=kiNU4jwUoW">The Concept Percolation Hypothesis: Analyzing the Emergence of Capabilities in Neural Networks Trained on Formal Grammars</a></b></li>
<li><b><a href="https://openreview.net/forum?id=1WeLXvaNJP">LLM Circuit Analyses Are Consistent Across Training and Scale</a></b></li>
<li><b><a href="https://openreview.net/forum?id=06pNzrEjnH">Robust Unlearning via Mechanistic Localizations</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=5Eas7HCe38">Tokenized SAEs: Disentangling SAE Reconstructions</a></b></li>
<li><b><a href="https://openreview.net/forum?id=XZ6dLwEZtq">Finding Visual Task Vectors</a></b></li>
<li><b><a href="https://openreview.net/forum?id=9pwlOZneBh">Progressive distillation improves feature learning via implicit curriculum</a></b></li>
<li><b><a href="https://openreview.net/forum?id=R5Q5lANcjY">Learning and Unlearning of Fabricated Knowledge in Language Models</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=y9T6Bi7lTg">Visualizing Neural Network Imagination</a></b></li>
<li><b><a href="https://openreview.net/forum?id=kXRYju6Jtt">Cluster-Norm for Unsupervised Probing of Knowledge</a></b></li>
<li><b><a href="https://openreview.net/forum?id=JdrVuEQih5">Sparse Autoencoders Match Supervised Features for Model Steering on the IOI Task</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=TTVPbaxXjR">Faithful and Fast Influence Function via Advanced Sampling</a> (Spotlight)</b></li>
</ul>
</section>
</body>
</html>