<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<title>MI Workshop Orals and Poster Sessions</title>
<!-- Set up meta-information such as the description and title -->
<meta name="description" content="Orals and poster sessions for the ICML 2024 Mechanistic Interpretability Workshop" />
<meta name="keywords" content="ICML, Mechanistic Interpretability, Workshop" />
<meta name="author" content="ICML 2024 Mechanistic Interpretability" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<!-- Load the Gothic A1 font -->
<link href="https://fonts.googleapis.com/css?family=Gothic+A1:400,700&amp;display=swap" rel="stylesheet" />
<!-- Load style.css -->
<link rel="stylesheet" href="style.css" />
</head>
<body>
<header>
<h1 class="fade-in">Mechanistic Interpretability Workshop 2024</h1>
<h1 class="fade-in">Orals and Poster Sessions</h1>
<!-- Date shown in white text -->
<h2 class="fade-in" style="color: white;">July 27, 2024</h2>
<p class="fade-in"></p>
</header>
<section>
<h2 id="orals">Oral Presentations: 9:30 AM - 10:30 AM</h2>
<p>The following papers will be presented as oral presentations:</p>
<ul>
<li><b><a href="https://openreview.net/forum?id=ibSNv9cldu">Hypothesis Testing the Circuit Hypothesis in LLMs</a></b></li>
<li><b><a href="https://openreview.net/forum?id=KXuYjuBzKo">The Geometry of Categorical and Hierarchical Concepts in Large Language Models</a></b></li>
<li><b><a href="https://openreview.net/forum?id=P7MW0FahEq">InversionView: A General-Purpose Method for Reading Information from Neural Activations</a></b></li>
<li><b><a href="https://openreview.net/forum?id=pJs3ZiKBM5">Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks</a></b></li>
<li><b><a href="https://openreview.net/forum?id=qzsDKwGJyB">Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models</a></b></li>
</ul>
</section>
<section>
<h2 id="posters-1">Poster Session 1: 11:00 AM - 12:00 PM</h2>
<p>Spotlighted papers assigned to poster session 1 will be presented between 10:30 AM and 11:00 AM, immediately before the poster session.</p>
<ul>
<li><b><a href="https://openreview.net/forum?id=TcMmriVrgs">Comgra: A Tool for Analyzing and Debugging Neural Networks</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=bcV7rhBEcM">Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=pXx3nDi8hI">Crafting Large Language Models for Enhanced Interpretability</a></b></li>
<li><b><a href="https://openreview.net/forum?id=YwLgSimUIT">Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=DeRYaFUSjh">Mechanistic Interpretability of Binary and Ternary Transformer Networks</a></b></li>
<li><b><a href="https://openreview.net/forum?id=iPeCUgiCgd">How Truncating Weights Improves Reasoning in Language Models</a></b></li>
<li><b><a href="https://openreview.net/forum?id=oZXcwWTCfe">Does Editing Provide Evidence for Localization?</a></b></li>
<li><b><a href="https://openreview.net/forum?id=KSvWtLOOjL">Modularity in Biologically Inspired Representations Depends on Task Variable Range Independence</a></b></li>
<li><b><a href="https://openreview.net/forum?id=4B5Ovl9MLE">Compact Proofs of Model Performance via Mechanistic Interpretability</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=CsF3PwBN6N">Dissecting Query-Key Interaction in Vision Transformers</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=knrYGCXAfK">How Do Transformers Fill in the Blanks? A Case Study on Matrix Completion</a></b></li>
<li><b><a href="https://openreview.net/forum?id=KVSgEXrMDH">Penzai + Treescope: A Toolkit for Interpreting, Visualizing, and Editing Models As Data</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=npaH0qZzGo">Localizing Auditory Concepts in CNNs</a></b></li>
<li><b><a href="https://openreview.net/forum?id=vNubZ5zK8h">TracrBench: Generating Interpretability Testbeds with Large Language Models</a></b></li>
<li><b><a href="https://openreview.net/forum?id=GWqzUR2dOX">Transcoders find interpretable LLM feature circuits</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=gz0r3w71zQ">Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=ibSNv9cldu">Hypothesis Testing the Circuit Hypothesis in LLMs</a> (Oral)</b></li>
<li><b><a href="https://openreview.net/forum?id=D66dtunCnP">Iteration Head: A Mechanistic Study of Chain-of-Thought</a></b></li>
<li><b><a href="https://openreview.net/forum?id=0ku2hIm4BS">How do Llamas process multilingual text? A latent exploration through activation patching</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=tXe9BqcjNY">Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents</a></b></li>
<li><b><a href="https://openreview.net/forum?id=yEwEVoH9Be">Benchmarking Mental State Representations in Language Models</a></b></li>
<li><b><a href="https://openreview.net/forum?id=lq7ZaYuwub">Investigating the Indirect Object Identification circuit in Mamba</a></b></li>
<li><b><a href="https://openreview.net/forum?id=I5E9ZZNBjT">Adversarial Circuit Evaluation</a></b></li>
<li><b><a href="https://openreview.net/forum?id=KXuYjuBzKo">The Geometry of Categorical and Hierarchical Concepts in Large Language Models</a> (Oral)</b></li>
<li><b><a href="https://openreview.net/forum?id=uS8YXfnsqC">On the Similarity of Circuits across Languages: a Case Study on the Subject-verb Agreement Task</a></b></li>
<li><b><a href="https://openreview.net/forum?id=D370dqD6w6">Segmentation CNNs are denoising models</a></b></li>
<li><b><a href="https://openreview.net/forum?id=EqF16oDVFf">Refusal in Language Models Is Mediated by a Single Direction</a></b></li>
<li><b><a href="https://openreview.net/forum?id=njmXdqzHJq">Why do recurrent neural networks suddenly learn? Bifurcation mechanisms in neuro-inspired short-term memory tasks</a></b></li>
<li><b><a href="https://openreview.net/forum?id=2g84EvFlRt">CoSy: Evaluating Textual Explanations of Neurons</a></b></li>
<li><b><a href="https://openreview.net/forum?id=jnCM5EHd2H">Transformers on Markov data: Constant depth suffices</a></b></li>
<li><b><a href="https://openreview.net/forum?id=P7MW0FahEq">InversionView: A General-Purpose Method for Reading Information from Neural Activations</a> (Oral)</b></li>
<li><b><a href="https://openreview.net/forum?id=tUh4wB8ZWB">Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers</a></b></li>
<li><b><a href="https://openreview.net/forum?id=zzCEiUIPk9">Relational Composition in Neural Networks: A Survey and Call to Action</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=pJs3ZiKBM5">Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks</a> (Oral)</b></li>
<li><b><a href="https://openreview.net/forum?id=AisfhabaVd">Understanding Inhibition through Maximally Tense Images</a></b></li>
<li><b><a href="https://openreview.net/forum?id=7XtzYaSDi3">Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent</a></b></li>
<li><b><a href="https://openreview.net/forum?id=Os3z6Oczuu">Loss in the Crowd: Hidden Breakthroughs in Language Model Training</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=YXhVojPivQ">InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques</a></b></li>
<li><b><a href="https://openreview.net/forum?id=Xsf6dOOMMc">Language Models Linearly Represent Sentiment</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=grXgesr5dT">Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=Ppj5KvzU8Q">Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders</a></b></li>
<li><b><a href="https://openreview.net/forum?id=HeuQh5baef">Extracting Finite State Machines from Transformers</a></b></li>
<li><b><a href="https://openreview.net/forum?id=T9sB3S2hok">Planning behavior in a recurrent neural network that plays Sokoban</a></b></li>
<li><b><a href="https://openreview.net/forum?id=Kr6nkNa4TQ">Exploring the Internal Mechanisms of Music LLMs: A Study of Root and Quality via Probing and Intervention Techniques</a></b></li>
<li><b><a href="https://openreview.net/forum?id=R2sVqqTf9p">Grokking and the Geometry of Circuit Formation</a></b></li>
<li><b><a href="https://openreview.net/forum?id=xi6lie0SUr">Attention with Markov: A Curious Case of Single-layer Transformers</a></b></li>
</ul>
</section>
<section>
<h2 id="posters-2">Poster Session 2: 2:30 PM - 3:30 PM</h2>
<p>Spotlighted papers assigned to poster session 2 will be presented between 2:00 PM and 2:30 PM, immediately before the poster session.</p>
<ul>
<li><b><a href="https://openreview.net/forum?id=3eBdq2n848">Controlling Large Language Model Agents with Entropic Activation Steering</a></b></li>
<li><b><a href="https://openreview.net/forum?id=fewUBDwjji">Interpreting Attention Layer Outputs with Sparse Autoencoders</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=rngMb1wDOZ">ReLU MLPs Can Compute Numerical Integration: Mechanistic Interpretation of a Non-linear Activation</a></b></li>
<li><b><a href="https://openreview.net/forum?id=7PZgCems9w">Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Large Language Models</a></b></li>
<li><b><a href="https://openreview.net/forum?id=OcVJP8kClR">Mathematical Models of Computation in Superposition</a></b></li>
<li><b><a href="https://openreview.net/forum?id=50SMcZ8QQf">Describe-and-Dissect: Interpreting Neurons in Vision Networks with Language Models</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=DRrzq93Y5Y">Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=TfYnD2gYRO">Logical Distillation of Graph Neural Networks</a></b></li>
<li><b><a href="https://openreview.net/forum?id=ns8IH5Sn5y">Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=BS2CbUkJpy">What Makes and Breaks Safety Fine-tuning? A Mechanistic Study</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=DwhvppIZsD">Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=qIGjNHp6Gf">Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller</a></b></li>
<li><b><a href="https://openreview.net/forum?id=IGnoozsfj1">The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=K1RpWU9wuq">From Alexnet to Transformers: Measuring the Non-linearity of Deep Neural Networks with Affine Optimal Transport</a></b></li>
<li><b><a href="https://openreview.net/forum?id=0WVNCXvjdn">Understanding Counting in Small Transformers: The Interplay between Attention and Feed-Forward Layers</a></b></li>
<li><b><a href="https://openreview.net/forum?id=H1GbVU9BsK">Is Transformer a Stochastic Parrot? A Case Study in Simple Arithmetic Task</a></b></li>
<li><b><a href="https://openreview.net/forum?id=akCsMk4dDL">Analyzing the Generalization and Reliability of Steering Vectors</a></b></li>
<li><b><a href="https://openreview.net/forum?id=rB0GsxS5V3">Confidence Regulation Neurons in Language Models</a></b></li>
<li><b><a href="https://openreview.net/forum?id=kUGkpykJdh">Investigating the Interpretability of Biometric Face Templates Using Gated Sparse Autoencoders and Differentiable Image Parametrizations</a></b></li>
<li><b><a href="https://openreview.net/forum?id=ll2NIkyYzA">Manipulating Feature Visualizations with Gradient Slingshots</a></b></li>
<li><b><a href="https://openreview.net/forum?id=D8MDzUVlWA">Using Degeneracy in the Loss Landscape for Mechanistic Interpretability</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=bKawydfGhb">An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L</a></b></li>
<li><b><a href="https://openreview.net/forum?id=yzATs7WLZ0">Representing Rule-based Chatbots with Transformers</a></b></li>
<li><b><a href="https://openreview.net/forum?id=LgWSBMf17O">Tackling Polysemanticity with Neuron Embeddings</a></b></li>
<li><b><a href="https://openreview.net/forum?id=briEoJFKof">Interpretability analysis on a pathology foundation model reveals biologically relevant embeddings across modalities</a></b></li>
<li><b><a href="https://openreview.net/forum?id=Q4NH6hEPIX">Information-Theoretic Progress Measures reveal Grokking is an Emergent Phase Transition</a></b></li>
<li><b><a href="https://openreview.net/forum?id=2WfiYQlZDa">Survival of the Fittest Representation: A Case Study with Modular Addition</a></b></li>
<li><b><a href="https://openreview.net/forum?id=F5aRMT4lTq">Weight-based Decomposition: A Case for Bilinear MLPs</a></b></li>
<li><b><a href="https://openreview.net/forum?id=qzsDKwGJyB">Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models</a> (Oral)</b></li>
<li><b><a href="https://openreview.net/forum?id=jZDJwXrEvU">Neuroplasticity and Corruption in Model Mechanisms: A case study of Indirect Object Identification</a></b></li>
<li><b><a href="https://openreview.net/forum?id=6NHnsjsYXH">Grokking, Rank Minimization and Generalization in Deep Learning</a></b></li>
<li><b><a href="https://openreview.net/forum?id=R5unwb9KPc">The Remarkable Robustness of LLMs: Stages of Inference?</a></b></li>
<li><b><a href="https://openreview.net/forum?id=kiNU4jwUoW">The Concept Percolation Hypothesis: Analyzing the Emergence of Capabilities in Neural Networks Trained on Formal Grammars</a></b></li>
<li><b><a href="https://openreview.net/forum?id=1WeLXvaNJP">LLM Circuit Analyses Are Consistent Across Training and Scale</a></b></li>
<li><b><a href="https://openreview.net/forum?id=06pNzrEjnH">Robust Unlearning via Mechanistic Localizations</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=5Eas7HCe38">Tokenized SAEs: Disentangling SAE Reconstructions</a></b></li>
<li><b><a href="https://openreview.net/forum?id=XZ6dLwEZtq">Finding Visual Task Vectors</a></b></li>
<li><b><a href="https://openreview.net/forum?id=9pwlOZneBh">Progressive distillation improves feature learning via implicit curriculum</a></b></li>
<li><b><a href="https://openreview.net/forum?id=R5Q5lANcjY">Learning and Unlearning of Fabricated Knowledge in Language Models</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=y9T6Bi7lTg">Visualizing Neural Network Imagination</a></b></li>
<li><b><a href="https://openreview.net/forum?id=kXRYju6Jtt">Cluster-Norm for Unsupervised Probing of Knowledge</a></b></li>
<li><b><a href="https://openreview.net/forum?id=JdrVuEQih5">Sparse Autoencoders Match Supervised Features for Model Steering on the IOI Task</a> (Spotlight)</b></li>
<li><b><a href="https://openreview.net/forum?id=TTVPbaxXjR">Faithful and Fast Influence Function via Advanced Sampling</a> (Spotlight)</b></li>
</ul>
</section>
</body>
</html>