Activation capture, linear probes, and concept steering for MLX language models on Apple Silicon.
Extract per-layer hidden states from any MLX transformer model, train probes to detect behavioral concepts, and intervene at the activation level during inference.
Core library (`mlx_interp/`):
- Capture: Architecture-agnostic hook system via `__class__` proxy swap — no model surgery (sketched below)
- Steering: Add, subtract, or reflect direction vectors in activation space during generation
- Analysis: Cosine similarity, norm tracking, statistical comparison between conditions
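For illustration, a minimal sketch of the proxy-swap idea; the real hook system in `mlx_interp/` differs in its details, and `attach_capture`, `detach_capture`, and `captured` are hypothetical names:

```python
# Minimal sketch of __class__ proxy-swap capture: subclass a layer's class
# at runtime so its __call__ records hidden states, then swap the instance's
# __class__ in place. The model's code and weights are never touched.
captured = {}

def attach_capture(layer, name):
    base = layer.__class__

    class Proxy(base):
        def __call__(self, *args, **kwargs):
            out = super().__call__(*args, **kwargs)
            # Some transformer blocks return tuples; hidden state comes first.
            captured[name] = out[0] if isinstance(out, tuple) else out
            return out

    layer.__class__ = Proxy

def detach_capture(layer):
    # Restore the original class (the proxy's immediate base).
    layer.__class__ = layer.__class__.__mro__[1]
```

Attach to any transformer block, run a forward pass, and that layer's output lands in `captured`; detaching restores the original class with no residue.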
All results come from `experiments/rigorous_suite.py`, with saved JSON outputs and fixed random seeds.
Bidirectional green/red swap at the activation level. Qwen2.5-7B-Instruct-4bit, 112 color questions + 25 controls (137 total).
| Metric | Value |
|---|---|
| Swap rate (alpha=1.5) | 91.1% (102/112 color questions) |
| Control contamination | 2/25 |
| Random direction baseline | 13.8% ± 4.8% (10 random directions) |
| Binomial p-value | 3.2e-75 |
The model answers "red" when asked about grass and "green" when asked about fire trucks. 2 of 25 control questions showed minor bleed.
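The steering step itself is simple; here is a sketch under assumed names (`green_acts` and `red_acts` are captured activation arrays per condition; the verified code lives in `experiments/rigorous_suite.py`):

```python
# Mean-difference concept direction plus the steering update (illustrative).
import mlx.core as mx

def concept_direction(green_acts, red_acts):
    # green_acts, red_acts: (n_prompts, hidden_dim) captured activations.
    v = green_acts.mean(axis=0) - red_acts.mean(axis=0)
    return v / mx.linalg.norm(v)

def steer(hidden, v, alpha=1.5):
    # Positive alpha pushes toward "green", negative toward "red";
    # reflecting (subtracting twice the projection onto v) flips the
    # component along v for the bidirectional swap.
    return hidden + alpha * v
```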
Replicates the core finding from *Sycophancy Is Not One Thing* (Sep 2025) on MLX. 30 prompts per condition, 5-fold cross-validated.
| Pair | Best Layer | CV Accuracy | Significant |
|---|---|---|---|
| SyA vs GA | 20 | 86.7% ± 11.3% | YES |
| SyA vs Neutral | 3 | 100% ± 0.0% | YES |
| GA vs Neutral | 3 | 100% ± 0.0% | YES |
| SyPr vs Neutral | 0 | 100% ± 0.0% | YES |
| SyA vs SyPr | 0 | 100% ± 0.0% | YES |
| GA vs SyPr | 1 | 100% ± 0.0% | YES |
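The accuracies above come from a per-layer search of the kind sketched here (assumed workflow; the actual script is `sycophancy_3way.py`):

```python
# Sketch: find the layer whose activations best separate two conditions
# with a 5-fold cross-validated linear probe.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def best_probe_layer(acts_a, acts_b, n_folds=5):
    """acts_a, acts_b: per-layer lists of (n_prompts, hidden_dim) arrays."""
    best_layer, best_acc = None, 0.0
    for layer, (a, b) in enumerate(zip(acts_a, acts_b)):
        X = np.concatenate([a, b])
        y = np.concatenate([np.zeros(len(a)), np.ones(len(b))])
        acc = cross_val_score(LogisticRegression(max_iter=1000),
                              X, y, cv=n_folds).mean()
        if acc > best_acc:
            best_layer, best_acc = layer, acc
    return best_layer, best_acc
```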
Disagreement with published result: SyA-Neutral and GA-Neutral directions share cosine 0.61 in mid-to-late layers — they are partially correlated, not orthogonal as the paper claims. This may be model-specific (Qwen 7B vs the models they tested) or method-specific (mean-difference direction extraction vs their mechanistic approach).
The initial sycophancy probe showed 100% accuracy. A diagnostic with 6 control conditions proved it detects emotional language, not sycophancy pressure:
| Condition | Has emotion? | Has pressure? | Classified as pressure |
|---|---|---|---|
| Direct pressure ("I spent weeks...") | Yes | Yes | 92% |
| Neutral question | No | No | 12% |
| Third-party emotion ("My friend spent weeks...") | Yes | No | 56% |
| Clinical pressure ("Favorable review required") | No | Yes | 0% |
| Irrelevant emotion ("Best birthday ever!") | Yes | No | 56% |
| Bare question | No | No | 8% |
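The per-condition rates follow from scoring each condition's activations with the one trained probe, along these lines (names are hypothetical):

```python
# Fraction of each condition's prompts that the fitted probe labels as
# "pressure" (class 1). condition_acts maps condition name to an
# (n_prompts, hidden_dim) activation array.
def pressure_rates(probe, condition_acts):
    return {name: float(probe.predict(X).mean())
            for name, X in condition_acts.items()}
```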
After retraining with sycophantic agreement (SyA) vs genuine agreement (GA) as the target, SyA-GA separation reaches 86.7% — detecting actual sycophancy, not surface emotional cues.
Last-token cosine comparison across 5 task domains. Bootstrap 95% CIs from 1000 resamples.
| Task | Rubric-Clean | Control-Clean | Diff (R-C) | Sig? |
|---|---|---|---|---|
| self-reflection | 0.595 | 0.224 | +0.371 | YES |
| technical-diagnosis | 0.624 | 0.246 | +0.378 | YES |
| ambiguous-directive | 0.650 | 0.286 | +0.364 | YES |
| ethical-tradeoff | 0.628 | 0.297 | +0.331 | YES |
| knowledge-boundary | 0.671 | 0.348 | +0.323 | YES |
Rubric text shifts the generation point significantly more than random padding. However, a wrong-domain rubric produces a nearly identical shift (~0.62 vs ~0.63), suggesting the effect tracks general semantic relatedness rather than rubric-specific contamination.
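A sketch of the metric and its bootstrap interval, assuming the per-prompt shift differences are already computed (1000 resamples, matching the description above):

```python
# Cosine similarity plus a bootstrap 95% CI for the mean rubric-minus-control
# shift (illustrative; not the exact suite code).
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def bootstrap_ci(diffs, n_boot=1000, seed=42):
    # diffs: array of per-prompt (rubric shift - control shift) values.
    rng = np.random.default_rng(seed)
    means = [rng.choice(diffs, size=len(diffs), replace=True).mean()
             for _ in range(n_boot)]
    return tuple(np.percentile(means, [2.5, 97.5]))
```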
All experiments live in `experiments/`. Run the full verified suite:

```bash
cd experiments
python3 rigorous_suite.py --models qwen2.5 --seed 42
python3 sycophancy_3way.py --n-prompts 30 --seed 42
```

| Experiment | Script | What it tests |
|---|---|---|
| Full rigorous suite | `rigorous_suite.py` | Color swap + sycophancy probe + contamination |
| 3-way sycophancy | `sycophancy_3way.py` | SyA vs GA vs SyPr decomposition |
| Color blind demo | `color_blind.py` | Interactive green/red swap |
| Interactive chat | `chat.py` | Chat with optional steering |
| Sycophancy diagnostic | `sycophancy_proper.py` | 6-condition confound analysis |
- Apple Silicon Mac (M1/M2/M3/M4)
- Python 3.10+
- mlx >= 0.20.0
- mlx-lm >= 0.20.0
- numpy, scikit-learn
```bash
pip install -r requirements.txt
```

Limitations:
- 7B model scale: All results are from Qwen2.5-7B. Larger models may show different patterns.
- Steering breaks at high alpha: Alpha > 2.0 causes token loops. Directions extracted via mean-difference are too crude for causal intervention.
- SyA-GA orthogonality: Our directions share cosine 0.61 (partially aligned), disagreeing with the published finding of orthogonality. Needs replication on other model families.
- Single hardware: All results from M2 Pro 16GB. Not tested on other Apple Silicon variants.
License: MIT
- [[wiki/experiments/mlx-interpretability]]