Merged
3 changes: 3 additions & 0 deletions arbiter_audit.jsonl
@@ -0,0 +1,3 @@
{"timestamp": "2026-04-19T15:08:34.257267+00:00", "repo": "arbiter", "score": 89.0, "grade": "CERTIFIED", "findings": 51, "loc": 17379, "dimensions": {"code": 94.7, "governance": 80.5, "dependencies": 100.0, "vitality": 75.0}, "record_hash": "43bce85687ac64eb4c3ff9a4464327896c1770036528920fbb14bad965d382ce", "prev_hash": ""}
{"timestamp": "2026-04-19T15:33:53.324807+00:00", "repo": "agent-governance-demo", "score": 76.5, "grade": "PROVISIONAL", "findings": 5, "loc": 997, "dimensions": {"code": 91.6, "governance": 57.2, "dependencies": 100.0, "vitality": 40.0}, "record_hash": "f7ab4e514711127ed28530f36330d1dc8773414174254c990c1efdb18a238987", "prev_hash": "43bce85687ac64eb4c3ff9a4464327896c1770036528920fbb14bad965d382ce"}
{"timestamp": "2026-04-19T15:34:12.175720+00:00", "repo": "agent-governance-demo", "score": 86.8, "grade": "CERTIFIED", "findings": 1, "loc": 994, "dimensions": {"code": 98.6, "governance": 85.9, "dependencies": 100.0, "vitality": 40.0}, "record_hash": "30f6bf0187a0ebf32cf1f9fda6a1769cb3da7a254f83699902f5021754c22729", "prev_hash": "f7ab4e514711127ed28530f36330d1dc8773414174254c990c1efdb18a238987"}
220 changes: 220 additions & 0 deletions docs/GEMINI_HANDOFF.md
@@ -0,0 +1,220 @@
---
packet-version: 1.0
from: claude-code
to: gemini
type: DISPATCH
task-id: governance-beyond-artifacts
priority: HIGH
execution-mode: side_effecting
authorized-by: human (Reuben, 2026-04-19)
---

## Context

Arbiter is a deterministic code quality + governance scoring CLI at `/Users/others/PROJECTS/arbiter/`.
Install: `pip install -e ".[analyzers]"` from repo root.
Run tests: `PYTHONPATH=src python -m pytest tests/ -v`

We ran Arbiter against 201 open-source repos and published the results. An ARCANA peer review
(7 analytical lenses) identified a structural weakness: the governance scorer measures artifact
*presence*, not governance *practice*. This is the Goodhart/Scott problem — file-presence checks
are trivially gameable and miss informal governance that actually works.

This handoff authorizes Gemini to build two new scoring modules that move Arbiter toward
measuring practice, not just artifacts.

## Finding

**Current governance scorer** (`src/arbiter/governance_score.py`): 10 binary file-presence checks.
`(repo_path / "SECURITY.md").exists()` → 15 points. No content analysis. No history analysis.
Same score whether SECURITY.md says "email us" or describes a funded bug bounty with SLA.

**Structural gap 1 — Content quality**: Files exist but quality is unmeasured.
**Structural gap 2 — Temporal/vitality**: Point-in-time snapshot; gameable by adding files today.

**Existing foundation**: `src/arbiter/git_historian.py` already walks git log via subprocess
(stdlib only, no git library). `walk_commits()` returns `CommitInfo` with hash, author, timestamp,
files_changed, loc_added, loc_removed. Gemini builds ON this, not from scratch.

## Recommended Action

### Sprint 1: Governance Quality Scorer (file content analysis)

**Create**: `src/arbiter/governance_quality.py`

Score the *content* of governance files, not just their existence. Pure local filesystem reads,
stdlib only, no network.

Scoring targets (all heuristic/regex, not NLP):

**SECURITY.md quality** (0–15 pts):
- Has a contact method (email, URL, form) → +5
- Mentions a response timeline ("within 48 hours", "5 business days") → +5
- Has a disclosure process (public vs private, CVE process) → +5
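A minimal sketch of how the three SECURITY.md checks could be expressed with `re` alone. The pattern names, the function name `score_security_text`, and the regexes themselves are illustrative assumptions, not the shipped scorer; real files will need broader patterns.

```python
import re

# Hypothetical patterns for illustration only.
CONTACT_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+|https?://\S+", re.I)
TIMELINE_RE = re.compile(
    r"\b\d+\s+(?:business\s+|working\s+)?(?:hours?|days?|weeks?)\b", re.I
)
DISCLOSURE_RE = re.compile(r"\b(?:disclos\w+|CVE|embargo|privately)\b", re.I)

CHECKS = (
    ("contact method", CONTACT_RE),
    ("response timeline", TIMELINE_RE),
    ("disclosure process", DISCLOSURE_RE),
)


def score_security_text(text: str) -> tuple[int, list[str]]:
    """Return (points out of 15, human-readable findings) for SECURITY.md text."""
    score = 0
    details = []
    for label, pattern in CHECKS:
        if pattern.search(text):
            score += 5
            details.append(f"SECURITY.md: {label} found")
        else:
            details.append(f"SECURITY.md: {label} missing")
    return score, details
```

Under these patterns a file that only says "email us" scores 0, while one with an address, a response window, and a disclosure policy scores 15. A regex this loose will produce false positives (any "3 days" counts as a timeline), which is the accepted trade-off of a heuristic, non-NLP approach.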

**CONTRIBUTING.md quality** (0–15 pts):
- Describes how to run tests → +5
- Describes PR/review process → +5
- Has a code style or linting section → +5

**CI workflow quality** (0–15 pts) — parse `.github/workflows/*.yml`:
- Runs on PR (not just push to main) → +5
- Has more than one job (test + lint, or matrix) → +5
- References a coverage or test command → +5
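Because the stdlib has no YAML parser, the workflow checks have to work on raw text. The sketch below is one way to do that under stated assumptions: `count_jobs`, `score_workflow_text`, and the keyword list are hypothetical names and choices, and the indentation-based job count assumes conventionally formatted block-style YAML.

```python
import re

# Hypothetical keyword list; extend for the ecosystems being scored.
TEST_CMD_RE = re.compile(
    r"\b(?:pytest|coverage|codecov|tox|npm test|cargo test|go test)\b", re.I
)
PR_TRIGGER_RE = re.compile(r"\bpull_request(?:_target)?\b")
MATRIX_RE = re.compile(r"^\s*matrix\s*:", re.M)


def count_jobs(text: str) -> int:
    """Count keys nested directly under a top-level `jobs:` mapping."""
    lines = text.splitlines()
    start = next(
        (i for i, line in enumerate(lines) if re.match(r"jobs\s*:", line)), None
    )
    if start is None:
        return 0
    job_indent = None
    count = 0
    for line in lines[start + 1:]:
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        if indent == 0:
            break  # reached the next top-level key
        if job_indent is None:
            job_indent = indent  # first nested line sets the job-level indent
        if indent == job_indent and re.match(r"\s*[\w-]+\s*:", line):
            count += 1
    return count


def score_workflow_text(text: str) -> int:
    """Return 0-15 points for one workflow file's text."""
    # Drop full-line comments so commented-out triggers do not count.
    body = "\n".join(
        line for line in text.splitlines() if not line.lstrip().startswith("#")
    )
    score = 0
    if PR_TRIGGER_RE.search(body):
        score += 5
    if count_jobs(body) > 1 or MATRIX_RE.search(body):
        score += 5
    if TEST_CMD_RE.search(body):
        score += 5
    return score
```

How scores from several workflow files combine (best file, average, or union of signals) is not specified above and is left open here.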

**README quality** (0–10 pts):
- Length > 500 chars (already partially checked in governance_score.py) → baseline, no additional points
- Has installation instructions (pip install, npm install, cargo add) → +5
- Has usage example or code block → +5

**Output dataclass**:
```python
from dataclasses import dataclass


@dataclass
class GovernanceQualityReport:
    security_score: float      # 0-15
    contributing_score: float  # 0-15
    ci_quality_score: float    # 0-15
    readme_score: float        # 0-10
    total: float               # 0-55 raw, normalized to 0-100
    details: list[str]         # human-readable findings
```

**Integration point**: `governance_score.py` calls `score_governance_quality(repo_path)` and
blends it into the governance dimension. Suggested weighting within governance:
- Artifacts sub-score (current 10 checks): 50%
- Quality sub-score (new): 50%
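The normalization and the 50/50 blend are plain arithmetic, sketched below. The function names are hypothetical, and the sketch assumes the existing artifact sub-score is already on a 0-100 scale.

```python
def normalize_quality(raw_points: float, max_points: float = 55.0) -> float:
    """Map the raw 0-55 quality total onto 0-100."""
    return round(100.0 * raw_points / max_points, 1)


def blend_governance(
    artifact_score: float, quality_score: float, quality_weight: float = 0.5
) -> float:
    """Blend two 0-100 sub-scores; quality_weight=0.5 gives the suggested 50/50 split."""
    blended = (1.0 - quality_weight) * artifact_score + quality_weight * quality_score
    return round(blended, 1)
```

For example, a repo with every artifact present (100) but only half the quality points (27.5 of 55, normalized to 50) would land at 75 on the governance dimension rather than 100.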

### Sprint 2: Git Vitality Scorer (history-based governance signals)

**Create**: `src/arbiter/git_vitality.py`

Use the existing `git_historian.walk_commits()` to extract governance-relevant signals from
commit history. Addresses the Goodhart vulnerability: a repo that added all governance files
last week scores differently from one that's had them for 3 years with active contributors.

**Signals to compute**:

**Bus factor** (0–25 pts): count unique committers in last 90 days
- 1 committer → 5 pts (high concentration risk)
- 2–3 committers → 15 pts
- 4+ committers → 25 pts
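A sketch of the bus-factor tiers. `CommitStub` is a hypothetical stand-in for `CommitInfo` (whose exact field types are not shown here); the sketch assumes timezone-aware `datetime` timestamps, normalizes author strings by case and whitespace, and awards 0 points when nobody committed in the window, a case the tiers above do not cover.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class CommitStub:
    """Hypothetical stand-in for git_historian.CommitInfo (subset of fields)."""
    author: str
    timestamp: datetime


def bus_factor_points(commits, now: datetime, window_days: int = 90) -> tuple[int, int]:
    """Return (unique committers in window, points out of 25)."""
    cutoff = now - timedelta(days=window_days)
    authors = {c.author.strip().lower() for c in commits if c.timestamp >= cutoff}
    count = len(authors)
    if count >= 4:
        points = 25
    elif count >= 2:
        points = 15
    elif count == 1:
        points = 5
    else:
        points = 0  # assumption: no committers in the window earns nothing
    return count, points
```

Passing `now` explicitly, instead of calling `datetime.now()` inside the function, keeps the score deterministic and easy to test.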

**Commit recency** (0–25 pts): days since last commit
- 0–30 days → 25 pts
- 31–90 days → 15 pts
- 91–180 days → 8 pts
- 181+ days → 0 pts (effectively unmaintained)
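The recency tiers reduce to a small lookup table. In this sketch the names `RECENCY_TIERS`, `days_since`, and `recency_points` are hypothetical, each upper bound is treated as inclusive (so day 180 still earns 8 points), and a commit timestamp in the future is clamped to 0 days.

```python
from datetime import datetime, timezone

# (maximum days since last commit, inclusive; points awarded)
RECENCY_TIERS = ((30, 25), (90, 15), (180, 8))


def days_since(last_commit: datetime, now: datetime) -> int:
    """Whole days between the last commit and `now`, never negative."""
    return max(0, (now - last_commit).days)


def recency_points(days_since_commit: int) -> int:
    """Return points out of 25 for the given commit age in days."""
    for max_days, points in RECENCY_TIERS:
        if days_since_commit <= max_days:
            return points
    return 0  # older than the last tier: effectively unmaintained
```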

**Release cadence** (0–25 pts): call `git tag --sort=-creatordate` via subprocess
- Has ≥ 1 tag → 10 pts
- Has ≥ 3 tags → 20 pts
- Tags follow SemVer pattern → +5 pts
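A sketch of the release-cadence signal. Two things here are assumptions rather than part of the spec: the SemVer bonus is granted when more than half of the tags match, and the simplified pattern tolerates a leading `v`, which strict SemVer does not define. `list_tags` and `release_points` are hypothetical names.

```python
import re
import subprocess

# Simplified SemVer pattern (assumption: an optional leading "v" is accepted).
SEMVER_RE = re.compile(
    r"^v?(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$"
)


def list_tags(repo_path: str) -> list[str]:
    """Return tag names, newest first; empty list if git fails or is missing."""
    try:
        result = subprocess.run(
            ["git", "-C", repo_path, "tag", "--sort=-creatordate"],
            capture_output=True, text=True, check=False,
        )
    except OSError:
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def release_points(tags: list[str]) -> int:
    """Return points out of 25 for a list of tag names."""
    count = len(tags)
    if count >= 3:
        points = 20
    elif count >= 1:
        points = 10
    else:
        points = 0
    semver_matches = sum(1 for tag in tags if SEMVER_RE.match(tag))
    if count and semver_matches * 2 > count:  # assumption: majority rule
        points += 5
    return points
```

Keeping the subprocess call separate from the scoring function lets the tests exercise the tiers without a real repository.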

**Signed commit ratio** (0–25 pts): fraction of commits whose message carries a `Signed-off-by:` trailer (DCO sign-off, not a cryptographic signature)
- >75% → 25 pts (DCO genuinely enforced)
- 25–75% → 15 pts
- >0% and <25% → 5 pts
- 0% → 0 pts (DCO artifact exists but nothing is actually signed off)
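A sketch of the sign-off ratio and its tiers. The `CommitInfo` fields listed earlier do not include the commit message, so this sketch takes messages as plain strings and leaves open how they are collected. `SIGNOFF_RE`, `signoff_ratio`, and `signoff_points` are hypothetical names.

```python
import re

# Matches a DCO trailer line such as "Signed-off-by: Jane Doe <jane@example.com>".
SIGNOFF_RE = re.compile(r"^Signed-off-by:\s+\S.*<[^<>\s]+@[^<>\s]+>\s*$", re.M)


def signoff_ratio(messages: list[str]) -> float:
    """Fraction (0.0-1.0) of commit messages that carry a sign-off trailer."""
    if not messages:
        return 0.0
    signed = sum(1 for message in messages if SIGNOFF_RE.search(message))
    return signed / len(messages)


def signoff_points(ratio: float) -> int:
    """Return points out of 25 for a sign-off ratio."""
    if ratio > 0.75:
        return 25
    if ratio >= 0.25:
        return 15
    if ratio > 0.0:
        return 5
    return 0
```

Checking the 0% case last and explicitly keeps the "below 25%" tier from swallowing repos where nothing is signed off at all.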

**Output dataclass**:
```python
from dataclasses import dataclass


@dataclass
class GitVitalityReport:
    bus_factor: int             # unique committers, 90 days
    days_since_commit: int
    release_count: int
    signed_commit_ratio: float  # 0.0–1.0
    score: float                # 0–100
    details: list[str]
```

**Integration point**: Add `git_vitality` as an optional 4th scoring dimension in `scoring.py`.
Weight suggestion when vitality is available: Code (45%) + Governance (25%) + Deps (15%) + Vitality (15%).
When git history is unavailable (shallow clone or no commits), fall back to the existing 3-dimension weights.
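A sketch of the weighted combination with the fallback path. The existing 3-dimension weights are not stated in this packet, so the caller supplies them as `fallback_weights`. `VITALITY_WEIGHTS` and `overall_score` are hypothetical names.

```python
# Suggested weights when the vitality dimension is available.
VITALITY_WEIGHTS = {
    "code": 0.45,
    "governance": 0.25,
    "dependencies": 0.15,
    "vitality": 0.15,
}


def overall_score(dimensions: dict, fallback_weights: dict) -> float:
    """Combine 0-100 dimension scores into one 0-100 score.

    `dimensions["vitality"]` may be None (or absent) when git history is
    unavailable; the caller-supplied 3-dimension weights are used in that case.
    """
    if dimensions.get("vitality") is None:
        weights = fallback_weights
    else:
        weights = VITALITY_WEIGHTS
    return round(sum(dimensions[name] * weight for name, weight in weights.items()), 1)
```

As a sanity check, the first record in `arbiter_audit.jsonl` (code 94.7, governance 80.5, dependencies 100.0, vitality 75.0) comes out at 89.0 under these weights, matching its recorded score.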

## File Map

```
src/arbiter/
  governance_score.py          # MODIFY: call quality scorer, blend into governance dim
  governance_quality.py        # CREATE: Sprint 1
  git_vitality.py              # CREATE: Sprint 2
  scoring.py                   # MODIFY: add vitality dimension (optional)

tests/
  test_governance_quality.py   # CREATE: Sprint 1 tests
  test_git_vitality.py         # CREATE: Sprint 2 tests
```

## Evidence

- ARCANA review findings: Scott lens (metis erasure), Measurement lens (Goodhart HIGH), Foucault lens (artifact-vs-practice)
- `src/arbiter/governance_score.py` lines 62–227: all checks are `Path.exists()` booleans
- `src/arbiter/git_historian.py`: existing walk_commits() foundation for Sprint 2
- `src/arbiter/dep_score.py`: reference pattern for dataclass + scoring function structure

## Tests to Add

Sprint 1 (governance_quality.py):
- `test_security_md_with_contact_scores_higher_than_empty`
- `test_contributing_md_with_test_instructions_gets_full_marks`
- `test_ci_workflow_pr_trigger_detected`
- `test_missing_files_score_zero_not_error`
- `test_quality_score_normalized_to_100`

Sprint 2 (git_vitality.py):
- `test_single_committer_scores_low_bus_factor`
- `test_recent_commit_scores_max_recency`
- `test_semver_tags_detected`
- `test_signed_commit_ratio_computed`
- `test_shallow_clone_degrades_gracefully` (no git history → score=None, not crash)

## Verification Criteria

- All new tests pass: `PYTHONPATH=src python -m pytest tests/test_governance_quality.py tests/test_git_vitality.py -v`
- Full test suite green: `PYTHONPATH=src python -m pytest tests/ -v`
- Self-grade passes: `arbiter score . --fail-under 85`
- No new third-party imports (stdlib + existing arbiter deps only)
- Both new modules have module-level docstrings explaining what they measure vs. what they don't

## Constraints

- **Stdlib only**: no new third-party imports. Stick to stdlib modules such as `re`, `pathlib`, `subprocess`, `dataclasses`, and `datetime`.
- **Branch**: `feat/gemini/governance-beyond-artifacts`
- **Bus identity**: `gemini` (no variants, no parentheticals)
- **Commit format**: Conventional Commits (`feat:`, `test:`, `fix:`)
- **Soft limit**: 500 LOC / 10 files per PR
- **TDD**: write failing tests first, then implement
- **Closeout packet required** — final STATUS must include: artifact paths, test count delta,
self-grade score before/after, open questions deferred, caveats
- **No DESIGN.md unless explicitly requested**
- **No modifications to**: `.github/`, `.claude/`, `docs/blog/`, `docs/CERTIFICATION_REPORT.md`

## Grading Criteria (Claude will audit this PR)

Gemini will be graded on:
1. Correct file paths (all under `src/arbiter/` and `tests/`)
2. TDD discipline (tests written before implementation, or simultaneous)
3. Stdlib-only compliance (no new third-party imports)
4. Graceful degradation (missing files → score 0; shallow clone or no git history → score=None; never a crash)
5. Closeout packet completeness
6. Self-grade score maintained above 85

## Session Start Protocol (mandatory first 5 commands)

```bash
# 1. Confirm correct repo
ls /Users/others/PROJECTS/arbiter/src/arbiter/

# 2. Confirm worktree state
git -C /Users/others/PROJECTS/arbiter status --short

# 3. Create branch
git -C /Users/others/PROJECTS/arbiter checkout -b feat/gemini/governance-beyond-artifacts

# 4. Confirm working directory
pwd

# 5. Run existing tests to establish baseline
cd /Users/others/PROJECTS/arbiter && PYTHONPATH=src python -m pytest tests/ -q --tb=no
```
166 changes: 166 additions & 0 deletions docs/GEMINI_RESUME_2026-04-19.md
@@ -0,0 +1,166 @@
---
packet-version: 1.0
from: claude-code
to: gemini
type: FOLLOW-UP
task-id: governance-beyond-artifacts
priority: HIGH
execution-mode: side_effecting
authorized-by: human (Reuben, 2026-04-19)
---

## Context

Sprint 1+2 files already exist on branch `feat/gemini/governance-beyond-artifacts` from a prior
partial session. You are resuming, not starting. All 686 tests pass. Nothing is committed yet.

## Current State (verified 2026-04-19 ~16:00 ET)

**Untracked (Sprint 1+2 deliverables — your work):**
```
src/arbiter/governance_quality.py 183 lines
src/arbiter/git_vitality.py 144 lines
tests/test_governance_quality.py 76 lines
tests/test_git_vitality.py 73 lines
```

**Modified unstaged (integration work — your work):**
```
src/arbiter/__main__.py +12 / -2
src/arbiter/certify.py +23 / -3
src/arbiter/governance_score.py +23 / -4
tests/test_certify.py +8
tests/test_governance_score.py +20 / -7
```

**Test result**: 686 passed, 0 failed, 0 errors (56s).

## Required Actions

### Step 1 — Session start protocol (mandatory)

```bash
# 1. Confirm branch
git -C /Users/others/PROJECTS/arbiter branch --show-current
# Expected: feat/gemini/governance-beyond-artifacts

# 2. Confirm state
git -C /Users/others/PROJECTS/arbiter status --short

# 3. Run baseline tests
cd /Users/others/PROJECTS/arbiter && PYTHONPATH=src python -m pytest tests/ -q --tb=no
# Expected: 686 passed

# 4. Bus post
# gemini → all STATUS "Resuming governance-beyond-artifacts: 686 tests green, starting self-grade + PR"
```

### Step 2 — Self-grade

```bash
cd /Users/others/PROJECTS/arbiter && arbiter score . 2>/dev/null || \
PYTHONPATH=src python -m arbiter score /Users/others/PROJECTS/arbiter
```

Record the score before and after your changes. It must be ≥ 85 to pass the audit.

### Step 3 — Review integration diffs

Before committing, verify the 5 modified files are correct:

```bash
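git -C /Users/others/PROJECTS/arbiter diff src/arbiter/governance_score.py
git -C /Users/others/PROJECTS/arbiter diff src/arbiter/__main__.py
git -C /Users/others/PROJECTS/arbiter diff src/arbiter/certify.py
git -C /Users/others/PROJECTS/arbiter diff tests/test_certify.py
git -C /Users/others/PROJECTS/arbiter diff tests/test_governance_score.py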
```

Verify:
- `governance_score.py` calls `score_governance_quality()` and blends result (50/50 artifacts/quality)
- `__main__.py` exposes the new dimensions in CLI output
- `certify.py` includes vitality dimension when git history is available
- No new third-party imports in any file (`import re`, `import pathlib`, `import subprocess`, `import dataclasses` are all fine)

### Step 4 — Commit (two-commit strategy)

```bash
cd /Users/others/PROJECTS/arbiter

# Commit 1: Sprint 1 — governance quality scorer
git add src/arbiter/governance_quality.py tests/test_governance_quality.py \
src/arbiter/governance_score.py tests/test_governance_score.py
git commit -m "feat(governance): add content-quality scoring to governance dimension

Adds governance_quality.py that scores SECURITY.md, CONTRIBUTING.md,
CI workflows, and README content via regex (not just existence checks).
Addresses Goodhart/Scott finding from ARCANA peer review: file presence
was trivially gameable; content heuristics are not.

Blends quality sub-score (50%) with artifact sub-score (50%) in governance_score.py."

# Commit 2: Sprint 2 — git vitality scorer
git add src/arbiter/git_vitality.py tests/test_git_vitality.py \
src/arbiter/certify.py src/arbiter/__main__.py tests/test_certify.py
git commit -m "feat(vitality): add git history vitality dimension to scoring

Adds git_vitality.py that scores bus factor, commit recency, release
cadence, and signed-commit ratio from git log. Addresses temporal
Goodhart vulnerability: repos cannot game history by adding files today.

Vitality is an optional 4th dimension (15% weight) when git history
is available. Degrades gracefully on shallow clones."
```

### Step 5 — Push and open PR

```bash
git -C /Users/others/PROJECTS/arbiter push origin feat/gemini/governance-beyond-artifacts
gh pr create \
  --repo "$(git -C /Users/others/PROJECTS/arbiter remote get-url origin | sed -E 's#.*github\.com[:/]##; s#\.git$##')" \
  --title "feat(arbiter): governance quality + git vitality scoring (Sprint 1+2)" \
  --body "..."
```

PR body must include:
- What changed and why (ARCANA peer review finding)
- Test count before/after
- Self-grade before/after
- Closeout packet (see Step 6)

### Step 6 — Closeout packet (mandatory, post to bus)

Bus post format:
```
gemini → all STATUS "governance-beyond-artifacts COMPLETE.
ARTIFACTS: governance_quality.py (183L), git_vitality.py (144L), 4 test files.
TESTS: 686 total (10 new), 0 failures.
SELF-GRADE: before=<N>, after=<N>.
PR: <url>.
OPEN: <any deferred questions>.
CAVEATS: <any uncertainty>.
SOURCES: ARCANA peer review (Scott/Measurement/Foucault lenses), git_historian.py foundation."
```

## Verification Criteria (Claude audit checklist)

- [ ] `PYTHONPATH=src python -m pytest tests/ -v` → all 686+ pass
- [ ] `git diff --name-only HEAD` shows only expected files
- [ ] `governance_quality.py` has no `import requests` / `import numpy` / any non-stdlib
- [ ] `git_vitality.py` uses only `subprocess`, `re`, `dataclasses`, `datetime`
- [ ] `governance_score.py` blends quality 50/50 (not replaces)
- [ ] Self-grade ≥ 85
- [ ] Closeout packet posted to bus
- [ ] PR title follows Conventional Commits
- [ ] No modifications to `.github/`, `.claude/`, `docs/blog/`, `docs/CERTIFICATION_REPORT.md`

## Constraints

- **Stdlib only** — no new third-party imports
- **Branch**: `feat/gemini/governance-beyond-artifacts` (already checked out)
- **Bus identity**: `gemini` (no parentheticals)
- **GEMINI_SESSION=true** — scope gate enforces this
- Do NOT modify `docs/CERTIFICATION_REPORT.md`, `docs/blog/`, `.github/`, `.claude/`

---

*This packet supersedes the dispatch spec at `docs/GEMINI_HANDOFF.md` for current session state.
GEMINI_HANDOFF.md remains authoritative for Sprint 1+2 requirements.*