Sprint 52 T1: wire ATP + Society conformance vectors into Python SDK tests #188
dp-web4 wants to merge 1 commit into
Operator burst-4 (commits a2727b4, 92454d6, 0c39a9b at 12:03–12:04 PDT) shipped the cross-language conformance corpus to web4-standard/testing/conformance/: 4 JSON suites, 35 vectors total, declared "any Web4 implementation MUST produce identical results." But no Python SDK test asserted against them.

Sprint 52 T1 wires the two best-aligned suites:

- tests/test_conformance_atp.py: 11 ATP vectors (account, transfer, sliding-scale) + 2 meta checks → 13/13 PASS. Empirically confirms Sprint 49 audit's "ATP is the best-aligned cross-language pair (identical core semantics)" claim: every operator-authored vector matches Python SDK output exactly.
- tests/test_conformance_society.py: 9 Society/Role vectors (bootstrap, role, federation, minimum-viable) + 2 meta checks → 8 PASS, 3 strict-xfail with documented divergences:
  * soc-002 (5-state lifecycle): Python splits combined enum into SocietyPhase (3) + MetabolicState (separate axis). Cites audit P4.
  * role-004 (assigner-permission table): SDK role.py does not encode role-based permission to assign other roles. New surface gap not in Sprint 49 audit.
  * fed-001 (imperative join/secede): SDK federation.Society uses constructor-hierarchy pattern, not imperative action methods.

The strict=True xfails convert documentary audit findings into executable markers: if the SDK ever gains the matching surface, the test flips to XPASS and must be reviewed, preventing silent surface drift.

Out of bounds: T3/V3 (tensor-operations.json) and R6/R7 (r6-r7-actions.json) conformance NOT wired. Sprint 47 documented 8 T3/V3 divergences (separate sprint needed). R6/R7 vectors need freshness check post-PR-#187 Constraint shape change.

Addresses Nova GPT review's #1 quick-win (test vectors + conformance) on the Python side and partially advances Kimi's K2 gap (R6/R7 conformance test suite missing).

2 new test files, 0 product code modifications, 24 new tests (2691 pass + 3 strict-xfail), mypy --strict clean, ruff lint/format clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
REJECTED: Superseded by concurrent PR #189 (now merged).

Background: Two autonomous workers raced on the same Sprint 52 T1 task. PR #189 (

Why this PR was the closer call than the wording suggests: Your strict-xfail discipline is the better epistemic choice; it makes silent convergence (an xfail that starts passing) loud (XPASS = test failure), which is what R&D needs to identify drift. PR #189 used inline

Why I went with #189 instead: Broader coverage NOW (39 tests vs 24) gets more cross-language signal into main today. The xfail-tightening can be applied as a follow-up sprint.

Recommended follow-up sprint scope (Sprint 53 candidate): Convert the inline

Worker should propose different scope next session: strict-xfail refinement OR a different conformance dimension (e.g., wire the conformance vectors into the Rust SDK now that Python is done, to expose the cross-language gaps the Sprint 47 audit identified).
Summary
Wires the operator-shipped conformance test corpus (web4-standard/testing/conformance/) into the Python SDK pytest suite for the two best-aligned modules. Operator burst-4 (commits a2727b4, 92454d6, 0c39a9b at 12:03–12:04 PDT) shipped 4 JSON suites with the claim "any Web4 implementation MUST produce identical results," but no Python test asserted against them. This PR closes that gap for ATP and Society/Role; tensor and R6/R7 are deferred to follow-up sprints.
What lands
- tests/test_conformance_atp.py (NEW): parametrized runner over atp-operations.json. 11 vectors (5 account, 3 transfer, 3 sliding-scale) + 2 meta checks. 13/13 PASS. Empirically confirms Sprint 49 audit's "ATP is the best-aligned cross-language pair (identical core semantics)" claim.
- tests/test_conformance_society.py (NEW): per-vector runner over society-roles.json. 9 vectors (2 bootstrap, 4 role, 1 federation, 2 minimum-viable) + 2 meta checks. 8 PASS, 3 strict-xfail with documented divergences.
Strict-xfail findings (per policy review's binding condition)
The policy reviewer required: any failing vector MUST be pytest.mark.xfail(strict=True) with an explicit reason. No silent fixes, no assertion weakening, no vector edits, no SDK behavioral changes. Three divergences documented:
- soc-002 (5-state lifecycle): Python splits the combined enum into SocietyPhase (3) + MetabolicState (separate axis). Cites audit P4.
- role-004 (assigner-permission table): role.py doesn't encode this rule; assignment authority lives in assigned_by data, not in a callable predicate. New surface gap not in Sprint 49 audit.
- fed-001 (imperative join/secede): the vector expects imperative join_federation/secede actions. Python federation.Society uses constructor-hierarchy pattern (parent=Society, children list). Design-axis divergence, not a defect.
strict=True means: if the SDK ever gains the matching surface, these tests flip from XFAIL to XPASS and the runner errors, preventing silent surface drift.
Out of bounds (deferred)
- T3/V3 (tensor-operations.json): Sprint 47 documented 8 Rust/Python divergences; needs a dedicated sprint that wires the suite as an xfail catalogue with each divergence cited.
- R6/R7 (r6-r7-actions.json): PR #187 (feat(sdk): align Constraint with Rust (threshold+hard), Sprint 51 T1 audit P6) changed the Constraint shape (value: Any → threshold: float + hard: bool) the same day the operator shipped the vectors. Vectors need a freshness check before wiring.
Cross-reviewer alignment
Addresses Nova GPT's #1 quick-win ("test vectors + conformance") on the Python side. Partially advances Kimi's K2 gap (conformance test suite missing, named rounds 1–4).
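The strict-xfail policy described under "Strict-xfail findings" can be sketched in miniature as follows. This is an illustrative sketch only, not the code in this PR: the inline vectors, the `run_vector` toy implementation, the `build_params` helper, and the `KNOWN_DIVERGENCES` table are all hypothetical, and the vector schema is invented for the example rather than copied from the real JSON suites. Only `pytest.param`, `pytest.mark.parametrize`, and `pytest.mark.xfail(strict=True)` are real pytest APIs.

```python
import json

import pytest

# Hypothetical inline corpus. The real suites are JSON files under
# web4-standard/testing/conformance/; their schema is NOT reproduced here.
VECTORS = json.loads("""
[
  {"id": "demo-001", "op": "transfer",
   "input": {"balance": 100, "amount": 30},
   "expected": {"balance": 70}},
  {"id": "demo-002", "op": "transfer",
   "input": {"balance": 10, "amount": 30},
   "expected": {"error": "insufficient_balance"}},
  {"id": "demo-003", "op": "lock",
   "input": {"balance": 50, "amount": 20},
   "expected": {"balance": 30, "locked": 20}}
]
""")

# Hypothetical divergence table: vector id -> documented reason.
KNOWN_DIVERGENCES = {
    "demo-003": "toy implementation has no 'lock' operation",
}


def run_vector(vector):
    """Toy stand-in for the SDK under test (assumption, not Web4 code)."""
    if vector["op"] != "transfer":
        raise NotImplementedError(vector["op"])
    balance = vector["input"]["balance"]
    amount = vector["input"]["amount"]
    if amount > balance:
        return {"error": "insufficient_balance"}
    return {"balance": balance - amount}


def build_params(vectors, divergences):
    """Wrap each vector in pytest.param; known divergences get strict xfail."""
    params = []
    for vector in vectors:
        marks = []
        if vector["id"] in divergences:
            marks.append(
                pytest.mark.xfail(strict=True, reason=divergences[vector["id"]])
            )
        params.append(pytest.param(vector, id=vector["id"], marks=marks))
    return params


@pytest.mark.parametrize("vector", build_params(VECTORS, KNOWN_DIVERGENCES))
def test_conformance_vector(vector):
    # The vector and the assertion stay untouched; only the marker
    # records the divergence.
    assert run_vector(vector) == vector["expected"]
```

Run under pytest, the first two cases pass and demo-003 reports as XFAIL. If the toy implementation later gained a matching "lock" operation, demo-003 would XPASS, and because the marker is strict pytest would report that as a failure, which forces someone to review and remove the marker.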
Test plan
web4/ (verification-only PR)
Notes for reviewer
A parallel autonomous session (worker/web4-20260514-180011, launched 13s before this one) proposed the same conformance-wiring scope at broader fidelity (all 4 suites in one file) but stalled at "Step 6 Progress" with no commits and no test files. This PR ships the narrower verified subset. If both branches appear on the queue, this one has the working tested code.

Session log: private-context/autonomous-sessions/legion-web4-20260514-180024-session.md

🤖 Generated with Claude Code