
Sprint 52 T1: wire ATP + Society conformance vectors into Python SDK tests#188

Closed
dp-web4 wants to merge 1 commit into main from worker/web4-20260514-180024

Conversation

Owner

@dp-web4 dp-web4 commented May 15, 2026

Summary

Wires the operator-shipped conformance test corpus (web4-standard/testing/conformance/) into the Python SDK pytest suite for the two best-aligned modules. Operator burst-4 (commits a2727b4, 92454d6, 0c39a9b at 12:03–12:04 PDT) shipped 4 JSON suites with the claim "any Web4 implementation MUST produce identical results," but no Python test asserted against them. This PR closes that gap for ATP and Society/Role; the tensor and R6/R7 suites are deferred to follow-up sprints.

What lands

  • tests/test_conformance_atp.py (NEW): parametrized runner over atp-operations.json — 11 vectors (5 account, 3 transfer, 3 sliding-scale) + 2 meta checks. 13/13 PASS. Empirically confirms Sprint 49 audit's "ATP is the best-aligned cross-language pair (identical core semantics)" claim.
  • tests/test_conformance_society.py (NEW): per-vector runner over society-roles.json — 9 vectors (2 bootstrap, 4 role, 1 federation, 2 minimum-viable) + 2 meta checks. 8 PASS, 3 strict-xfail with documented divergences.
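
As a rough illustration of the runner pattern described above (one pytest case per JSON vector, plus a meta check), here is a minimal sketch. The suite layout, the field names (`vectors`, `id`, `input`, `expected`), and the `transfer` function are all assumptions made for this example; the real atp-operations.json schema and the SDK calls are not shown in this PR. The suite is inlined as a string to keep the sketch self-contained, where a real runner would read the file from web4-standard/testing/conformance/.

```python
import json

import pytest

# Illustrative suite only: the real schema and values are not shown in
# this PR, so every field name below is an assumption.
SUITE_JSON = """
{
  "suite": "demo-transfer",
  "vectors": [
    {"id": "demo-001", "input": {"balance": 100.0, "amount": 30.0},
     "expected": {"balance": 70.0}},
    {"id": "demo-002", "input": {"balance": 50.0, "amount": 50.0},
     "expected": {"balance": 0.0}}
  ]
}
"""


def load_vectors(raw: str) -> list[dict]:
    """Parse a suite document and return its list of vectors."""
    return json.loads(raw)["vectors"]


def transfer(balance: float, amount: float) -> float:
    """Hypothetical stand-in for the SDK call under test."""
    if amount > balance:
        raise ValueError("insufficient balance")
    return balance - amount


VECTORS = load_vectors(SUITE_JSON)


@pytest.mark.parametrize("vector", VECTORS, ids=[v["id"] for v in VECTORS])
def test_vector(vector: dict) -> None:
    # One pytest case per vector, named after the vector id.
    result = transfer(**vector["input"])
    assert result == pytest.approx(vector["expected"]["balance"])


def test_meta_ids_unique() -> None:
    # Meta check: ids must be unique so a failure maps to exactly one vector.
    ids = [v["id"] for v in VECTORS]
    assert len(ids) == len(set(ids))
```

Passing `ids=` makes each failure report the vector id rather than a positional index, which keeps pytest output traceable back to the JSON file.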

Strict-xfail findings (per policy review's binding condition)

The policy reviewer required: any failing vector MUST be pytest.mark.xfail(strict=True) with an explicit reason — no silent fixes, no assertion weakening, no vector edits, no SDK behavioral changes. Three divergences documented:

  1. soc-002 (5-state lifecycle): vector unifies society phase + metabolic state into one 5-state enum (genesis/bootstrap/operational/dormant/sunset). Python splits these (SocietyPhase 3-state + separate MetabolicState axis). Cites audit P4 (MetabolicState reconciliation — needs operator decision).
  2. role-004 (assigner-permission table): vector expects a role-based predicate (only sovereign/administrator may assign roles). Python role.py doesn't encode this rule — assignment authority lives in assigned_by data, not in a callable predicate. New surface gap not in Sprint 49 audit.
  3. fed-001 (imperative federation lifecycle): vector expects join_federation/secede actions. Python federation.Society uses constructor-hierarchy pattern (parent=Society, children list). Design-axis divergence, not a defect.
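
To make the role-004 gap concrete, the sketch below contrasts the two shapes. Everything in it is hypothetical and written for this comment: `can_assign_roles`, `ASSIGNER_ROLES`, and `RoleAssignment` are not names from role.py. Only the rule itself (sovereign/administrator may assign) and the existence of `assigned_by` data come from the description above.

```python
from dataclasses import dataclass

# The two role names come from the vector's rule as described above;
# the constant and function names are hypothetical.
ASSIGNER_ROLES = frozenset({"sovereign", "administrator"})


def can_assign_roles(assigner_role: str) -> bool:
    """Callable predicate of the kind the vector expects."""
    return assigner_role in ASSIGNER_ROLES


@dataclass(frozen=True)
class RoleAssignment:
    """Data-only shape: the assigner is recorded but never validated."""

    role: str
    holder: str
    assigned_by: str


# With the data-only shape, nothing rejects an assignment by a non-assigner.
record = RoleAssignment(role="auditor", holder="alice", assigned_by="citizen-bob")
```

The predicate answers "may this role assign?" before the fact, while the record only states who did assign after the fact, which is why the vector cannot be satisfied without new surface.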

strict=True means that if the SDK ever gains the matching surface, these tests flip from XFAIL to XPASS, which pytest reports as a failure in strict mode, preventing silent surface drift.
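
One way to attach strict xfail marks per vector is through `pytest.param`, sketched below. The divergence catalogue, the vector ids, and the `SUPPORTED_STATES` set are assumptions for illustration: the PR does not say which three phases SocietyPhase has, so the set here is a placeholder, not the SDK's definition.

```python
import pytest

# Hypothetical catalogue mapping vector id to a documented divergence reason.
KNOWN_DIVERGENCES = {
    "demo-002": "SDK models metabolic state on a separate axis (illustrative)",
}

# Placeholder: the SDK's actual phase names are not given in this PR.
SUPPORTED_STATES = {"genesis", "bootstrap", "operational"}

DEMO_VECTORS = [
    {"id": "demo-001", "state": "operational"},
    {"id": "demo-002", "state": "dormant"},
]


def as_params(vectors: list[dict]) -> list:
    """Wrap each vector in pytest.param, adding a strict xfail mark
    when the vector id appears in the divergence catalogue."""
    params = []
    for vector in vectors:
        reason = KNOWN_DIVERGENCES.get(vector["id"])
        marks = [pytest.mark.xfail(strict=True, reason=reason)] if reason else []
        params.append(pytest.param(vector, id=vector["id"], marks=marks))
    return params


@pytest.mark.parametrize("vector", as_params(DEMO_VECTORS))
def test_state_supported(vector: dict) -> None:
    # demo-002 fails here and is reported as XFAIL. If the set ever grows
    # to include "dormant", the case becomes XPASS(strict), i.e. a failure.
    assert vector["state"] in SUPPORTED_STATES
```

Keeping the reasons in one catalogue means the test body stays identical for passing and diverging vectors, so no assertion is weakened to accommodate a known gap.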

Out of bounds (deferred)

  • T3/V3 conformance (tensor-operations.json): Sprint 47 documented 8 Rust/Python divergences; this needs a dedicated sprint that wires the suite as an xfail catalogue with each divergence cited.
  • R6/R7 conformance (r6-r7-actions.json): PR #187 (feat(sdk): align Constraint with Rust (threshold+hard), Sprint 51 T1 audit P6) changed the Constraint shape (from value: Any to threshold: float + hard: bool) the same day the operator shipped the vectors. The vectors need a freshness check before wiring.

Cross-reviewer alignment

Addresses Nova GPT's #1 quick-win ("test vectors + conformance") on the Python side. Partially advances Kimi's K2 gap (conformance test suite missing, named rounds 1–4).

Test plan

  • All 13 ATP conformance tests pass against current SDK
  • 8 of 11 Society conformance tests pass; the other 3 are marked strict-xfail with reason strings
  • Full SDK test suite green: 2691 passed + 3 xfailed
  • mypy --strict: clean
  • ruff check + ruff format: clean
  • No modifications to product code in web4/ (verification-only PR)
  • No modifications to conformance vector JSON files (operator-authored, authoritative)

Notes for reviewer

A parallel autonomous session (worker/web4-20260514-180011, launched 13s before this one) proposed the same conformance-wiring scope at broader fidelity (all 4 suites in one file) but stalled at "Step 6 Progress" with no commits and no test files. This PR ships the narrower verified subset. If both branches appear on the queue, this one has the working tested code.

Session log: private-context/autonomous-sessions/legion-web4-20260514-180024-session.md

🤖 Generated with Claude Code

Operator burst-4 (commits a2727b4, 92454d6, 0c39a9b at 12:03–12:04 PDT)
shipped the cross-language conformance corpus to
web4-standard/testing/conformance/ — 4 JSON suites, 35 vectors total,
declared "any Web4 implementation MUST produce identical results." But no
Python SDK test asserted against them.

Sprint 52 T1 wires the two best-aligned suites:

- tests/test_conformance_atp.py: 11 ATP vectors (account, transfer,
  sliding-scale) + 2 meta checks → 13/13 PASS. Empirically confirms
  Sprint 49 audit's "ATP is the best-aligned cross-language pair
  (identical core semantics)" claim — every operator-authored vector
  matches Python SDK output exactly.

- tests/test_conformance_society.py: 9 Society/Role vectors (bootstrap,
  role, federation, minimum-viable) + 2 meta checks → 8 PASS, 3 strict-
  xfail with documented divergences:

    * soc-002 (5-state lifecycle): Python splits combined enum into
      SocietyPhase (3) + MetabolicState (separate axis). Cites audit P4.
    * role-004 (assigner-permission table): SDK role.py does not encode
      role-based permission to assign other roles. New surface gap not
      in Sprint 49 audit.
    * fed-001 (imperative join/secede): SDK federation.Society uses
      constructor-hierarchy pattern, not imperative action methods.

The strict=True xfails convert documentary audit findings into executable
markers: if the SDK ever gains the matching surface, the test flips to
XPASS and must be reviewed — preventing silent surface drift.

Out of bounds: T3/V3 (tensor-operations.json) and R6/R7
(r6-r7-actions.json) conformance NOT wired. Sprint 47 documented 8 T3/V3
divergences (separate sprint needed). R6/R7 vectors need freshness check
post-PR-#187 Constraint shape change.

Addresses Nova GPT review's #1 quick-win (test vectors + conformance)
on the Python side and partially advances Kimi's K2 gap (R6/R7
conformance test suite missing).

2 new test files, 0 product code modifications, 24 new tests (full suite:
2691 pass + 3 strict-xfail), mypy --strict clean, ruff lint/format clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Owner Author

dp-web4 commented May 15, 2026

REJECTED: Superseded by concurrent PR #189 (now merged).

Background: Two autonomous workers raced on the same Sprint 52 T1 task. PR #189 (worker/web4-20260514-180011, launched 13s earlier) shipped all 4 conformance suites (35 vectors) in one file. This PR (worker/web4-20260514-180024) shipped 2 of 4 suites (ATP + Society/Role, 20 vectors) with stricter pytest.mark.xfail(strict=True) semantics.

Why this was a closer call than the wording suggests: Your strict-xfail discipline is the better epistemic choice, because it makes silent convergence (an xfail that starts passing) loud (XPASS = test failure), which is what R&D needs to identify drift. PR #189 used inline pytest.xfail(...) calls, which allow silent convergence. The deferral of T3/V3 and R6/R7 to a separate sprint that can "catalogue divergences" is also methodologically rigorous.

Why I went with #189 instead: Broader coverage NOW (39 tests vs 24) gets more cross-language signal into main today. The xfail-tightening can be applied as a follow-up sprint.

Recommended follow-up sprint scope (Sprint 53 candidate): Convert the inline pytest.xfail(...) calls in #189's test_conformance.py to @pytest.mark.xfail(strict=True) decorators. This is a small, surgical change — keep the divergence reasons, just change the mechanism so silent convergence is detectable. Your two test files in this PR can serve as the reference implementation.
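
A minimal sketch of why the mechanism matters, using two hypothetical helper functions (`inline_style` and `marker_style` are illustration names, not code from #189). An inline `pytest.xfail(...)` call raises immediately, so the comparison after it never executes and can never report that the divergence has closed. The decorator form leaves the body to run, so pytest can see an unexpected pass.

```python
import pytest

calls: list[str] = []


def inline_style(actual: int, expected: int) -> None:
    # pytest.xfail() raises right here; nothing below this line runs.
    pytest.xfail("known divergence (illustrative reason)")
    calls.append("inline")
    assert actual == expected


@pytest.mark.xfail(strict=True, reason="known divergence (illustrative reason)")
def marker_style(actual: int, expected: int) -> None:
    # The mark is only metadata; the body still executes in full.
    calls.append("marker")
    assert actual == expected


try:
    inline_style(1, 1)  # values agree, yet the comparison is never reached
except pytest.xfail.Exception:
    pass

marker_style(1, 1)  # comparison runs; under pytest this is XPASS(strict)
```

After both calls, `calls` contains only `"marker"`: the inline variant stopped before its assertion even though the values matched, which is the silent convergence the follow-up sprint would remove.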

Worker should propose a different scope next session: either strict-xfail refinement or a different conformance dimension (e.g., wire the conformance vectors into the Rust SDK now that Python is done, to expose the cross-language gaps the Sprint 47 audit identified).

@dp-web4 dp-web4 closed this May 15, 2026