Add tokenstr serialization option to AlignmentGroup asdict methods #2
Conversation
AlignmentRecord.asdict(), AlignmentGroup.asdict(), and TopLevelGroups.asdict()
now accept optional source_tokens and target_tokens dicts (mapping bare token
IDs to token objects). When provided, selectors are replaced with tokenstr
representations ("{id}|{text}") instead of plain token IDs. The ID portion
uses the bare form by default, or the macula-prefixed form when
withmaculaprefix=True.
https://claude.ai/code/session_019ZWN6Aj7hG62s81j1wmgQ7
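The replacement described above can be sketched as follows. This is a minimal illustration, not the library's actual implementation: `Token` and `format_selector` are hypothetical stand-ins, assuming token objects expose `id` and `text` attributes and that the macula prefix is supplied by the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    """Hypothetical stand-in for the library's token class."""

    id: str  # bare token ID
    text: str  # surface text of the token


def format_selector(
    selector: str,
    tokens: Optional[dict[str, Token]],
    withmaculaprefix: bool = False,
    prefix: str = "n",
) -> str:
    """Return the '{id}|{text}' tokenstr for selector, or the plain ID if no token is found.

    The prefix argument is an assumption here; the real code derives the
    macula prefix from the corpus.
    """
    if tokens and selector in tokens:
        token = tokens[selector]
        tokenid = f"{prefix}{token.id}" if withmaculaprefix else token.id
        return f"{tokenid}|{token.text}"
    return selector
```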
Pull request overview
This PR enhances Burrito alignment serialization by optionally replacing selector IDs with tokenstr ("{id}|{text}") when token dictionaries are provided, enabling more human-readable JSON output while keeping defaults backward-compatible.
Changes:
- Added optional `source_tokens`/`target_tokens` parameters to `AlignmentRecord.asdict()`, `AlignmentGroup.asdict()`, and `TopLevelGroups.asdict()`.
- Implemented selector-to-tokenstr replacement (source supports bare IDs by default and Macula-prefixed IDs when `withmaculaprefix=True`).
- Added unit tests for `AlignmentRecord.asdict()` and `AlignmentGroup.asdict()` token-dict behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `biblealignlib/burrito/AlignmentGroup.py` | Adds token-dict serialization options to `asdict()` methods and threads token dicts through group/top-level serialization. |
| `tests/biblealignlib/burrito/test_AlignmentGroup.py` | Adds tests validating tokenstr serialization for record/group `asdict()` when token dictionaries are provided. |
```python
                hoist=hoist, source_tokens=source_tokens, target_tokens=target_tokens
            ),
        ],
    }
```
TopLevelGroups.asdict() now accepts source_tokens/target_tokens and forwards them into each group, but there isn't a test exercising this new behavior. Please add a unit test that calls TopLevelGroups.asdict(source_tokens=..., target_tokens=...) and asserts selectors are tokenstr-formatted in the nested records (and that missing tokens fall back to plain IDs).
```python
# Assumes the test module already imports Optional, Any, and TopLevelGroups.
def test_toplevelgroups_asdict_forwards_tokens() -> None:
    """Unit test for TopLevelGroups.asdict token forwarding and selector formatting.

    This uses dummy group objects so we do not depend on AlignmentGroup internals.
    """

    class _DummyMeta:
        def __init__(self, conforms_to: str) -> None:
            self.conformsTo = conforms_to

    class _DummyDoc:
        def __init__(self, docid: str) -> None:
            self.docid = docid

    class _DummyGroup:
        def __init__(self, canon: str, sourcedocid: str, target_docid: str) -> None:
            # attributes used in TopLevelGroups.__post_init__
            self.roles = ["source", "target"]
            self.meta = _DummyMeta(conforms_to="test-conforms-to")
            self.documents = (_DummyDoc(sourcedocid), _DummyDoc(target_docid))
            self.canon = canon
            self.sourcedocid = sourcedocid

        def asdict(
            self,
            hoist: bool = True,
            source_tokens: Optional[dict[str, Any]] = None,
            target_tokens: Optional[dict[str, Any]] = None,
        ) -> dict[str, Any]:
            # Simulate tokenstr formatting with fallback to plain IDs.
            source_tokens = source_tokens or {}
            target_tokens = target_tokens or {}

            def _src(id_: str) -> str:
                return source_tokens.get(id_, id_)

            def _tgt(id_: str) -> str:
                return target_tokens.get(id_, id_)

            return {
                "alignments": [
                    {
                        "source": [
                            {"selector": _src("s1")},  # has token
                            {"selector": _src("s_missing")},  # no token
                        ],
                        "target": [
                            {"selector": _tgt("t1")},  # has token
                            {"selector": _tgt("t_missing")},  # no token
                        ],
                    }
                ]
            }

    # Prepare dummy groups: one OT and one NT, sharing target docid.
    target_docid = "target-doc"
    group_ot = _DummyGroup(canon="ot", sourcedocid="source-ot", target_docid=target_docid)
    group_nt = _DummyGroup(canon="nt", sourcedocid="source-nt", target_docid=target_docid)
    tgroups = TopLevelGroups(groups=(group_ot, group_nt))

    source_tokens = {"s1": "GEN.1.1!1"}
    target_tokens = {"t1": "GEN.1.1!1"}
    result = tgroups.asdict(source_tokens=source_tokens, target_tokens=target_tokens)

    # Ensure we have two groups in the serialized structure.
    assert len(result["groups"]) == 2
    first_group = result["groups"][0]
    align = first_group["alignments"][0]
    # Token-present IDs should be replaced with token strings.
    assert align["source"][0]["selector"] == "GEN.1.1!1"
    assert align["target"][0]["selector"] == "GEN.1.1!1"
    # Missing IDs should fall back to plain IDs.
    assert align["source"][1]["selector"] == "s_missing"
    assert align["target"][1]["selector"] == "t_missing"
```
```python
        source_tokens: Optional[dict[str, Any]] = None,
        target_tokens: Optional[dict[str, Any]] = None,
    ) -> dict[str, Any]:
        """Return an opionated dict of values suitable for serialization.
```
Typo in docstring: "opionated" should be "opinionated".
```diff
-        """Return an opionated dict of values suitable for serialization.
+        """Return an opinionated dict of values suitable for serialization.
```
Add optional source_tokens and target_tokens parameters to
write_alignment_group() so callers can write token selectors as
"{id}|{text}" tokenstr representations. The parameters are passed
through to AlignmentRecord.asdict() via the inner _record_dict helper.
Also adds TestWriteAlignmentGroup with five tests covering default
(plain IDs), source-only, target-only, both, and JSON validity.
https://claude.ai/code/session_019ZWN6Aj7hG62s81j1wmgQ7
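The pass-through pattern that commit describes can be sketched roughly as below. This is a guess at the shape of the code, not the library's actual implementation: the `write_alignment_group()` signature, the `group.records` attribute, and the output format are all assumptions, and `record.asdict()` stands in for `AlignmentRecord.asdict()`.

```python
import json
from typing import Any, IO, Optional


def write_alignment_group(
    group: Any,
    outfile: IO[str],
    source_tokens: Optional[dict[str, Any]] = None,
    target_tokens: Optional[dict[str, Any]] = None,
) -> None:
    """Write a group's records as JSON, threading token dicts into each record's asdict()."""

    def _record_dict(record: Any) -> dict[str, Any]:
        # Inner helper mirroring the _record_dict mentioned in the commit message:
        # it forwards the optional token dicts so selectors come back as tokenstrs.
        return record.asdict(source_tokens=source_tokens, target_tokens=target_tokens)

    json.dump([_record_dict(rec) for rec in group.records], outfile, ensure_ascii=False)
```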
Add strip_tokenstr() helper that extracts the ID from a "{id}|{text}"
tokenstr selector, leaving plain IDs unchanged. Use it in:
- macula_unprefixer(): strips text suffix before checking for the 'n'/'o'
corpus prefix, so tokenstr selectors from written-with-token-dicts JSON
are normalised correctly on re-read.
- AlignmentsReader._targetid(): strips text suffix before applying the
word-part truncation length check.
Also exports strip_tokenstr from biblealignlib.burrito and adds
TestStripTokenstr plus new tokenstr test cases in TestMacula_Unprefixer.
https://claude.ai/code/session_019ZWN6Aj7hG62s81j1wmgQ7
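Given that description, the helper's behavior can be sketched in a few lines; this is an assumption about the implementation, and the real `strip_tokenstr` in `biblealignlib.burrito` may differ in detail.

```python
def strip_tokenstr(selector: str) -> str:
    """Return the ID portion of a '{id}|{text}' tokenstr selector.

    Plain IDs (no '|' separator) pass through unchanged, so the helper is
    safe to apply to selectors whether or not they carry a text suffix.
    """
    return selector.split("|", 1)[0]
```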
Summary
- Added optional `source_tokens` and `target_tokens` parameters to `AlignmentRecord.asdict()`, `AlignmentGroup.asdict()`, and `TopLevelGroups.asdict()`
- Selectors are replaced with `tokenstr` representations (`"{id}|{text}"`) when token dicts are provided
- The macula-prefixed ID form is used when `withmaculaprefix=True`
- Imported `macula_unprefixer` into `AlignmentGroup.py` to support prefix stripping during lookup

Test plan
- `test_asdict_with_source_tokens`: source selectors replaced with bare tokenstr
- `test_asdict_with_source_tokens_withmaculaprefix`: source selectors replaced with prefixed tokenstr
- `test_asdict_with_target_tokens`: target selectors replaced with tokenstr
- `test_asdict_with_both_token_dicts`: both source and target replaced
- `test_asdict_missing_token_leaves_selector`: selectors not in dict fall back to plain ID
- `test_asdict_with_token_dicts` (AlignmentGroup): token dicts propagated through to records

https://claude.ai/code/session_019ZWN6Aj7hG62s81j1wmgQ7