
Add tokenstr serialization option to AlignmentGroup asdict methods#2

Open
sboisen wants to merge 3 commits into main from claude/add-tokenstr-serialization-4Zn7N

Conversation

sboisen (Collaborator) commented Mar 24, 2026

Summary

  • Add optional source_tokens and target_tokens parameters to AlignmentRecord.asdict(), AlignmentGroup.asdict(), and TopLevelGroups.asdict()
  • When provided, these dicts (mapping bare token IDs → token objects) replace plain ID selectors with tokenstr representations ("{id}|{text}")
  • The ID portion is bare by default; uses the macula-prefixed form when withmaculaprefix=True
  • Also imports macula_unprefixer into AlignmentGroup.py to support prefix stripping during lookup
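The selector-replacement behavior described above can be sketched as follows. This is an illustrative stand-in, not the real biblealignlib API: `Token` and `format_selector` are hypothetical names, and the `"n"` corpus prefix is hard-coded purely for the example (the real code derives it via the macula prefix machinery).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    """Minimal stand-in for a biblealignlib token: a bare ID plus surface text."""

    id: str
    text: str


def format_selector(
    selector: str,
    tokens: Optional[dict[str, Token]] = None,
    withmaculaprefix: bool = False,
) -> str:
    """Replace a bare-ID selector with its '{id}|{text}' tokenstr.

    Selectors with no entry in `tokens` fall back to the plain ID,
    matching test_asdict_missing_token_leaves_selector.
    """
    if tokens and selector in tokens:
        token = tokens[selector]
        # The "n" prefix here is hard-coded for illustration only.
        tokenid = ("n" + token.id) if withmaculaprefix else token.id
        return f"{tokenid}|{token.text}"
    return selector
```

For example, `format_selector("41001001001", {"41001001001": Token("41001001001", "Βίβλος")})` yields `"41001001001|Βίβλος"`, while an ID missing from the dict passes through unchanged.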

Test plan

  • test_asdict_with_source_tokens — source selectors replaced with bare tokenstr
  • test_asdict_with_source_tokens_withmaculaprefix — source selectors replaced with prefixed tokenstr
  • test_asdict_with_target_tokens — target selectors replaced with tokenstr
  • test_asdict_with_both_token_dicts — both source and target replaced
  • test_asdict_missing_token_leaves_selector — selectors not in dict fall back to plain ID
  • test_asdict_with_token_dicts (AlignmentGroup) — token dicts propagated through to records
  • All 29 existing tests continue to pass

https://claude.ai/code/session_019ZWN6Aj7hG62s81j1wmgQ7

AlignmentRecord.asdict(), AlignmentGroup.asdict(), and TopLevelGroups.asdict()
now accept optional source_tokens and target_tokens dicts (mapping bare token
IDs to token objects). When provided, selectors are replaced with tokenstr
representations ("{id}|{text}") instead of plain token IDs. The ID portion
uses the bare form by default, or the macula-prefixed form when
withmaculaprefix=True.

https://claude.ai/code/session_019ZWN6Aj7hG62s81j1wmgQ7

Copilot AI left a comment


Pull request overview

This PR enhances Burrito alignment serialization by optionally replacing selector IDs with tokenstr ("{id}|{text}") when token dictionaries are provided, enabling more human-readable JSON output while keeping defaults backward-compatible.

Changes:

  • Added optional source_tokens / target_tokens parameters to AlignmentRecord.asdict(), AlignmentGroup.asdict(), and TopLevelGroups.asdict().
  • Implemented selector-to-tokenstr replacement (source supports bare IDs by default and Macula-prefixed IDs when withmaculaprefix=True).
  • Added unit tests for AlignmentRecord.asdict() and AlignmentGroup.asdict() token-dict behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Changed files:

  • biblealignlib/burrito/AlignmentGroup.py: Adds token-dict serialization options to asdict() methods and threads token dicts through group/top-level serialization.
  • tests/biblealignlib/burrito/test_AlignmentGroup.py: Adds tests validating tokenstr serialization for record/group asdict() when token dictionaries are provided.


hoist=hoist, source_tokens=source_tokens, target_tokens=target_tokens
),
],
}

Copilot AI Mar 24, 2026


TopLevelGroups.asdict() now accepts source_tokens/target_tokens and forwards them into each group, but there isn't a test exercising this new behavior. Please add a unit test that calls TopLevelGroups.asdict(source_tokens=..., target_tokens=...) and asserts selectors are tokenstr-formatted in the nested records (and that missing tokens fall back to plain IDs).

Suggested change
}
}
def test_toplevelgroups_asdict_forwards_tokens() -> None:
    """Unit test for TopLevelGroups.asdict token forwarding and selector formatting.

    This uses dummy group objects so we do not depend on AlignmentGroup internals.
    """

    class _DummyMeta:
        def __init__(self, conforms_to: str) -> None:
            self.conformsTo = conforms_to

    class _DummyDoc:
        def __init__(self, docid: str) -> None:
            self.docid = docid

    class _DummyGroup:
        def __init__(self, canon: str, sourcedocid: str, target_docid: str) -> None:
            # attributes used in TopLevelGroups.__post_init__
            self.roles = ["source", "target"]
            self.meta = _DummyMeta(conforms_to="test-conforms-to")
            self.documents = (_DummyDoc(sourcedocid), _DummyDoc(target_docid))
            self.canon = canon
            self.sourcedocid = sourcedocid

        def asdict(
            self,
            hoist: bool = True,
            source_tokens: Optional[dict[str, Any]] = None,
            target_tokens: Optional[dict[str, Any]] = None,
        ) -> dict[str, Any]:
            # Simulate tokenstr formatting with fallback to plain IDs.
            source_tokens = source_tokens or {}
            target_tokens = target_tokens or {}

            def _src(id_: str) -> str:
                return source_tokens.get(id_, id_)

            def _tgt(id_: str) -> str:
                return target_tokens.get(id_, id_)

            return {
                "alignments": [
                    {
                        "source": [
                            {"selector": _src("s1")},  # has token
                            {"selector": _src("s_missing")},  # no token
                        ],
                        "target": [
                            {"selector": _tgt("t1")},  # has token
                            {"selector": _tgt("t_missing")},  # no token
                        ],
                    }
                ]
            }

    # Prepare dummy groups: one OT and one NT, sharing target docid.
    target_docid = "target-doc"
    group_ot = _DummyGroup(canon="ot", sourcedocid="source-ot", target_docid=target_docid)
    group_nt = _DummyGroup(canon="nt", sourcedocid="source-nt", target_docid=target_docid)
    tgroups = TopLevelGroups(groups=(group_ot, group_nt))

    source_tokens = {"s1": "GEN.1.1!1"}
    target_tokens = {"t1": "GEN.1.1!1"}
    result = tgroups.asdict(source_tokens=source_tokens, target_tokens=target_tokens)

    # Ensure we have two groups in the serialized structure.
    assert len(result["groups"]) == 2
    first_group = result["groups"][0]
    align = first_group["alignments"][0]

    # Token-present IDs should be replaced with token strings.
    assert align["source"][0]["selector"] == "GEN.1.1!1"
    assert align["target"][0]["selector"] == "GEN.1.1!1"
    # Missing IDs should fall back to plain IDs.
    assert align["source"][1]["selector"] == "s_missing"
    assert align["target"][1]["selector"] == "t_missing"

source_tokens: Optional[dict[str, Any]] = None,
target_tokens: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
"""Return an opionated dict of values suitable for serialization.

Copilot AI Mar 24, 2026


Typo in docstring: "opionated" should be "opinionated".

Suggested change
"""Return an opionated dict of values suitable for serialization.
"""Return an opinionated dict of values suitable for serialization.

claude added 2 commits March 24, 2026 22:27
Add optional source_tokens and target_tokens parameters to
write_alignment_group() so callers can write token selectors as
"{id}|{text}" tokenstr representations. The parameters are passed
through to AlignmentRecord.asdict() via the inner _record_dict helper.

Also adds TestWriteAlignmentGroup with five tests covering default
(plain IDs), source-only, target-only, both, and JSON validity.

https://claude.ai/code/session_019ZWN6Aj7hG62s81j1wmgQ7
Add strip_tokenstr() helper that extracts the ID from a "{id}|{text}"
tokenstr selector, leaving plain IDs unchanged. Use it in:

- macula_unprefixer(): strips text suffix before checking for the 'n'/'o'
  corpus prefix, so tokenstr selectors from written-with-token-dicts JSON
  are normalised correctly on re-read.
- AlignmentsReader._targetid(): strips text suffix before applying the
  word-part truncation length check.

Also exports strip_tokenstr from biblealignlib.burrito and adds
TestStripTokenstr plus new tokenstr test cases in TestMacula_Unprefixer.

https://claude.ai/code/session_019ZWN6Aj7hG62s81j1wmgQ7
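Given the behaviour this commit describes, strip_tokenstr() reduces to a one-line helper. This sketch assumes the "|" separator never occurs inside a token ID, so everything before the first "|" is the ID:

```python
def strip_tokenstr(selector: str) -> str:
    """Return the ID portion of a '{id}|{text}' tokenstr selector.

    Plain-ID selectors (no '|') pass through unchanged, so downstream
    checks such as macula_unprefixer() see a bare ID either way.
    """
    tokenid, _, _ = selector.partition("|")
    return tokenid
```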