Add tokenstr serialization option to AlignmentGroup asdict methods #2
Conversation
AlignmentRecord.asdict(), AlignmentGroup.asdict(), and TopLevelGroups.asdict()
now accept optional source_tokens and target_tokens dicts (mapping bare token
IDs to token objects). When provided, selectors are replaced with tokenstr
representations ("{id}|{text}") instead of plain token IDs. The ID portion
uses the bare form by default, or the macula-prefixed form when
withmaculaprefix=True.
https://claude.ai/code/session_019ZWN6Aj7hG62s81j1wmgQ7
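The replacement described above can be sketched as follows. This is a minimal illustration, not the library's actual implementation: `Token` and `format_selector` are hypothetical stand-ins, assuming token objects expose `id` and `text` attributes and that the macula prefix is supplied by the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    """Hypothetical stand-in for the library's token class."""

    id: str  # bare token ID
    text: str  # surface text of the token


def format_selector(
    selector: str,
    tokens: Optional[dict[str, Token]],
    withmaculaprefix: bool = False,
    prefix: str = "n",
) -> str:
    """Return the '{id}|{text}' tokenstr for selector, or the plain ID if no token is found.

    The prefix argument is an assumption here; the real code derives the
    macula prefix from the corpus.
    """
    if tokens and selector in tokens:
        token = tokens[selector]
        tokenid = f"{prefix}{token.id}" if withmaculaprefix else token.id
        return f"{tokenid}|{token.text}"
    return selector
```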
Pull request overview
This PR enhances Burrito alignment serialization by optionally replacing selector IDs with tokenstr ("{id}|{text}") when token dictionaries are provided, enabling more human-readable JSON output while keeping defaults backward-compatible.
Changes:
- Added optional `source_tokens`/`target_tokens` parameters to `AlignmentRecord.asdict()`, `AlignmentGroup.asdict()`, and `TopLevelGroups.asdict()`.
- Implemented selector-to-tokenstr replacement (source supports bare IDs by default and Macula-prefixed IDs when `withmaculaprefix=True`).
- Added unit tests for `AlignmentRecord.asdict()` and `AlignmentGroup.asdict()` token-dict behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `biblealignlib/burrito/AlignmentGroup.py` | Adds token-dict serialization options to `asdict()` methods and threads token dicts through group/top-level serialization. |
| `tests/biblealignlib/burrito/test_AlignmentGroup.py` | Adds tests validating tokenstr serialization for record/group `asdict()` when token dictionaries are provided. |
```python
                hoist=hoist, source_tokens=source_tokens, target_tokens=target_tokens
            ),
        ],
    }
```
TopLevelGroups.asdict() now accepts source_tokens/target_tokens and forwards them into each group, but there isn't a test exercising this new behavior. Please add a unit test that calls TopLevelGroups.asdict(source_tokens=..., target_tokens=...) and asserts selectors are tokenstr-formatted in the nested records (and that missing tokens fall back to plain IDs).
```python
# Assumes the test module already imports Optional, Any, and TopLevelGroups.
def test_toplevelgroups_asdict_forwards_tokens() -> None:
    """Unit test for TopLevelGroups.asdict token forwarding and selector formatting.

    This uses dummy group objects so we do not depend on AlignmentGroup internals.
    """

    class _DummyMeta:
        def __init__(self, conforms_to: str) -> None:
            self.conformsTo = conforms_to

    class _DummyDoc:
        def __init__(self, docid: str) -> None:
            self.docid = docid

    class _DummyGroup:
        def __init__(self, canon: str, sourcedocid: str, target_docid: str) -> None:
            # attributes used in TopLevelGroups.__post_init__
            self.roles = ["source", "target"]
            self.meta = _DummyMeta(conforms_to="test-conforms-to")
            self.documents = (_DummyDoc(sourcedocid), _DummyDoc(target_docid))
            self.canon = canon
            self.sourcedocid = sourcedocid

        def asdict(
            self,
            hoist: bool = True,
            source_tokens: Optional[dict[str, Any]] = None,
            target_tokens: Optional[dict[str, Any]] = None,
        ) -> dict[str, Any]:
            # Simulate tokenstr formatting with fallback to plain IDs.
            source_tokens = source_tokens or {}
            target_tokens = target_tokens or {}

            def _src(id_: str) -> str:
                return source_tokens.get(id_, id_)

            def _tgt(id_: str) -> str:
                return target_tokens.get(id_, id_)

            return {
                "alignments": [
                    {
                        "source": [
                            {"selector": _src("s1")},  # has token
                            {"selector": _src("s_missing")},  # no token
                        ],
                        "target": [
                            {"selector": _tgt("t1")},  # has token
                            {"selector": _tgt("t_missing")},  # no token
                        ],
                    }
                ]
            }

    # Prepare dummy groups: one OT and one NT, sharing target docid.
    target_docid = "target-doc"
    group_ot = _DummyGroup(canon="ot", sourcedocid="source-ot", target_docid=target_docid)
    group_nt = _DummyGroup(canon="nt", sourcedocid="source-nt", target_docid=target_docid)
    tgroups = TopLevelGroups(groups=(group_ot, group_nt))

    source_tokens = {"s1": "GEN.1.1!1"}
    target_tokens = {"t1": "GEN.1.1!1"}
    result = tgroups.asdict(source_tokens=source_tokens, target_tokens=target_tokens)

    # Ensure we have two groups in the serialized structure.
    assert len(result["groups"]) == 2
    first_group = result["groups"][0]
    align = first_group["alignments"][0]
    # Token-present IDs should be replaced with token strings.
    assert align["source"][0]["selector"] == "GEN.1.1!1"
    assert align["target"][0]["selector"] == "GEN.1.1!1"
    # Missing IDs should fall back to plain IDs.
    assert align["source"][1]["selector"] == "s_missing"
    assert align["target"][1]["selector"] == "t_missing"
```
```python
        source_tokens: Optional[dict[str, Any]] = None,
        target_tokens: Optional[dict[str, Any]] = None,
    ) -> dict[str, Any]:
        """Return an opionated dict of values suitable for serialization.
```
Typo in docstring: "opionated" should be "opinionated".
```diff
-        """Return an opionated dict of values suitable for serialization.
+        """Return an opinionated dict of values suitable for serialization.
```
Add optional source_tokens and target_tokens parameters to
write_alignment_group() so callers can write token selectors as
"{id}|{text}" tokenstr representations. The parameters are passed
through to AlignmentRecord.asdict() via the inner _record_dict helper.
Also adds TestWriteAlignmentGroup with five tests covering default
(plain IDs), source-only, target-only, both, and JSON validity.
https://claude.ai/code/session_019ZWN6Aj7hG62s81j1wmgQ7
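The pass-through pattern that commit describes can be sketched roughly as below. This is a guess at the shape of the code, not the library's actual implementation: the `write_alignment_group()` signature, the `group.records` attribute, and the output format are all assumptions, and `record.asdict()` stands in for `AlignmentRecord.asdict()`.

```python
import json
from typing import Any, IO, Optional


def write_alignment_group(
    group: Any,
    outfile: IO[str],
    source_tokens: Optional[dict[str, Any]] = None,
    target_tokens: Optional[dict[str, Any]] = None,
) -> None:
    """Write a group's records as JSON, threading token dicts into each record's asdict()."""

    def _record_dict(record: Any) -> dict[str, Any]:
        # Inner helper mirroring the _record_dict mentioned in the commit message:
        # it forwards the optional token dicts so selectors come back as tokenstrs.
        return record.asdict(source_tokens=source_tokens, target_tokens=target_tokens)

    json.dump([_record_dict(rec) for rec in group.records], outfile, ensure_ascii=False)
```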
Add strip_tokenstr() helper that extracts the ID from a "{id}|{text}"
tokenstr selector, leaving plain IDs unchanged. Use it in:
- macula_unprefixer(): strips text suffix before checking for the 'n'/'o'
corpus prefix, so tokenstr selectors from written-with-token-dicts JSON
are normalised correctly on re-read.
- AlignmentsReader._targetid(): strips text suffix before applying the
word-part truncation length check.
Also exports strip_tokenstr from biblealignlib.burrito and adds
TestStripTokenstr plus new tokenstr test cases in TestMacula_Unprefixer.
https://claude.ai/code/session_019ZWN6Aj7hG62s81j1wmgQ7
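Given that description, the helper's behavior can be sketched in a few lines; this is an assumption about the implementation, and the real `strip_tokenstr` in `biblealignlib.burrito` may differ in detail.

```python
def strip_tokenstr(selector: str) -> str:
    """Return the ID portion of a '{id}|{text}' tokenstr selector.

    Plain IDs (no '|' separator) pass through unchanged, so the helper is
    safe to apply to selectors whether or not they carry a text suffix.
    """
    return selector.split("|", 1)[0]
```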
Summary
- Added optional `source_tokens` and `target_tokens` parameters to `AlignmentRecord.asdict()`, `AlignmentGroup.asdict()`, and `TopLevelGroups.asdict()`
- Selectors are replaced with `tokenstr` representations (`"{id}|{text}"`) when token dicts are provided
- The macula-prefixed ID form is used when `withmaculaprefix=True`
- Imported `macula_unprefixer` into `AlignmentGroup.py` to support prefix stripping during lookup

Test plan
- `test_asdict_with_source_tokens`: source selectors replaced with bare tokenstr
- `test_asdict_with_source_tokens_withmaculaprefix`: source selectors replaced with prefixed tokenstr
- `test_asdict_with_target_tokens`: target selectors replaced with tokenstr
- `test_asdict_with_both_token_dicts`: both source and target replaced
- `test_asdict_missing_token_leaves_selector`: selectors not in dict fall back to plain ID
- `test_asdict_with_token_dicts` (AlignmentGroup): token dicts propagated through to records

https://claude.ai/code/session_019ZWN6Aj7hG62s81j1wmgQ7