[data] fix KeyError on unmapped role tag in openai converter main loop#10606
[data] fix KeyError on unmapped role tag in openai converter main loop#10606he-yufeng wants to merge 1 commit into
Conversation
The OpenAIDatasetConverter builds aligned_messages with tag_mapping[role] but never checks role against tag_mapping first, so a single message whose role tag is not in the mapping raises KeyError and aborts the whole dataset conversion. The later validity loop runs on aligned_messages, which is too late to catch it. The SharegptDatasetConverter main loop already guards this case and skips the example, so this just brings the openai path in line. Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
There was a problem hiding this comment.
Code Review
This pull request introduces validation in the dataset converter to skip messages with invalid role tags, preventing potential KeyError crashes, and adds a corresponding unit test. The review feedback suggests explicitly checking if the role is None to ensure robust validation and prevent unexpected behavior when certain tags are unconfigured.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if role not in tag_mapping: | ||
| logger.warning_rank0(f"Invalid role tag in {messages}.") | ||
| broken_data = True | ||
| break |
There was a problem hiding this comment.
If role is None (e.g., due to a malformed message) and any of the dataset attributes in tag_mapping (such as system_tag or observation_tag) are also None (unconfigured), role not in tag_mapping will evaluate to False. This can lead to incorrect mapping or unexpected behavior.
We should explicitly guard against role being None to ensure robust validation.
| if role not in tag_mapping: | |
| logger.warning_rank0(f"Invalid role tag in {messages}.") | |
| broken_data = True | |
| break | |
| if role is None or role not in tag_mapping: | |
| logger.warning_rank0(f"Invalid role tag in {messages}.") | |
| broken_data = True | |
| break |
What does this PR do?
OpenAIDatasetConverter.__call__buildsaligned_messageslike this:roleis never checked againsttag_mappingbefore this lookup, so a singlemessage whose role tag isn't one of the configured tags (
user_tag,assistant_tag,observation_tag,function_tag,system_tag) raises aKeyErrorand takes down the wholedatasets.mapconversion. The validityloop right below only runs on
aligned_messagesafter they're built, so itnever gets the chance to flag the row.
The
SharegptDatasetConvertermain loop already guards this exact case andjust skips the malformed example with a warning:
This adds the matching guard to the openai converter so one bad row gets
skipped with a warning instead of aborting the run.
Repro on
main(the new test fails withKeyError: 'not_a_role'before thefix, passes after):
This is a sibling of #10601, which fixes the same
tag_mappingKeyErrorclass on the ranking/pairwise
chosen/rejectedpath. This one is theregular (non-ranking) main message loop, a different code path; the diffs
don't overlap.
Before submitting
tests/data/test_converter.py::test_openai_converter_skips_invalid_role)