Skip to content

[data] fix KeyError on unmapped role tag in openai converter main loop#10606

Open
he-yufeng wants to merge 1 commit into
hiyouga:mainfrom
he-yufeng:fix/openai-converter-invalid-role-keyerror
Open

[data] fix KeyError on unmapped role tag in openai converter main loop#10606
he-yufeng wants to merge 1 commit into
hiyouga:mainfrom
he-yufeng:fix/openai-converter-invalid-role-keyerror

Conversation

@he-yufeng

Copy link
Copy Markdown

What does this PR do?

OpenAIDatasetConverter.__call__ builds aligned_messages like this:

aligned_messages.append(
    {
        "role": tag_mapping[role],
        "content": content,
    }
)

role is never checked against tag_mapping before this lookup, so a single
message whose role tag isn't one of the configured tags (user_tag,
assistant_tag, observation_tag, function_tag, system_tag) raises a
KeyError and takes down the whole datasets.map conversion. The validity
loop right below only runs on aligned_messages after they're built, so it
never gets the chance to flag the row.

The SharegptDatasetConverter main loop already guards this exact case and
just skips the malformed example with a warning:

if message[self.dataset_attr.role_tag] not in accept_tags[turn_idx % 2]:
    logger.warning_rank0(f"Invalid role tag in {messages}.")
    broken_data = True
    break

This adds the matching guard to the openai converter so one bad row gets
skipped with a warning instead of aborting the run.

Repro on main (the new test fails with KeyError: 'not_a_role' before the
fix, passes after):

example = {
    "conversations": [
        {"from": "human", "value": "Solve the math problem.\n3 + 4"},
        {"from": "not_a_role", "value": "The answer is 7."},
    ]
}
get_dataset_converter("openai", dataset_attr, data_args)(example)

This is a sibling of #10601, which fixes the same tag_mapping KeyError
class on the ranking/pairwise chosen/rejected path. This one is the
regular (non-ranking) main message loop, a different code path; the diffs
don't overlap.

Before submitting

  • Did you read the contributor guideline?
  • Did you write any new necessary tests? (tests/data/test_converter.py::test_openai_converter_skips_invalid_role)

The OpenAIDatasetConverter builds aligned_messages with tag_mapping[role]
but never checks role against tag_mapping first, so a single message whose
role tag is not in the mapping raises KeyError and aborts the whole dataset
conversion. The later validity loop runs on aligned_messages, which is too
late to catch it. The SharegptDatasetConverter main loop already guards this
case and skips the example, so this just brings the openai path in line.

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces validation in the dataset converter to skip messages with invalid role tags, preventing potential KeyError crashes, and adds a corresponding unit test. The review feedback suggests explicitly checking if the role is None to ensure robust validation and prevent unexpected behavior when certain tags are unconfigured.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +278 to +281
if role not in tag_mapping:
logger.warning_rank0(f"Invalid role tag in {messages}.")
broken_data = True
break

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If role is None (e.g., due to a malformed message) and any of the dataset attributes in tag_mapping (such as system_tag or observation_tag) are also None (unconfigured), role not in tag_mapping will evaluate to False. This can lead to incorrect mapping or unexpected behavior.

We should explicitly guard against role being None to ensure robust validation.

Suggested change
if role not in tag_mapping:
logger.warning_rank0(f"Invalid role tag in {messages}.")
broken_data = True
break
if role is None or role not in tag_mapping:
logger.warning_rank0(f"Invalid role tag in {messages}.")
broken_data = True
break

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant