@alhridoy alhridoy commented Sep 9, 2025

Description

Fixes #4035 - CI fails with ValueError: zero-size array to reduction operation maximum which has no identity when using use_transformers_paged=True in OnlineDPOTrainer.

Root Cause

The bug occurred in the pad function in trl/trainer/utils.py when generate_batch returns empty results, leading to an empty completion_ids list. When this empty list is passed to the pad function, numpy.max([t.shape for t in tensors], 0) fails because numpy cannot compute the maximum of an empty array.
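A minimal reproduction of the failure mode, as a sketch only: `max_shape` is a hypothetical helper that mirrors just the shape reduction quoted above, and it takes plain shape tuples so that it runs without torch.

```python
import numpy as np


def max_shape(shapes):
    # Hypothetical helper: element-wise maximum over a list of shapes,
    # mirroring the np.max(..., 0) reduction described above.
    return np.max(shapes, 0).tolist()


# With at least one shape, the reduction works as expected.
print(max_shape([(3,), (5,)]))  # [5]

# With an empty list, NumPy has nothing to reduce over and raises.
try:
    max_shape([])
except ValueError as err:
    print(f"ValueError: {err}")
```

The second call is the case hit when completion_ids is empty.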

The issue affects three trainers:

  • OnlineDPOTrainer
  • GRPOTrainer
  • RLOOTrainer

Solution

Implemented a two-layer fix, backed by unit tests:

1. Fixed the pad function to handle empty lists gracefully

def pad(tensors: list[torch.Tensor], ...):
    # Handle empty tensors list
    if not tensors:
        return torch.empty((0,), dtype=torch.int64)
    
    # Original logic continues...

2. Added defensive checks in affected trainers

Added checks to handle cases where no completions are generated, returning appropriate empty tensors instead of calling pad with empty lists.

3. Added comprehensive unit tests

Added three new test cases in tests/test_utils.py to ensure the pad function handles empty lists correctly with different parameters.
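To make the guard from step 1 concrete, here is a simplified sketch. It is not the actual `pad` from trl/trainer/utils.py: `pad_arrays` is a hypothetical stand-in that uses NumPy arrays instead of torch tensors, pads on the right only, and assumes all inputs share a dtype. The two calls at the end show the kind of behaviour the new tests are meant to check.

```python
import numpy as np


def pad_arrays(arrays, padding_value=0):
    """Right-pad arrays to a common shape (simplified, hypothetical stand-in)."""
    # Guard: with no inputs there is no shape to reduce over, so return an
    # empty array instead of calling np.max on an empty list.
    if not arrays:
        return np.empty((0,), dtype=np.int64)

    # Element-wise maximum over all shapes gives the padded per-item shape.
    output_shape = np.max([a.shape for a in arrays], 0).tolist()
    output = np.full(
        (len(arrays), *output_shape), padding_value, dtype=arrays[0].dtype
    )
    for i, a in enumerate(arrays):
        # Copy each array into the leading corner of its slot.
        slices = tuple(slice(0, s) for s in a.shape)
        output[(i, *slices)] = a
    return output


padded = pad_arrays([np.array([1, 2, 3]), np.array([4])])
print(padded.tolist())       # [[1, 2, 3], [4, 0, 0]]
print(pad_arrays([]).shape)  # (0,)
```

Note that the empty result has shape (0,), so callers expecting a 2-D batch still need to handle it explicitly, which is what the trainer-side checks in step 2 are for.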

…e#4035)

- Fix pad function in utils.py to handle empty tensors list gracefully
- Add defensive checks in OnlineDPOTrainer, GRPOTrainer, and RLOOTrainer
- Add comprehensive unit tests for empty list handling
- Resolves CI failures with use_transformers_paged=True

The bug occurred when generate_batch returns no outputs, leading to an empty
completion_ids list. When passed to pad(), numpy.max() would fail with:
'ValueError: zero-size array to reduction operation maximum which has no identity'

This fix provides two layers of protection:
1. pad() function now returns empty tensor for empty input lists
2. Trainers handle empty completion_ids gracefully before calling pad()

Fixes: huggingface#4035
Collaborator

kashif commented Sep 9, 2025

@alhridoy the reason it's empty is a transformers bug, which is now fixed on main: huggingface/transformers#40692

perhaps instead of these changes we can comment out the tests for now?


@albertvillanova albertvillanova left a comment

Thanks for your contribution, @alhridoy.

However, I agree with @kashif: this PR seems to address the consequence rather than the root cause of the underlying issue:

  • Why is completion_ids empty in the first place? The empty list indicates that the paged attention generation failed completely, which suggests a more fundamental problem.
  • With your fix, users will now get a clearer error message, but the underlying functionality (paged attention generation) remains broken. The training will still fail, just with a different error.
  • The original error provided a breadcrumb trail to the root cause. This fix might make it harder to diagnose the upstream issue.

@kashif if the PR you mentioned already fixes the upstream issue, I would propose to temporarily (until the patch is released) mark the failing tests so the CI becomes green. I can take care of that.
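One way the temporary marking could look, as a sketch only: the class and test names below are hypothetical, and it uses the standard-library `unittest.skip` decorator rather than whatever marker the test suite ends up using.

```python
import unittest


class PagedGenerationTests(unittest.TestCase):  # hypothetical test class
    @unittest.skip(
        "Upstream fix in huggingface/transformers#40692 is not released yet"
    )
    def test_generation_with_paged_attention(self):  # hypothetical test name
        # Body is never executed while the skip marker is in place.
        self.fail("would hit the zero-size array ValueError")

    def test_unaffected_path(self):
        # Tests that do not use paged generation keep running as usual.
        self.assertTrue(True)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(PagedGenerationTests)
result = unittest.TestResult()
suite.run(result)
print(len(result.skipped), result.wasSuccessful())  # 1 True
```

The skip reason points at the upstream PR, so the marker is easy to find and remove once the patched release is available.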

Collaborator

kashif commented Sep 9, 2025

yes @albertvillanova that would be awesome if we can temp. disable the failing tests

Development

Successfully merging this pull request may close these issues.

CI fails: ValueError: zero-size array to reduction operation maximum which has no identity
