BUG: Fix #46726; wrong result with varying window size min/max rolling calc. #61288

Open: viable-alternative wants to merge 1 commit into base: main

@viable-alternative commented Apr 14, 2025

Summary:

  • Fixes a 3-year-old bug that produced incorrect rolling min/max results for varying window sizes. Adds an error check for invalid inputs.
  • Speed improvement of ~10% by not using an additional queue for handling NaNs.
  • For complex cases, incurs an additional multiplicative log(k) factor in complexity (where k is the max window size), but this cost is only incurred in cases that produced invalid results before. For example, a constant window size does not incur it.
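To make "correct result for varying window sizes" concrete, here is a minimal brute-force sketch that can serve as a reference when checking a rolling max. It is not the algorithm in this PR. The name `reference_rolling_max` is hypothetical, and the sketch assumes half-open windows `[start[i], end[i])` and that NaNs are skipped.

```python
import math


def reference_rolling_max(values, start, end):
    """Brute-force rolling max over windows values[start[i]:end[i]].

    Assumption: windows are half-open and NaNs are ignored. An empty
    (or all-NaN) window yields NaN.
    """
    out = []
    for s, e in zip(start, end):
        window = [v for v in values[s:e] if not math.isnan(v)]
        out.append(max(window) if window else math.nan)
    return out


values = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]
# Window sizes vary from 1 to 4 elements.
start = [0, 0, 1, 1, 4, 4, 5]
end = [1, 3, 3, 5, 5, 7, 7]
result = reference_rolling_max(values, start, end)
print(result)  # [3.0, 4.0, 4.0, 5.0, 5.0, 9.0, 9.0]
```

This costs O(N*k) and is only useful as a correctness oracle for small inputs.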

Changed behavior:

Adds a validity check that raises ValueError if the function detects a condition it cannot work with, namely improper ordering of the start/end bounds. The existing method would happily consume such input and produce a wrong result. A new unit test checks that the ValueError is raised.
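A minimal sketch of what such a check could look like. The exact ordering rule is not spelled out in this description, so the rule used here (the `end` bounds must never decrease) is an assumption, and `check_window_bounds` is a hypothetical name rather than a function from the PR.

```python
def check_window_bounds(start, end):
    """Raise ValueError when the window bounds are improperly ordered.

    Assumption (not confirmed by the PR text): "properly ordered" means
    that the end bounds are non-decreasing.
    """
    if len(start) != len(end):
        raise ValueError("start and end must have the same length")
    for i in range(1, len(end)):
        if end[i] < end[i - 1]:
            raise ValueError(
                f"window bounds are not properly ordered at index {i}"
            )


check_window_bounds([0, 0, 1], [1, 3, 3])  # passes silently

try:
    check_window_bounds([4, 0, 1], [7, 3, 5])  # end goes 7 -> 3
except ValueError as exc:
    print(exc)
```

The check is a single O(N) pass, so it adds little on top of the rolling computation itself.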

Note on invalid inputs:

It is possible to make the method work for an arbitrary stream of start/end window bounds, but that would require sorting. Such work is unlikely to be worth the effort, as the need for it is estimated to be extremely low, if any. Let someone create an enhancement request first.
If sorting is to be implemented, the performance hit can be confined to the case of unsorted input: copy and sort the start/end arrays, producing a permutation; run the main method on the copy; then map the result back using the permutation. Detecting whether the start/end array pair is properly sorted takes only O(N). (Sorting is O(N*log(N)) and does not have to be stable. The input array is extremely likely to be “almost” sorted, so you have to pick your poison: a sorting method that works well on nearly sorted arrays, or a generally efficient one, most of which offer no additional speed on nearly sorted arrays.) Working such an intermediate step (without copying and pasting) into 3 different implementations would require some less-than-straightforward work in the “apply” family of methods used by other rolling functions, and would therefore carry risk. If this is done, an additional parameter to optionally skip the “sorted” check is recommended, since the user may already know that the arrays are properly sorted.
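The sort-and-permute idea above can be sketched as follows. Everything here is illustrative: `rolling_with_sort`, `strict_kernel` and the `assume_sorted` flag are hypothetical names, the brute-force kernel stands in for the real implementation, and "properly sorted" is assumed to mean non-decreasing `end` bounds.

```python
def is_end_sorted(end):
    """O(N) check that the end bounds never decrease (assumed rule)."""
    return all(end[i - 1] <= end[i] for i in range(1, len(end)))


def strict_kernel(values, start, end):
    """Stand-in for the main method: refuses unsorted bounds."""
    if not is_end_sorted(end):
        raise ValueError("window bounds are not properly ordered")
    return [max(values[s:e]) for s, e in zip(start, end)]


def rolling_with_sort(values, start, end, kernel, assume_sorted=False):
    # Fast path: no copy and no sort when the input is already ordered,
    # or when the caller vouches for it via assume_sorted.
    if assume_sorted or is_end_sorted(end):
        return kernel(values, start, end)
    # Slow path: sort window indices by end bound. Python's sorted() is
    # Timsort, which is adaptive on nearly sorted input.
    order = sorted(range(len(end)), key=end.__getitem__)
    sorted_out = kernel(
        values, [start[i] for i in order], [end[i] for i in order]
    )
    # Scatter the results back to the caller's original window order.
    out = [None] * len(end)
    for pos, i in enumerate(order):
        out[i] = sorted_out[pos]
    return out


values = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]
start = [4, 0, 1]
end = [7, 3, 5]  # not sorted: 7 comes before 3
print(rolling_with_sort(values, start, end, strict_kernel))  # [9.0, 4.0, 5.0]
```

Only unsorted input pays for the copy, the sort and the scatter; sorted input goes straight to the kernel after the O(N) check.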

How to Debug numba

You can temporarily change 2 lines of code in order to debug the numba implementation as plain Python with VS Code or another Python debugger:

  • Comment out the numba.jit decorator on the sliding_min_max() function in min_max_.py.
  • Do the same with the column_looper() function defined inside the generate_apply_looper() function in executor.py.
  • Your breakpoint inside the function will now hit!
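As an alternative to commenting decorators in and out by hand, the same effect can be sketched with a switchable decorator. This is a hypothetical pattern, not something the PR adds: `DEBUG`, `maybe_jit` and `running_max` are made-up names. Numba also documents a `NUMBA_DISABLE_JIT` environment variable; whether it is sufficient for the pandas code paths here is not covered by this PR.

```python
DEBUG = True  # flip to False to compile with Numba again


def maybe_jit(func):
    """Return func untouched while debugging, otherwise JIT-compile it."""
    if DEBUG:
        # Plain Python function: debugger breakpoints inside it will hit.
        return func
    import numba  # imported lazily, only needed on the compiled path

    return numba.jit(nopython=True)(func)


@maybe_jit
def running_max(values):
    best = values[0]
    for v in values:
        if v > best:
            best = v
    return best


print(running_max([3.0, 1.0, 4.0]))  # 4.0
```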

Misc Notes

The speed improvement of ~10% was confirmed in two ways:

  • As measured by pandas’ supplied asv benchmark suite (ratio of 0.80-0.91, depending on the particular test, on my hardware).
  • With a custom load test over a 2-million-element rolling window on a 300-million-element data set. (See the supplied bench.py.txt.) A single run of the test takes approx. 6-8 seconds and consumes ~15 GB of RAM on a PC with 32 GB of RAM.
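For readers who want to reproduce this kind of before/after comparison on a small scale, here is a rough timing sketch. It is not the supplied bench.py.txt and neither function is the pandas implementation; `naive_rolling_max`, `deque_rolling_max`, `best_time` and `speed_ratio` are hypothetical helpers that compare a brute-force fixed-window max with a monotonic-deque one.

```python
import time
from collections import deque


def naive_rolling_max(values, k):
    return [max(values[i - k + 1 : i + 1]) for i in range(k - 1, len(values))]


def deque_rolling_max(values, k):
    out = []
    dq = deque()  # indices whose values are strictly decreasing
    for i, v in enumerate(values):
        while dq and values[dq[-1]] <= v:
            dq.pop()  # dominated by the new element
        dq.append(i)
        if dq[0] <= i - k:
            dq.popleft()  # fell out of the window
        if i >= k - 1:
            out.append(values[dq[0]])
    return out


def best_time(func, *args, repeat=3):
    """Best wall-clock time of func(*args) over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - t0)
    return best


def speed_ratio(new_func, old_func, *args, repeat=3):
    """Time of new_func divided by time of old_func; below 1.0 means faster."""
    old = max(best_time(old_func, *args, repeat=repeat), 1e-12)
    return best_time(new_func, *args, repeat=repeat) / old


data = [(i * 7919) % 1000 for i in range(20000)]  # deterministic test data
assert deque_rolling_max(data, 50) == naive_rolling_max(data, 50)
print(f"ratio: {speed_ratio(deque_rolling_max, naive_rolling_max, data, 50):.2f}")
```

Taking the best of several runs reduces noise, but numbers from a sketch like this depend heavily on hardware and data size.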

@mroeschke (Member) left a comment

Thanks for the PR.

Before fully reviewing, it would be great if some things conformed more closely to our codebase, namely

  1. Remove bug_hunters directory
  2. Remove pandas/core/window/_min_max.py if it's not used by the Cython or Numba implementation
  3. Instead of a separate test_minmax.py create tests that would close the original issue in tests/window/test_rolling.py and tests/window/test_numba.py

@viable-alternative (Author) commented

@mroeschke, all done!
Edited my original comment to reflect what is actually being done. Thx!
(CI is still in progress. I will fix it up if anything fails.)

Labels: Bug; Window (rolling, ewma, expanding)
Development

Successfully merging this pull request may close these issues.

BUG: Partially incorrect results when using a custom indexer for a rolling window for max and min