-
Notifications
You must be signed in to change notification settings - Fork 927
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Patch __init__
of cudf
constructors to parse through cudf.pandas
proxy objects
#17878
base: branch-25.04
Are you sure you want to change the base?
Conversation
__init__
of cudf
constructors to parse through cudf.pandas
proxy
__init__
of cudf
constructors to parse through cudf.pandas
proxy__init__
of cudf
constructors to parse through cudf.pandas
proxy objects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I think these changes make sense. I have some questions.
cudf.Series.__init__ = wrap_init(_original_Series_init) | ||
cudf.Index.__init__ = wrap_init(_original_Index_init) | ||
cudf.DataFrame.__init__ = DataFrame_init_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFrame_init_
shares some logic with wrap_init
. Is there a way to get everything in the same patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish we could but then we will be having to do the kwargs.get
handling and all. For neatness purposes I kept DataFrame separate.
Co-authored-by: Matthew Murray <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some questions and suggestions for improvement, but they aren't blocking for me so I'm approving so you aren't waiting on me.
data_extracted = False | ||
if is_proxy_object(data): | ||
data = data.as_gpu_object() | ||
data_extracted = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data_extracted = False | |
if is_proxy_object(data): | |
data = data.as_gpu_object() | |
data_extracted = True | |
data_is_proxy = is_proxy_object(data) | |
if data_is_proxy: | |
data = data.as_gpu_object() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could even walrus it but I don't like those in cases where the lifetimes are not obvious (i.e. we'd be relying on data_is_proxy
persisting beyond the conditional where it is created, which is valid but less readable IMHO).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it 👍
|
||
|
||
def initial_setup(): | ||
init_patching() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why split up initial_setup
and init_patching
? They seem like they should just be one function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was thinking initial_setup
would serve as a utility for all future initialization-specific stuff. Merged both into one.
|
||
def test_cudf_series_from_cudf_pandas(): | ||
s = xpd.Series([1, 2, 3]) | ||
gs = cudf.Series(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line would have worked even without the changes in this PR, right? It would just have been extra slow because we would have turned the Series into a pandas Series and then back into a cudf Series? Is there a way that we can test that? For example, monkey-patch pd.Series.__init__
to always raise then run this test? pytest-xdist is multi-process, not multithreaded, so we shouldn't have to worry about locking to avoid breaking other tests.
We could also do something slightly less direct like monkey-patching as_gpu_object
to ensure that it's called. That doesn't feel like as solid of a test, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, let me try this out..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah that won't be a robust test either. We will either have to time it or cprofile the call and validate the call-stack. Timing is not so stable on CI runs. I'll work on adding a profiled pytest but will do so in a later PR (targeting 24.04).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vyasr I have added more robust testing that will validate what we are calling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am somewhat negative to this level of complexity and wonder if we can get away with just unwrapping in the "simple" (hopefully most common?) case
): | ||
self.__dict__.update(data.__dict__) | ||
return | ||
_original_DataFrame_init(self, data, index, columns, *args, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tl;dr: I think this is trying to promise too much.
I have to say I am rather opposed to this level of complexity. The dataframe and series constructors in cudf itself are already very complicated and this adds another level of introspection and area for divergence on top of what we already have.
I think I had anticipated that this patching would basically have been to try and intercept the case of cudf.DataFrame(cudf-pandas.DataFrame)
and cudf.Series(cudf-pandas.Series)
(and, I suppose index/multiindex, but I would view those as less critical).
I think unwrapping the object in the case of cudf.DataFrame(proxy_obj)
is OK, but this recursive unwrapping is not something I am in favour of.
Do we catch the majority of the use cases with something like:
def wrapper(data, *args, **kwargs):
if is_proxy_object(data) and not (args or kwargs):
return wrapped_init(data.as_gpu_object())
# OK, too complicated, just punt
return wrapped_init(data, *args, **kwargs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussing offline, simplified to the above suggested approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we are slipping this PR to 25.04, I would suggest that we try to optimize the other cases in the right way: by speeding up our constructors. I expect that with @mroeschke's work we'll be able to start chipping away at a lot of cudf's internal overhead as we rewrite cudf classic's internal object model. That should make this kind of hackery less and less necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vyasr Is there an issue that is tracking constructor refactoring/ internal object model rewrite? Until that is done can we proceed and merge this pr?
Description
Fixes: rapidsai/cuml#6232
This PR patches
Series
,Index
andDataFrame
constructors in such a way that true objects are extracted fromcudf.pandas
proxy objects if they are passed to any of these constructors.Checklist