Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch __init__ of cudf constructors to parse through cudf.pandas proxy objects #17878

Open
wants to merge 17 commits into
base: branch-25.04
Choose a base branch
from

Conversation

galipremsagar
Copy link
Contributor

Description

Fixes: rapidsai/cuml#6232

This PR patches Series, Index and DataFrame constructors in such a way that true objects are extracted from cudf.pandas proxy objects if they are passed to any of these constructors.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@galipremsagar galipremsagar added bug Something isn't working 3 - Ready for Review Ready for review by team non-breaking Non-breaking change labels Jan 30, 2025
@galipremsagar galipremsagar self-assigned this Jan 30, 2025
@galipremsagar galipremsagar requested a review from a team as a code owner January 30, 2025 19:39
@github-actions github-actions bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Jan 30, 2025
@galipremsagar galipremsagar changed the title Patch inits Patch __init__ of cudf constructors to parse through cudf.pandas proxy Jan 30, 2025
@galipremsagar galipremsagar changed the title Patch __init__ of cudf constructors to parse through cudf.pandas proxy Patch __init__ of cudf constructors to parse through cudf.pandas proxy objects Jan 30, 2025
Copy link
Contributor

@Matt711 Matt711 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I think these changes make sense. I have some questions.

python/cudf/cudf/pandas/_wrappers/pandas.py Outdated Show resolved Hide resolved
python/cudf/cudf/pandas/_wrappers/pandas.py Show resolved Hide resolved
Comment on lines +1798 to +1800
cudf.Series.__init__ = wrap_init(_original_Series_init)
cudf.Index.__init__ = wrap_init(_original_Index_init)
cudf.DataFrame.__init__ = DataFrame_init_
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrame_init_ shares some logic with wrap_init. Is there a way to get everything in the same patch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish we could but then we will be having to do the kwargs.get handling and all. For neatness purposes I kept DataFrame separate.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some questions and suggestions for improvement, but they aren't blocking for me so I'm approving so you aren't waiting on me.

python/cudf/cudf/pandas/_wrappers/pandas.py Show resolved Hide resolved
Comment on lines 1770 to 1773
data_extracted = False
if is_proxy_object(data):
data = data.as_gpu_object()
data_extracted = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
data_extracted = False
if is_proxy_object(data):
data = data.as_gpu_object()
data_extracted = True
data_is_proxy = is_proxy_object(data)
if data_is_proxy:
data = data.as_gpu_object()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could even walrus it but I don't like those in cases where the lifetimes are not obvious (i.e. we'd be relying on data_is_proxy persisting beyond the conditional where it is created, which is valid but less readable IMHO).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it 👍



def initial_setup():
init_patching()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why split up initial_setup and init_patching? They seem like they should just be one function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was thinking initial_setup would serve as a utility for all future initialization-specific stuff. Merged both into one.


def test_cudf_series_from_cudf_pandas():
s = xpd.Series([1, 2, 3])
gs = cudf.Series(s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line would have worked even without the changes in this PR, right? It would just have been extra slow because we would have turned the Series into a pandas Series and then back into a cudf Series? Is there a way that we can test that? For example, monkey-patch pd.Series.__init__ to always raise then run this test? pytest-xdist is multi-process, not multithreaded, so we shouldn't have to worry about locking to avoid breaking other tests.

We could also do something slightly less direct like monkey-patching as_gpu_object to ensure that it's called. That doesn't feel like as solid of a test, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, let me try this out..

Copy link
Contributor Author

@galipremsagar galipremsagar Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah that won't be a robust test either. We will either have to time it or cprofile the call and validate the call-stack. Timing is not so stable on CI runs. I'll work on adding a profiled pytest but will do so in a later PR (targeting 24.04).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vyasr I have added more robust testing that will validate what we are calling.

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am somewhat negative to this level of complexity and wonder if we can get away with just unwrapping in the "simple" (hopefully most common?) case

python/cudf/cudf/pandas/_wrappers/pandas.py Show resolved Hide resolved
python/cudf/cudf/pandas/_wrappers/pandas.py Outdated Show resolved Hide resolved
python/cudf/cudf/pandas/_wrappers/pandas.py Outdated Show resolved Hide resolved
):
self.__dict__.update(data.__dict__)
return
_original_DataFrame_init(self, data, index, columns, *args, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tl;dr: I think this is trying to promise too much.

I have to say I am rather opposed to this level of complexity. The dataframe and series constructors in cudf itself are already very complicated and this adds another level of introspection and area for divergence on top of what we already have.

I think I had anticipated that this patching would basically have been to try and intercept the case of cudf.DataFrame(cudf-pandas.DataFrame) and cudf.Series(cudf-pandas.Series) (and, I suppose index/multiindex, but I would view those as less critical).

I think unwrapping the object in the case of cudf.DataFrame(proxy_obj) is OK, but this recursive unwrapping is not something I am in favour of.

Do we catch the majority of the use cases with something like:

def wrapper(data, *args, **kwargs):
    if is_proxy_object(data) and not (args or kwargs):
        return wrapped_init(data.as_gpu_object())
    # OK, too complicated, just punt
    return wrapped_init(data, *args, **kwargs)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussing offline, simplified to the above suggested approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we are slipping this PR to 25.04, I would suggest that we try to optimize the other cases in the right way: by speeding up our constructors. I expect that with @mroeschke's work we'll be able to start chipping away at a lot of cudf's internal overhead as we rewrite cudf classic's internal object model. That should make this kind of hackery less and less necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vyasr Is there an issue that is tracking constructor refactoring/ internal object model rewrite? Until that is done can we proceed and merge this pr?

@galipremsagar galipremsagar requested a review from wence- January 31, 2025 16:02
@galipremsagar galipremsagar changed the base branch from branch-25.02 to branch-25.04 January 31, 2025 16:39
@galipremsagar galipremsagar requested review from a team as code owners January 31, 2025 16:39
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. cudf.polars Issues specific to cudf.polars labels Jan 31, 2025
@github-actions github-actions bot removed libcudf Affects libcudf (C++/CUDA) code. cudf.polars Issues specific to cudf.polars labels Jan 31, 2025
@galipremsagar galipremsagar removed request for a team and karthikeyann January 31, 2025 16:49
@galipremsagar galipremsagar requested a review from vyasr February 4, 2025 00:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working cudf.pandas Issues specific to cudf.pandas non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[BUG] - cuML LabelEncoder is 200x slower with cuDF-Pandas vs cuDF
6 participants