Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid double caching in mappers that derive from CachedMapper #585

Merged
merged 5 commits into from
Mar 10, 2025

Conversation

majosm
Copy link
Collaborator

@majosm majosm commented Feb 21, 2025

If a mapper derives from CachedMapper and overrides rec in a way that implements caching, it needs to call Mapper.rec instead of super().rec in its implementation in order to avoid executing the cache lookup/insertion logic twice. This PR intends to fix that.

The naive version of this fix had the unfortunate side effect of breaking deduplicate_data_wrappers, because it turns out that deduplicate_data_wrappers was taking advantage of this behavior in CachedMapAndCopyMapper in a subtle way. Here's a sketch of what was happening in the previous implementation:

Suppose we have 2 data wrappers a and b with the same data pointer.

With super().rec:

  1. map_fn maps a to itself, then the mapper copies a to a'; it caches the mapping a -> a' (twice, once in super().rec and then again in rec),
  2. map_fn maps b to a, then the mapper maps (via cache in super().rec call) a to a'; it caches the mapping b -> a'.

=> Only a' in output DAG.

With Mapper.rec:

  1. map_fn maps a to itself, then the mapper copies a to a'; it caches the mapping a -> a',
  2. map_fn maps b to a, then the mapper copies a to a''; it caches the mapping b -> a''.

=> Both a' and a'' in output DAG.

@inducer I remembered this morning that I had previously looked into this last fall (and luckily I wrote down all the details in our meeting notes). Back then I decided to set it aside and wait for #515 (after that change map_data_wrapper is no longer creating unnecessary copies, so it avoids the issue). But this time I thought I'd take a quick stab at refactoring deduplicate_data_wrappers anyway. Sticking the previous cached_data_wrapper_if_present implementation into map_data_wrapper should prevent the issue, because it gets rid of the CopyMapper implementation that was creating unnecessary new data wrappers.

With the changes in this PR I'm seeing a small improvement in transform_dag times on prediction (7% or so).

@majosm majosm marked this pull request as ready for review February 21, 2025 21:21
@majosm majosm requested a review from inducer February 21, 2025 21:22
@majosm
Copy link
Collaborator Author

majosm commented Feb 21, 2025

Ready for a look. Any idea what's up with the Doc CI? (Fixed by #586.)

…uper().rec usage in CachedMapAndCopyMapper

Here is a sketch of what happens with super().rec vs Mapper.rec for the previous implementation of
deduplicate_data_wrappers. Suppose we have 2 data wrappers a and b with the same data pointer.

With super().rec:

1) map_fn maps a to itself, then mapper copies a to a'; mapper caches a -> a' (twice, once in
   super().rec and then again in rec),
2) map_fn maps b to a, then mapper maps (via cache in super().rec call) a to a'; mapper caches
   b -> a'.

=> Only a' in output DAG.

With Mapper.rec:

1) map_fn maps a to itself, then mapper copies a to a'; caches a -> a',
2) map_fn maps b to a, then mapper copies a to a''; caches b -> a''.

=> Both a' and a'' in output DAG.
@majosm majosm force-pushed the avoid-double-caching branch from 5e6d355 to 4ff4017 Compare February 25, 2025 13:37
This was referenced Feb 27, 2025
@inducer inducer enabled auto-merge (squash) March 10, 2025 21:58
@inducer inducer merged commit ec98289 into inducer:main Mar 10, 2025
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants