Avoid double caching in mappers that derive from CachedMapper
#585
If a mapper derives from `CachedMapper` and overrides `rec` in a way that implements caching, it needs to call `Mapper.rec` instead of `super().rec` in its implementation in order to avoid executing the cache lookup/insertion logic twice. This PR intends to fix that.

The naive version of this fix had the unfortunate side effect of breaking `deduplicate_data_wrappers`, because it turns out that `deduplicate_data_wrappers` was taking advantage of this behavior in `CachedMapAndCopyMapper` in a subtle way. Here's a sketch of what was happening in the previous implementation:

Suppose we have 2 data wrappers `a` and `b` with the same data pointer.

With `super().rec`:

- `map_fn` maps `a` to itself, then the mapper copies `a` to `a'`; it caches the mapping `a` -> `a'` (twice, once in `super().rec` and then again in `rec`),
- `map_fn` maps `b` to `a`, then the mapper maps (via cache in `super().rec` call) `a` to `a'`; it caches the mapping `b` -> `a'`.

=> Only `a'` in output DAG.

With `Mapper.rec`:

- `map_fn` maps `a` to itself, then the mapper copies `a` to `a'`; it caches the mapping `a` -> `a'`,
- `map_fn` maps `b` to `a`, then the mapper copies `a` to `a''`; it caches the mapping `b` -> `a''`.

=> Both `a'` and `a''` in output DAG.

@inducer I remembered this morning that I had previously looked into this last fall (and luckily I wrote down all the details in our meeting notes). Back then I decided to set it aside and wait for #515 (after that change `map_data_wrapper` is no longer creating unnecessary copies, so it avoids the issue). But this time I thought I'd take a quick stab at refactoring `deduplicate_data_wrappers` anyway. Sticking the previous `cached_data_wrapper_if_present` implementation into `map_data_wrapper` should prevent the issue, because it gets rid of the `CopyMapper` implementation that was creating unnecessary new data wrappers.

With the changes in this PR I'm seeing a small improvement in `transform_dag` times on prediction (7% or so).
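
For anyone who wants to reproduce the sketch above outside the real code base, here is a minimal toy model. It is not the actual implementation: the classes below only borrow the names `Mapper`, `CachedMapper`, and `map_data_wrapper`, while `DataWrapper`, `MapAndCopyMapper`, `DedupMapper`, `make_dedup_map_fn`, the `use_super` flag, and the `lookups` counter are hypothetical stand-ins made up for illustration. It assumes identity-based caching and a `map_fn` that returns the first wrapper seen for a given data pointer.

```python
from dataclasses import dataclass


@dataclass(frozen=True, eq=False)
class DataWrapper:
    """Toy node; eq=False keeps identity-based hashing and comparison."""
    name: str
    data_ptr: int


class Mapper:
    """Toy base mapper: dispatches to the node handler, no caching."""
    def rec(self, expr):
        return self.map_data_wrapper(expr)


class CachedMapper(Mapper):
    """Toy cached mapper: memoizes one result per input node."""
    def __init__(self):
        self._cache = {}
        self.lookups = 0  # counts cache lookups, to expose double caching

    def rec(self, expr):
        self.lookups += 1
        if expr in self._cache:
            return self._cache[expr]
        result = Mapper.rec(self, expr)
        self._cache[expr] = result
        return result


class MapAndCopyMapper(CachedMapper):
    """Toy mapper that applies map_fn, then copies the node (hypothetical)."""
    def __init__(self, map_fn, use_super):
        super().__init__()
        self.map_fn = map_fn
        self.use_super = use_super

    def map_data_wrapper(self, expr):
        # Always builds a new object, like a copying mapper would.
        return DataWrapper(expr.name + "_copy", expr.data_ptr)

    def rec(self, expr):
        self.lookups += 1
        if expr in self._cache:
            return self._cache[expr]
        mapped = self.map_fn(expr)
        if self.use_super:
            # Runs CachedMapper's lookup/insertion a second time.
            result = super().rec(mapped)
        else:
            # Skips the second lookup, but also skips the cache hit on `a`.
            result = Mapper.rec(self, mapped)
        self._cache[expr] = result
        return result


class DedupMapper(CachedMapper):
    """Toy refactor: deduplicate inside map_data_wrapper, without copying."""
    def __init__(self):
        super().__init__()
        self._first_seen = {}

    def map_data_wrapper(self, expr):
        return self._first_seen.setdefault(expr.data_ptr, expr)


def make_dedup_map_fn():
    """Return a map_fn that maps each node to the first node with its data."""
    first_seen = {}

    def map_fn(expr):
        return first_seen.setdefault(expr.data_ptr, expr)

    return map_fn


def run(mapper):
    """Map two wrappers sharing data; return (distinct outputs, lookups)."""
    a = DataWrapper("a", data_ptr=0x1000)
    b = DataWrapper("b", data_ptr=0x1000)
    outputs = [mapper.rec(a), mapper.rec(b)]
    return len(set(outputs)), mapper.lookups


if __name__ == "__main__":
    print(run(MapAndCopyMapper(make_dedup_map_fn(), use_super=True)))   # (1, 4)
    print(run(MapAndCopyMapper(make_dedup_map_fn(), use_super=False)))  # (2, 2)
    print(run(DedupMapper()))                                           # (1, 2)
```

In this toy, the `super().rec` variant yields one output node at the cost of four cache lookups, the naive `Mapper.rec` variant halves the lookups but yields two copies, and moving the deduplication into `map_data_wrapper` gives one output node with two lookups.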