Optimize L2P for GPUs #158
How come the CIs are failing? Is there an unmet dependency somewhere?
Bad merge. Hopefully fixed now.
sumpy/expansion/local.py
Outdated
@@ -405,6 +410,14 @@ def loopy_translate_from(self, src_expansion):
f"A direct loopy kernel for translation from "
f"{src_expansion} to {self} is not implemented.")

def get_loopy_evaluator(self, kernels: Sequence[Kernel]) -> lp.TranslationUnit:
This should be named consistently with the above. How about `loopy_evaluate`?
Currently we have a few names that are inconsistent with each other:
- `get_loopy_expansion_formation` (for p2e)
- `get_loopy_evaluator` (e2p)
- `loopy_translate_from` (m2l)
- `preprocess_multipole_loopy_knl`
- `postprocess_local_loopy_knl`

Any suggestions on which to use?
I think I would prefer `loopy_purpose_of_thing`. No `get`. But the name should indicate whether optimizations are also being returned.
Btw, I would not be opposed to cleaning up the inconsistencies in this PR.
`loopy_evaluate_with_optimizations`?
That doesn't work: it reads as "evaluate with optimizations", which is not what this is. Maybe making it `loopy_noun_and_optimizations`, so `loopy_evaluator_and_optimizations`, might be better?
# DifferentiatedExprDerivativeTaker and sympy expressions, so we need to
# make the taker a DifferentitatedExprDerivativeTaker instance.
base_taker = DifferentiatedExprDerivativeTaker(base_taker,
{tuple([0]*self.dim): 1})
Could you explain the background to this change?
This is not true anymore. `AxisTargetDerivative.postprocess_at_target` and `DirectionalTargetDerivative.postprocess_at_target` now handle `ExprDerivativeTaker`.
sumpy/e2p.py
Outdated
@@ -81,15 +84,18 @@ def __init__(self, ctx, expansion, kernels,
def default_name(self):
pass

@memoize_method
def get_cached_loopy_knl_and_optimizations(self):
return self.expansion.get_loopy_evaluator(self.kernels)
It seems that this would simply fail if E2P were used for M2P, right?
No, there's an implementation for M2P as well. (Just not super optimized)
sumpy/e2p.py
Outdated
@@ -81,15 +84,18 @@ def __init__(self, ctx, expansion, kernels,
def default_name(self):
pass

@memoize_method
def get_cached_loopy_knl_and_optimizations(self):
I don't like the naming of this; there's already `get_kernel`. The name should reflect that we're talking about the evaluator. Plus, whether or not something is memoized isn't typically reflected in its name.
Fixed now.
def loopy_evaluator(self, kernels: Sequence[Kernel]) -> lp.TranslationUnit:
def loopy_evaluator_and_optimizations(self, kernels: Sequence[Kernel]) \
-> Tuple[lp.TranslationUnit, Sequence[
Callable[[lp.TranslationUnit], lp.TranslationUnit]]]:
- One optimization callable might suffice. `loopy_evaluator_and_transform`?
- Consistency with `get_inner_knl_and_optimizations`?

(Don't feel strongly about any of this.)
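
On the first point, a sequence of optimization callables can always be folded into a single one. A minimal sketch, assuming each optimization maps a kernel to a kernel; `compose_transforms` and the string stand-in for a kernel are hypothetical and not part of sumpy or loopy:

```python
from functools import reduce
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def compose_transforms(
        transforms: Sequence[Callable[[T], T]]) -> Callable[[T], T]:
    """Fold several kernel-to-kernel callables into one, applied in order."""
    def composed(knl: T) -> T:
        return reduce(lambda acc, transform: transform(acc), transforms, knl)
    return composed


# Toy stand-ins for loopy transformations, acting on a string "kernel".
def tag_inames(knl: str) -> str:
    return knl + "+tagged"


def add_prefetch(knl: str) -> str:
    return knl + "+prefetch"


single_transform = compose_transforms([tag_inames, add_prefetch])
result = single_transform("knl")
```

With this, the method could return one callable instead of a sequence, at the cost of callers no longer being able to inspect or skip individual steps.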
make_e2p_loopy_kernel)
try:
return make_l2p_loopy_kernel_for_volume_taylor(self, kernels)
except NotImplementedError:
Make a sub-brand of `NotImplementedError` to make sure this is specific.
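
A sketch of what that could look like. All names here (`SpecializedKernelNotAvailableError`, `make_specialized_kernel`, `make_generic_kernel`) are hypothetical placeholders, not the names used in the PR. The point is that the fallback only triggers for the dedicated subclass, so an unrelated `NotImplementedError` raised deeper in the call stack is not silently swallowed:

```python
class SpecializedKernelNotAvailableError(NotImplementedError):
    """Hypothetical marker: no specialized kernel exists for this input."""


def make_specialized_kernel(supported: bool) -> str:
    # Stand-in for a specialized kernel generator.
    if not supported:
        raise SpecializedKernelNotAvailableError("no specialized kernel")
    return "specialized"


def make_generic_kernel() -> str:
    # Stand-in for the generic (less optimized) fallback.
    return "generic"


def make_kernel(supported: bool) -> str:
    try:
        return make_specialized_kernel(supported)
    except SpecializedKernelNotAvailableError:
        # Only the dedicated subclass triggers the fallback.
        return make_generic_kernel()
```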
slowest_axis = axis_permutation[0]
c = max_mi[slowest_axis]
v = [pymbolic.var(f"x{i}") for i in range(dim)]
- I think it'd be better if these were named `i<something>`.
- Name the "vector" variable similar to the actual iname names, to minimize confusion.
for deriv_id in deriv_id_to_coeff])

def get_domains(v, iorder, with_sync):
domains = [f"{{ [{x0}_outer]: 0<={x0}_outer<={order//c} }}"]
domains = [f"{{ [{x0}_outer]: 0<={x0}_outer<={order//c} }}"]
"""
:param with_sync: Whether to expose a loop nesting level that
is finer-grained than order, for synchronization purposes.
"""
domains = [f"{{ [{x0}_outer]: 0<={x0}_outer<={order//c} }}"]
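
For readers unfamiliar with the doubled braces: inside an f-string, `{{` and `}}` produce literal braces, which the ISL-style set syntax needs. A small standalone sketch (the helper `outer_domain` and the sample values are made up for illustration):

```python
def outer_domain(iname: str, order: int, c: int) -> str:
    """Build the domain string for the outer loop over blocks of size c."""
    # "{{" and "}}" become literal "{" and "}" in the resulting string.
    return f"{{ [{iname}_outer]: 0<={iname}_outer<={order//c} }}"


domain = outer_domain("x0", order=5, c=2)
print(domain)  # { [x0_outer]: 0<=x0_outer<=2 }
```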
# the previous c rows set coeffs_copy[p-1, :]
# and then read from coeffs_copy[p, :].
`p` and `p-1`: add mod 2.
result = 0
for mi, coeff in expr_dict.items():
result += coeff * self._diff(expr, bvec, mi)
return result
`return sum(...)` (also for the source version)
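
The suggested form, sketched standalone. `combine` and the `diff` callable are hypothetical stand-ins for the method and `self._diff(expr, bvec, mi)` in the diff above:

```python
from typing import Callable, Mapping, Tuple

MultiIndex = Tuple[int, ...]


def combine(expr_dict: Mapping[MultiIndex, float],
            diff: Callable[[MultiIndex], float]) -> float:
    """Sum coeff * diff(mi) over all entries, as a single expression."""
    return sum(coeff * diff(mi) for mi, coeff in expr_dict.items())


# Toy check: diff(mi) = |mi| + 1, so the total is 2*1 + 3*2.
total = combine({(0, 0): 2, (1, 0): 3}, lambda mi: sum(mi) + 1)
```

`sum` starts from the integer `0`, matching the `result = 0` initialization in the loop version.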
@@ -263,6 +273,18 @@ def get_derivative_coeff_dict_at_source(self, expr_dict):
"""
return expr_dict

def get_derivative_coeff_dict_at_target(self, expr_dict):
Type-annotate (also for source)
@@ -263,6 +273,18 @@ def get_derivative_coeff_dict_at_source(self, expr_dict):
"""
return expr_dict

def get_derivative_coeff_dict_at_target(self, expr_dict):
- Describe better what the multi-indices on the input and output side are.
- Generate the source/target docstrings from one source via `format`.
- Name: `apply_target_transformation_to_derivative_coeff_dict`
- Maybe make an explicit type that wraps the `expr_dict` as you suggested.
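
One way the second bullet could look, as a sketch only. The template text, the class `KernelSketch`, and the source-side method name are assumptions made for illustration; only the target-side name comes from the suggestion above:

```python
_DOC_TEMPLATE = """Apply the {where} derivative transformation to *expr_dict*.

:arg expr_dict: maps a multi-index ``mi`` to a coefficient ``coeff``.
:returns: a dictionary of the same form, after the {where} transformation.
"""


class KernelSketch:
    # Hypothetical stand-in for the kernel base class.
    def apply_source_transformation_to_derivative_coeff_dict(self, expr_dict):
        return expr_dict

    def apply_target_transformation_to_derivative_coeff_dict(self, expr_dict):
        return expr_dict


# Fill in both docstrings from the single template.
KernelSketch.apply_source_transformation_to_derivative_coeff_dict.__doc__ = \
    _DOC_TEMPLATE.format(where="source")
KernelSketch.apply_target_transformation_to_derivative_coeff_dict.__doc__ = \
    _DOC_TEMPLATE.format(where="target")
```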
r"""Get the derivative transformation of the expression at target
represented by the dictionary expr_dict which is mapping from multi-index
`mi` to coefficient `coeff`.
Expression represented by the dictionary `expr_dict` is
Expression represented by the dictionary `expr_dict` is
The expression represented by the dictionary `expr_dict` is
knl = lp.tag_inames(knl, {"itgt_box": "g.0"})
def get_optimized_kernel(self, max_ntargets_in_one_box):
_, optimizations = self.get_loopy_evaluator_and_optimizations()
knl = self.get_kernel(max_ntargets_in_one_box=max_ntargets_in_one_box)
Add a comment to explain the idea?
Unsubscribing... @-mention or request review once it's ready for a look or needs attention.
Depends on #153
Depends on inducer/loopy#742
For compressed Taylor series L2P, we optimize the calculation by computing the uncompressed coefficients from the compressed ones in parallel, using the work items in a group. Say the expansion is compressed along the z axis and only `z=0, 1` are stored. We calculate the parts of the Taylor series for `z=0, 1` first. Then,
- `z=2, 3` from `z=0, 1`
- `z=4, 5` from `z=2, 3`

For biharmonic 2D, it's slightly different. We use `z=0, 1, 2, 3` to first calculate the parts of the Taylor series, then
- `z=4, 5` from `z=0, 1, 2, 3`
- `z=4, 5` into `z=0, 1` storage
- `z=6, 7` from `z=2, 3, 4, 5`
- `z=6, 7` into `z=2, 3` storage
- `z=8, 9` from `z=4, 5, 6, 7`
- `z=8, 9` into `z=4, 5` storage
- `z=10, 11` from `z=6, 7, 8, 9`
- `z=10, 11` into `z=6, 7` storage
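
The schedule above can be written down as a small Python sketch. This only enumerates which rows are computed from which and where they land; it is not the loopy kernel from this PR. The function name `uncompression_schedule` and the slot rule `z % nstored` are assumptions chosen to reproduce the listed steps:

```python
from typing import List, Tuple

Step = Tuple[List[int], List[int], List[int]]


def uncompression_schedule(max_z: int, nstored: int,
                           nper_step: int = 2) -> List[Step]:
    """Return (new_rows, source_rows, storage_slots) for each step.

    nstored is the number of z rows kept in storage at any time
    (2 in the first example, 4 for biharmonic 2D).
    """
    steps = []
    start = nstored
    while start <= max_z:
        new_rows = list(range(start, min(start + nper_step, max_z + 1)))
        # The new rows are computed from the nstored rows preceding them.
        source_rows = list(range(start - nstored, start))
        # Assumed reuse rule: row z overwrites storage slot z mod nstored.
        slots = [z % nstored for z in new_rows]
        steps.append((new_rows, source_rows, slots))
        start += nper_step
    return steps


for new_rows, source_rows, slots in uncompression_schedule(11, nstored=4):
    print(f"z={new_rows} from z={source_rows}, stored in slots {slots}")
```

For `nstored=4` the slots alternate between `[0, 1]` and `[2, 3]`, which matches "`z=8, 9` into `z=4, 5` storage" since `z=4, 5` themselves live in the original `z=0, 1` slots.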