Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize L2P for GPUs #158

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Optimize L2P for GPUs #158

wants to merge 4 commits into from

Conversation

isuruf
Copy link
Collaborator

@isuruf isuruf commented Feb 15, 2023

Depends on #153
Depends on inducer/loopy#742

For compressed Taylor series L2P, we optimize the calculation by calculating the uncompressed coefficients from compressed ones in parallel using work items in a group. Let's say we are compressed using z axis and only z=0, 1 are there, we calculate the parts of the Taylor series for z=0, 1 first. Then,

  • calculate coeffs for z=2, 3 from z=0, 1
  • synchronize
  • calculate the parts of the Taylor series
  • calculate coeffs for z=4, 5 from z=2, 3
  • synchronize
  • calculate the parts of the Taylor series
  • so on

For biharmonic 2D, it's slightly different. We use z=0,1,2,3 to first calculate the parts of the Taylor series, then

  • calculate coeffs for z=4, 5 from z=0, 1, 2, 3
  • synchronize
  • copy z=4, 5 into z=0, 1 storage
  • calculate coeffs for z=6, 7 from z=2, 3, 4, 5
  • synchronize
  • copy z=6, 7 into z=2, 3 storage
  • synchronize
  • calculate the parts of the Taylor series
  • calculate coeffs for z=8, 9 from z=4, 5, 6, 7
  • synchronize
  • copy z=8, 9 into z=4, 5 storage
  • calculate coeffs for z=10, 11 from z=6, 7, 8, 9
  • synchronize
  • copy z=10, 11 into z=6, 7 storage
  • synchronize
  • calculate the parts of the Taylor series

@isuruf isuruf changed the title Optimize E2P for GPUs Optimize L2P for GPUs Feb 15, 2023
@isuruf isuruf force-pushed the e2p_opt branch 4 times, most recently from f39226d to 95132cb Compare February 17, 2023 09:37
@isuruf isuruf marked this pull request as ready for review February 17, 2023 18:16
@isuruf isuruf force-pushed the e2p_opt branch 2 times, most recently from 66e1fa4 to cfe3567 Compare May 17, 2023 19:50
@inducer
Copy link
Owner

inducer commented May 17, 2023

How come the CIs are failing? Is there an unmet dependency somewhere?

@isuruf
Copy link
Collaborator Author

isuruf commented May 17, 2023

Bad merge. Hopefully fixed now.

@@ -405,6 +410,14 @@ def loopy_translate_from(self, src_expansion):
f"A direct loopy kernel for translation from "
f"{src_expansion} to {self} is not implemented.")

def get_loopy_evaluator(self, kernels: Sequence[Kernel]) -> lp.TranslationUnit:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be named consistently with the above. How about loopy_evaluate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we have a few names that are inconsistent with each other.

get_loopy_expansion_formation  (for p2e)
get_loopy_evaluator  (e2p)
loopy_translate_from (m2l)
preprocess_multipole_loopy_knl
postprocess_local_loopy_knl

Any suggestions on which to use?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would prefer loopy_purpose_of_thing. No get. But the name should indicate whether optimizations are also being returned.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I would not be opposed to cleaning up the inconsistencies in this PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loopy_evaluate_with_optimizations?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't work: it reads as "evaluate with optimizations", which is not what this is. Maybe making it loopy_noun_and_optimizations, so loopy_evaluator_and_optimizations might be better?

# DifferentiatedExprDerivativeTaker and sympy expressions, so we need to
# make the taker a DifferentitatedExprDerivativeTaker instance.
base_taker = DifferentiatedExprDerivativeTaker(base_taker,
{tuple([0]*self.dim): 1})
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain the background to this change?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not true anymore. AxisTargetDerivative.postprocess_at_target and DirectionalTargetDerivative.postprocess_at_target now handles ExprDerivativeTaker.

sumpy/e2p.py Outdated
@@ -81,15 +84,18 @@ def __init__(self, ctx, expansion, kernels,
def default_name(self):
pass

@memoize_method
def get_cached_loopy_knl_and_optimizations(self):
return self.expansion.get_loopy_evaluator(self.kernels)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this would simply fail if E2P were used for M2P, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, there's an implementation for M2P as well. (Just not super optimized)

sumpy/e2p.py Outdated
@@ -81,15 +84,18 @@ def __init__(self, ctx, expansion, kernels,
def default_name(self):
pass

@memoize_method
def get_cached_loopy_knl_and_optimizations(self):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the naming of this, there's already get_kernel. The name should reflect that we're talking about the evaluator. Plus whether or not something is memoized isn't typically reflected in its name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed now.

@@ -405,6 +410,14 @@ def loopy_translate_from(self, src_expansion):
f"A direct loopy kernel for translation from "
f"{src_expansion} to {self} is not implemented.")

def get_loopy_evaluator(self, kernels: Sequence[Kernel]) -> lp.TranslationUnit:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would prefer loopy_purpose_of_thing. No get. But the name should indicate whether optimizations are also being returned.

@@ -405,6 +410,14 @@ def loopy_translate_from(self, src_expansion):
f"A direct loopy kernel for translation from "
f"{src_expansion} to {self} is not implemented.")

def get_loopy_evaluator(self, kernels: Sequence[Kernel]) -> lp.TranslationUnit:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I would not be opposed to cleaning up the inconsistencies in this PR.

def loopy_evaluator(self, kernels: Sequence[Kernel]) -> lp.TranslationUnit:
def loopy_evaluator_and_optimizations(self, kernels: Sequence[Kernel]) \
-> Tuple[lp.TranslationUnit, Sequence[
Callable[[lp.TranslationUnit], lp.TranslationUnit]]]:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • One optimization callable might suffice.
  • loopy_evaluator_and_transform?
  • Consistency with get_inner_knl_and_optimizations?

(Don't feel strongly about any of this.)

make_e2p_loopy_kernel)
try:
return make_l2p_loopy_kernel_for_volume_taylor(self, kernels)
except NotImplementedError:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make a sub-brand of NotImplementedError to make sure this is specific.


slowest_axis = axis_permutation[0]
c = max_mi[slowest_axis]
v = [pymbolic.var(f"x{i}") for i in range(dim)]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I think it'd be better if these were named i<something>.
  • Name the "vector" variable similar to the actual iname names, to avoid minimize confusion.

for deriv_id in deriv_id_to_coeff])

def get_domains(v, iorder, with_sync):
domains = [f"{{ [{x0}_outer]: 0<={x0}_outer<={order//c} }}"]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
domains = [f"{{ [{x0}_outer]: 0<={x0}_outer<={order//c} }}"]
"""
:param with_sync: Whether to expose a loop nesting level that
is finer-grained than order, for synchronization purposes.
"""
domains = [f"{{ [{x0}_outer]: 0<={x0}_outer<={order//c} }}"]

Comment on lines +415 to +416
# the previous c rows set coeffs_copy[p-1, :]
# and then read from coeffs_copy[p, :].
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p and p-1: add mod 2.

Comment on lines +259 to +262
result = 0
for mi, coeff in expr_dict.items():
result += coeff * self._diff(expr, bvec, mi)
return result
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return sum(...) (also for the source version)

@@ -263,6 +273,18 @@ def get_derivative_coeff_dict_at_source(self, expr_dict):
"""
return expr_dict

def get_derivative_coeff_dict_at_target(self, expr_dict):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type-annotate (also for source)

@@ -263,6 +273,18 @@ def get_derivative_coeff_dict_at_source(self, expr_dict):
"""
return expr_dict

def get_derivative_coeff_dict_at_target(self, expr_dict):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Describe better what the multi-indices on input and output side.
  • Generate the source/target docstrings from one source via format.
  • Name: apply_target_transformation_to_derivative_coeff_dict
  • Maybe make an explicit type that wraps the expr_dict as you suggested.

r"""Get the derivative transformation of the expression at target
represented by the dictionary expr_dict which is mapping from multi-index
`mi` to coefficient `coeff`.
Expression represented by the dictionary `expr_dict` is
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Expression represented by the dictionary `expr_dict` is
The expression represented by the dictionary `expr_dict` is

knl = lp.tag_inames(knl, {"itgt_box": "g.0"})
def get_optimized_kernel(self, max_ntargets_in_one_box):
_, optimizations = self.get_loopy_evaluator_and_optimizations()
knl = self.get_kernel(max_ntargets_in_one_box=max_ntargets_in_one_box)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment to explain the idea?

@inducer
Copy link
Owner

inducer commented Sep 11, 2023

Unsubscribing... @-mention or request review once it's ready for a look or needs attention.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants