Optimize L2P for GPUs #158

isuruf · 2023-02-15T18:16:49Z

Depends on #153
Depends on inducer/loopy#742

For compressed Taylor series L2P, we optimize the calculation by computing the uncompressed coefficients from the compressed ones in parallel, using the work items in a group. Suppose the expansion is compressed along the z axis, so that only the coefficients with z=0, 1 are stored. We first calculate the parts of the Taylor series for z=0, 1. Then,

calculate coeffs for z=2, 3 from z=0, 1
synchronize
calculate the parts of the Taylor series
calculate coeffs for z=4, 5 from z=2, 3
synchronize
calculate the parts of the Taylor series
and so on
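The steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the loopy kernel from this PR: it uses the 2D Laplace equation with y as the compressed axis (instead of z), assumes the recurrence D[i, j] = -D[i+2, j-2] that follows from u_xx + u_yy = 0, and runs sequentially, with comments marking where a work-group barrier would go. The function names are hypothetical.

```python
import math


def harmonic_derivs(n, center, order):
    """Derivatives D[i, j] of Re((x + iy)^n) at `center` for i + j <= order."""
    z = complex(*center)
    derivs = {}
    for i in range(order + 1):
        for j in range(order + 1 - i):
            m = i + j
            # m-th complex derivative of z^n; d/dy contributes a factor of i
            fm = math.perm(n, m) * z ** (n - m) if m <= n else 0
            derivs[i, j] = ((1j ** j) * fm).real
    return derivs


def l2p_compressed(stored, order, h):
    """Evaluate the Taylor series at offset `h` from the stored rows j = 0, 1."""
    hx, hy = h
    c = 2  # number of rows handled per step (assumption: Laplace, so c = 2)
    rows = {j: {i: stored[i, j] for i in range(order + 1 - j)}
            for j in range(min(c, order + 1))}
    result = 0.0
    for j0 in range(0, order + 1, c):
        if j0 > 0:
            # calculate coeffs for rows j0, j0+1 from rows j0-2, j0-1
            rows = {j: {i: -rows[j - 2][i + 2] for i in range(order + 1 - j)}
                    for j in range(j0, min(j0 + c, order + 1))}
            # (a GPU kernel would synchronize the work group here)
        # calculate the parts of the Taylor series for the current rows
        for j, row in rows.items():
            for i, coeff in row.items():
                result += (coeff * hx ** i * hy ** j
                           / (math.factorial(i) * math.factorial(j)))
    return result
```

Passing only the entries of `harmonic_derivs` with j < 2 as `stored` reproduces the value of the harmonic polynomial at the target, which is a quick way to check the recurrence.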

For biharmonic 2D, it's slightly different. We use z=0,1,2,3 to first calculate the parts of the Taylor series, then

calculate coeffs for z=4, 5 from z=0, 1, 2, 3
synchronize
copy z=4, 5 into z=0, 1 storage
calculate coeffs for z=6, 7 from z=2, 3, 4, 5
synchronize
copy z=6, 7 into z=2, 3 storage
synchronize
calculate the parts of the Taylor series
calculate coeffs for z=8, 9 from z=4, 5, 6, 7
synchronize
copy z=8, 9 into z=4, 5 storage
calculate coeffs for z=10, 11 from z=6, 7, 8, 9
synchronize
copy z=10, 11 into z=6, 7 storage
synchronize
calculate the parts of the Taylor series
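A rough Python sketch of the biharmonic variant follows. It is an illustration under assumptions, not the code in this PR: y plays the role of the compressed axis, the recurrence D[i, j] = -2 D[i+2, j-2] - D[i+4, j-4] is the one implied by u_xxxx + 2 u_xxyy + u_yyyy = 0, the four-row storage is modeled as slots indexed by j mod 4, and the order is assumed to be at least 3. All names are hypothetical, and the barriers appear only as comments.

```python
import math

RE_I_POW = [1, 0, -1, 0]  # Re(i^k) for k mod 4


def r2_times_harmonic(n):
    """Coefficients {(a, b): c} of (x^2 + y^2) * Re((x + iy)^n) (biharmonic)."""
    poly = {}
    for k in range(n + 1):
        c = math.comb(n, k) * RE_I_POW[k % 4]
        if c:
            for key in ((n - k + 2, k), (n - k, k + 2)):
                poly[key] = poly.get(key, 0) + c
    return poly


def deriv_at(poly, i, j, cx, cy):
    """d_x^i d_y^j of the polynomial at (cx, cy)."""
    return sum(c * math.perm(a, i) * cx ** (a - i)
               * math.perm(b, j) * cy ** (b - j)
               for (a, b), c in poly.items() if a >= i and b >= j)


def taylor_rows(slots, js, hx, hy):
    """Partial Taylor sum over rows `js`; row j lives in slots[j % 4]."""
    return sum(coeff * hx ** i * hy ** j
               / (math.factorial(i) * math.factorial(j))
               for j in js for i, coeff in slots[j % 4].items())


def l2p_biharmonic(stored, order, h):
    """`stored[j]` maps i -> D[i, j] for j = 0..3 (assumes order >= 3)."""
    hx, hy = h
    slots = [dict(stored[j]) for j in range(4)]
    result = taylor_rows(slots, range(4), hx, hy)
    for j0 in range(4, order + 1, 4):
        for half in (j0, j0 + 2):
            js = [j for j in (half, half + 1) if j <= order]
            # row j needs row j-2 and row j-4; the latter sits in slot j % 4
            new = {j: {i: -2 * slots[(j - 2) % 4][i + 2] - slots[j % 4][i + 4]
                       for i in range(order + 1 - j)} for j in js}
            # (synchronize) then copy the new rows over the oldest two rows
            for j in js:
                slots[j % 4] = new[j]
            # (synchronize)
        result += taylor_rows(slots, range(j0, min(j0 + 4, order + 1)), hx, hy)
    return result
```

The new rows are computed into a temporary before being copied into the slots they replace, which mirrors the "calculate, synchronize, copy" ordering in the list above.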

inducer · 2023-05-17T20:17:24Z

How come the CIs are failing? Is there an unmet dependency somewhere?

isuruf · 2023-05-17T20:18:58Z

Bad merge. Hopefully fixed now.

inducer · 2023-06-02T00:08:45Z

sumpy/expansion/local.py

@@ -405,6 +410,14 @@ def loopy_translate_from(self, src_expansion):
            f"A direct loopy kernel for translation from "
            f"{src_expansion} to {self} is not implemented.")

+    def get_loopy_evaluator(self, kernels: Sequence[Kernel]) -> lp.TranslationUnit:


This should be named consistently with the above. How about loopy_evaluate?

Currently we have a few names that are inconsistent with each other.

get_loopy_expansion_formation (for p2e)
get_loopy_evaluator (e2p)
loopy_translate_from (m2l)
preprocess_multipole_loopy_knl
postprocess_local_loopy_knl

Any suggestions on which to use?

I think I would prefer loopy_purpose_of_thing. No get. But the name should indicate whether optimizations are also being returned.

Btw, I would not be opposed to cleaning up the inconsistencies in this PR.

loopy_evaluate_with_optimizations?

That doesn't work: it reads as "evaluate with optimizations", which is not what this is. Maybe making it loopy_noun_and_optimizations, so loopy_evaluator_and_optimizations might be better?

inducer · 2023-06-02T00:09:16Z

sumpy/expansion/multipole.py

-        # DifferentiatedExprDerivativeTaker and sympy expressions, so we need to
-        # make the taker a DifferentitatedExprDerivativeTaker instance.
-        base_taker = DifferentiatedExprDerivativeTaker(base_taker,
-                {tuple([0]*self.dim): 1})


Could you explain the background to this change?

This is not true anymore. AxisTargetDerivative.postprocess_at_target and DirectionalTargetDerivative.postprocess_at_target now handle ExprDerivativeTaker.

inducer · 2023-06-02T00:30:37Z

sumpy/e2p.py

@@ -81,15 +84,18 @@ def __init__(self, ctx, expansion, kernels,
    def default_name(self):
        pass

+    @memoize_method
+    def get_cached_loopy_knl_and_optimizations(self):
+        return self.expansion.get_loopy_evaluator(self.kernels)


It seems that this would simply fail if E2P were used for M2P, right?

No, there's an implementation for M2P as well. (Just not super optimized)

inducer · 2023-06-02T00:32:42Z

sumpy/e2p.py

@@ -81,15 +84,18 @@ def __init__(self, ctx, expansion, kernels,
    def default_name(self):
        pass

+    @memoize_method
+    def get_cached_loopy_knl_and_optimizations(self):


I don't like the naming of this; there's already get_kernel. The name should reflect that we're talking about the evaluator. Plus, whether or not something is memoized isn't typically reflected in its name.

inducer · 2023-06-02T00:35:06Z

sumpy/expansion/local.py

@@ -405,6 +410,14 @@ def loopy_translate_from(self, src_expansion):
            f"A direct loopy kernel for translation from "
            f"{src_expansion} to {self} is not implemented.")

+    def get_loopy_evaluator(self, kernels: Sequence[Kernel]) -> lp.TranslationUnit:


I think I would prefer loopy_purpose_of_thing. No get. But the name should indicate whether optimizations are also being returned.

inducer · 2023-06-02T00:35:49Z

sumpy/expansion/local.py

@@ -405,6 +410,14 @@ def loopy_translate_from(self, src_expansion):
            f"A direct loopy kernel for translation from "
            f"{src_expansion} to {self} is not implemented.")

+    def get_loopy_evaluator(self, kernels: Sequence[Kernel]) -> lp.TranslationUnit:


Btw, I would not be opposed to cleaning up the inconsistencies in this PR.

inducer · 2023-09-11T16:52:17Z

sumpy/expansion/__init__.py

-    def loopy_evaluator(self, kernels: Sequence[Kernel]) -> lp.TranslationUnit:
+    def loopy_evaluator_and_optimizations(self, kernels: Sequence[Kernel]) \
+            -> Tuple[lp.TranslationUnit, Sequence[
+                Callable[[lp.TranslationUnit], lp.TranslationUnit]]]:


One optimization callable might suffice.

loopy_evaluator_and_transform?

Consistency with get_inner_knl_and_optimizations?

(Don't feel strongly about any of this.)
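As a sketch of the "one optimization callable" idea: a sequence of transformations can be folded into a single callable, so the method would only need to return a pair. The helper name `compose_transforms` is hypothetical and the type is kept generic; in sumpy the argument would presumably be an `lp.TranslationUnit`.

```python
from functools import reduce
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def compose_transforms(
        transforms: Sequence[Callable[[T], T]]) -> Callable[[T], T]:
    """Fold a sequence of transformations into one callable (hypothetical)."""
    def apply_all(knl: T) -> T:
        # apply the transformations left to right
        return reduce(lambda k, transform: transform(k), transforms, knl)
    return apply_all
```

With this, a caller would write `knl = optimize(knl)` instead of looping over a list of optimizations.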

inducer · 2023-09-11T16:57:54Z

sumpy/expansion/local.py

+            make_e2p_loopy_kernel)
+        try:
+            return make_l2p_loopy_kernel_for_volume_taylor(self, kernels)
+        except NotImplementedError:


Make a sub-brand of NotImplementedError to make sure this is specific.
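A minimal sketch of what this could look like. The exception name and the stand-in functions are hypothetical, not part of sumpy; the point is only that the fallback catches the specific subclass, so an unrelated NotImplementedError raised deeper in the call stack is not silently swallowed.

```python
class SpecializedKernelNotAvailableError(NotImplementedError):
    """Hypothetical: raised when no specialized kernel exists for an expansion."""


def make_specialized_kernel(kind):
    # stand-in for make_l2p_loopy_kernel_for_volume_taylor
    if kind == "broken":
        raise NotImplementedError("unrelated failure")
    if kind != "volume_taylor":
        raise SpecializedKernelNotAvailableError(kind)
    return "specialized"


def make_generic_kernel(kind):
    # stand-in for make_e2p_loopy_kernel
    return "generic"


def loopy_evaluator(kind):
    try:
        return make_specialized_kernel(kind)
    except SpecializedKernelNotAvailableError:
        # only the specific "no specialized kernel" case falls back
        return make_generic_kernel(kind)
```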

inducer · 2023-09-11T17:13:15Z

sumpy/expansion/loopy.py

+
+        slowest_axis = axis_permutation[0]
+        c = max_mi[slowest_axis]
+        v = [pymbolic.var(f"x{i}") for i in range(dim)]


I think it'd be better if these were named i<something>.

Name the "vector" variable similar to the actual iname names, to ~~avoid~~ minimize confusion.

inducer · 2023-09-11T17:14:39Z

sumpy/expansion/loopy.py

+                         for deriv_id in deriv_id_to_coeff])
+
+        def get_domains(v, iorder, with_sync):
+            domains = [f"{{ [{x0}_outer]: 0<={x0}_outer<={order//c} }}"]


Suggested change

-            domains = [f"{{ [{x0}_outer]: 0<={x0}_outer<={order//c} }}"]
+            """
+            :param with_sync: Whether to expose a loop nesting level that
+                is finer-grained than order, for synchronization purposes.
+            """
+            domains = [f"{{ [{x0}_outer]: 0<={x0}_outer<={order//c} }}"]

inducer · 2023-09-11T17:20:38Z

sumpy/expansion/loopy.py

+            # the previous c rows set coeffs_copy[p-1, :]
+            # and then read from coeffs_copy[p, :].


p and p-1: add mod 2.

inducer · 2023-09-11T17:55:22Z

sumpy/kernel.py

+        result = 0
+        for mi, coeff in expr_dict.items():
+            result += coeff * self._diff(expr, bvec, mi)
+        return result


return sum(...) (also for the source version)
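For illustration, the accumulation loop can be written as a single `sum` over a generator expression. The free function and its `diff` parameter below are stand-ins for the method and `self._diff(expr, bvec, mi)`; they are assumptions made to keep the example self-contained.

```python
def weighted_derivative_sum(expr_dict, diff):
    """Return sum of coeff * diff(mi) over expr_dict (hypothetical helper)."""
    return sum(coeff * diff(mi) for mi, coeff in expr_dict.items())
```

Like the original loop, this starts from the integer 0, so it works for any term type that supports addition with 0.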

inducer · 2023-09-11T17:59:54Z

sumpy/kernel.py

@@ -263,6 +273,18 @@ def get_derivative_coeff_dict_at_source(self, expr_dict):
        """
        return expr_dict

+    def get_derivative_coeff_dict_at_target(self, expr_dict):


Type-annotate (also for source)

inducer · 2023-09-11T18:09:54Z

sumpy/kernel.py

@@ -263,6 +273,18 @@ def get_derivative_coeff_dict_at_source(self, expr_dict):
        """
        return expr_dict

+    def get_derivative_coeff_dict_at_target(self, expr_dict):


Describe better what the multi-indices on the input and output side are.

Generate the source/target docstrings from one source via format.

Name: apply_target_transformation_to_derivative_coeff_dict

Maybe make an explicit type that wraps the expr_dict as you suggested.

inducer · 2023-09-11T18:10:04Z

sumpy/kernel.py

+        r"""Get the derivative transformation of the expression at target
+        represented by the dictionary expr_dict which is mapping from multi-index
+        `mi` to coefficient `coeff`.
+        Expression represented by the dictionary `expr_dict` is


Suggested change

-        Expression represented by the dictionary `expr_dict` is
+        The expression represented by the dictionary `expr_dict` is

inducer · 2023-09-11T18:10:53Z

sumpy/e2p.py

-        knl = lp.tag_inames(knl, {"itgt_box": "g.0"})
+    def get_optimized_kernel(self, max_ntargets_in_one_box):
+        _, optimizations = self.get_loopy_evaluator_and_optimizations()
+        knl = self.get_kernel(max_ntargets_in_one_box=max_ntargets_in_one_box)


Add a comment to explain the idea?

inducer · 2023-09-11T18:14:06Z

Unsubscribing... @-mention or request review once it's ready for a look or needs attention.

isuruf force-pushed the e2p_opt branch from 9b42640 to 3664d1c Compare February 15, 2023 18:40

isuruf changed the title ~~Optimize E2P for GPUs~~ Optimize L2P for GPUs Feb 15, 2023

isuruf force-pushed the e2p_opt branch 4 times, most recently from f39226d to 95132cb Compare February 17, 2023 09:37

isuruf marked this pull request as ready for review February 17, 2023 18:16

isuruf force-pushed the e2p_opt branch from 8e34173 to 422ed1a Compare April 25, 2023 08:21

isuruf force-pushed the e2p_opt branch 2 times, most recently from 66e1fa4 to cfe3567 Compare May 17, 2023 19:50

isuruf force-pushed the e2p_opt branch from cfe3567 to 8c81dcc Compare May 17, 2023 20:18

isuruf force-pushed the e2p_opt branch 3 times, most recently from 154bf81 to d89520a Compare May 26, 2023 17:10

isuruf force-pushed the e2p_opt branch from 5d5b166 to 4482f76 Compare May 28, 2023 06:08

inducer force-pushed the e2p_opt branch from 4482f76 to 1a67ca1 Compare June 1, 2023 16:18

inducer reviewed Jun 2, 2023


isuruf mentioned this pull request Jun 2, 2023

Refactor loopy methods #177

Merged

isuruf force-pushed the e2p_opt branch from 1a67ca1 to d910f40 Compare June 2, 2023 23:35

isuruf force-pushed the e2p_opt branch from d910f40 to 19531d0 Compare July 27, 2023 21:46

isuruf mentioned this pull request Jul 27, 2023

Optimize M2P #180

Draft

1 task

isuruf added 4 commits September 11, 2023 12:01

introduce get_derivative_coeff_dict_at_source counterpart for target

4e8d7f7

Optimize L2P for GPUs

9338f80

use new method name and 32->256

fab87bf

get_loopy_evaluator -> loopy_evaluator_and_optimizations

800480e

isuruf force-pushed the e2p_opt branch from 474e0d6 to 800480e Compare September 11, 2023 17:02

inducer reviewed Sep 11, 2023

