Load DNN from Binary #143

Draft · wants to merge 6 commits into master
Conversation

@GNiendorf (Member) commented Jan 7, 2025

Work in progress, the timing on GPU becomes much slower with these changes (see below). Timing on CPU is largely unchanged.

To summarize, it seems like the const keyword on the DNN weights enables optimizations that significantly improve inference time, more than even keeping the weights in constant memory. This is unfortunate because there is currently no way to use half-precision with const arrays in CUDA, since they have to be defined at compile time. So unless half-precision improves the inference time by some large factor, it makes sense to keep the weights in const arrays instead of loading them at the start of the run.
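For illustration, a minimal sketch of the two approaches being weighed here; the first declaration matches NeuralNetworkWeights.h in master, while the runtime-loaded variant and its name are hypothetical, not the PR's actual code.

// Option A (current master): weights baked in at compile time. The const
// qualifier lets the compiler treat every weight as a known constant and
// fold it straight into the generated FFMA instructions.
ALPAKA_STATIC_ACC_MEM_GLOBAL const float wgtT_layer1[23][32] = {
    /* trained values generated at build time */};

// Option B (this PR, sketched): weights live in mutable device memory and
// are filled from a binary file at the start of the run, so the compiler
// can no longer assume their values.
ALPAKA_STATIC_ACC_MEM_GLOBAL float wgtT_layer1_runtime[23][32];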

(Screenshot: timing comparison table, 2025-01-08)

From looking at the SASS, I see that when the const keyword is used, the NN weights are embedded directly into the FFMA instructions, and this seems to be what is causing the speedup.

https://forums.developer.nvidia.com/t/performance-benefit-of-const-keyword-on-dnn-inference/319454

* Layers: A parameter pack of layer types (e.g. DenseLayer<23,32>, DenseLayer<32,1>, etc.)
*/
template <class... Layers>
class Dnn {

Perhaps keep everything in the lst namespace for now.
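For context, a rough sketch of how the variadic Dnn wrapper could look with the namespace suggestion applied; the DenseLayer member function load() used here is an assumption, not the PR's actual interface.

#include <cstdio>
#include <stdexcept>
#include <string>
#include <tuple>

namespace lst {

  template <class... Layers>
  class Dnn {
  public:
    explicit Dnn(std::string const& path) {
      std::FILE* f = std::fopen(path.c_str(), "rb");
      if (!f)
        throw std::runtime_error("Cannot open weight file: " + path);
      // Read each layer's weights and biases in declaration order.
      // layer.load(f) is a hypothetical per-layer deserialization helper.
      std::apply([f](auto&... layer) { (layer.load(f), ...); }, layers_);
      std::fclose(f);
    }

    std::tuple<Layers...> const& getLayers() const { return layers_; }

  private:
    std::tuple<Layers...> layers_;
  };

}  // namespace lst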

Comment on lines +12 to +14
DenseLayer<23, 32> layer1;
DenseLayer<32, 32> layer2;
DenseLayer<32, 1> layer3;

Don't we have named constants for these 23 and 32?
... also, why DevData? This doesn't look like a device-specific type.
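A sketch of the named-constant suggestion, using hypothetical names (LSTCore may already define equivalents under different names):

// Hypothetical constant names for the DNN dimensions.
constexpr unsigned int kDnnInputFeatures = 23;   // input feature count
constexpr unsigned int kDnnHiddenFeatures = 32;  // hidden-layer width

DenseLayer<kDnnInputFeatures, kDnnHiddenFeatures> layer1;
DenseLayer<kDnnHiddenFeatures, kDnnHiddenFeatures> layer2;
DenseLayer<kDnnHiddenFeatures, 1> layer3;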

Comment on lines 118 to 129
auto model = Dnn<DenseLayer<23, 32>, DenseLayer<32, 32>, DenseLayer<32, 1>>(
"/mnt/data1/gsn27/cmssw-fresh/CMSSW_14_2_0_pre4/src/RecoTracker/LSTCore/standalone/analysis/DNN/"
"network_weights.bin");

// Copy the loaded model into a host DnnWeightsDevData struct
lst::DnnWeightsDevData hostDnn;
{
auto const& layers = model.getLayers();
hostDnn.layer1 = std::get<0>(layers);
hostDnn.layer2 = std::get<1>(layers);
hostDnn.layer3 = std::get<2>(layers);
}

I'm curious if there is a way to define the parameter pack and index relationship in DnnWeightsDevData to simplify this code.
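One possible direction, sketched with hypothetical names: the three per-member copies in the quoted code could be collapsed with std::tie(hostDnn.layer1, hostDnn.layer2, hostDnn.layer3) = model.getLayers(), or the struct could mirror Dnn's parameter pack directly so the index relationship is stated only once.

#include <tuple>

// Hypothetical struct mirroring Dnn's parameter pack; relies on the PR's Dnn
// class being visible in this translation unit.
template <class... Layers>
struct DnnWeightsData {
  std::tuple<Layers...> layers;
};

// Copy every layer from the loaded model in a single assignment.
template <class... Layers>
void copyLayers(DnnWeightsData<Layers...>& dst, Dnn<Layers...> const& src) {
  dst.layers = src.getLayers();
}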

@GNiendorf (Member, Author) commented Jan 7, 2025

Just committed the first working version, although it's much slower on GPU.
/run all

ALPAKA_FN_ACC ALPAKA_FN_INLINE void linear_layer(
const float (&input)[IN_FEATURES],
float (&output)[OUT_FEATURES],
const std::array<std::array<float, OUT_FEATURES>, IN_FEATURES>& weights,
@GNiendorf (Member, Author) Jan 7, 2025

Line 31 here, changing the weight arrays to these 2D std::arrays, is what causes most of the timing increase on GPU.

Without changing this line (but keeping all other changes, including the bias change):
(Screenshot: timing output, 2025-01-07 5:55 PM)

After changing this line (and passing in the 2D std::arrays from the dnnPtr below):
(Screenshot: timing output, 2025-01-07 4:51 PM)
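For reference, a sketch of the two argument forms being compared; the baseline signature is an assumption based on this discussion, and only the weights parameter differs.

#include <alpaka/alpaka.hpp>
#include <array>

// Baseline form (assumed): weights as a reference to a const 2D C array,
// which in master binds to the compile-time const globals.
template <int IN_FEATURES, int OUT_FEATURES>
ALPAKA_FN_ACC ALPAKA_FN_INLINE void linear_layer(
    const float (&input)[IN_FEATURES],
    float (&output)[OUT_FEATURES],
    const float (&weights)[IN_FEATURES][OUT_FEATURES]);

// This PR's form: weights as nested std::array filled at run time, so their
// values are no longer visible to the compiler.
template <int IN_FEATURES, int OUT_FEATURES>
ALPAKA_FN_ACC ALPAKA_FN_INLINE void linear_layer(
    const float (&input)[IN_FEATURES],
    float (&output)[OUT_FEATURES],
    const std::array<std::array<float, OUT_FEATURES>, IN_FEATURES>& weights);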

github-actions bot commented Jan 7, 2025

There was a problem while building and running in standalone mode. The logs can be found here.

@GNiendorf (Member, Author)

/run all

@GNiendorf (Member, Author)

Keeping this here until we find a better location for it.

github-actions bot commented Jan 7, 2025

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

(Comparison plots: efficiency, fake rate, and duplicate rate vs pT and eta)

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     45.4    397.1    187.3    149.1    145.6    546.1    124.2    232.1    150.0      3.6    1980.5    1389.0+/- 384.9     524.8   explicit[s=4] (target branch)
   avg     48.4    397.7    189.0    150.6    145.2    504.8    121.1    235.6    144.4      3.5    1940.2    1387.0+/- 380.8     514.9   explicit[s=4] (this PR)

github-actions bot commented Jan 8, 2025

There was a problem while building and running with CMSSW. The logs can be found here.

@GNiendorf (Member, Author) commented Jan 8, 2025

@slava77 It seems like the real issue with the weights is that we get a significant speedup from placing the current weights in constant memory. If I replace the existing weights' const qualifier with just __device__, the timing is exactly the same as what's shown above, 3.8 ms for T5 building with a single stream. I will have to declare the weight arrays in constant memory and then copy them into those constant arrays at the start of the run.

edit: I'm not sure if it's constant memory or just the const keyword. If I replace ALPAKA_STATIC_ACC_MEM_GLOBAL const with just ALPAKA_STATIC_ACC_MEM_CONSTANT, the timing gets substantially worse (~2.4 ms single stream for T5 building). If I replace ALPAKA_STATIC_ACC_MEM_GLOBAL const with __device__ const, the timing doesn't change (still fast), but if I remove the const and just have __device__, the timing becomes very slow. I would have to state __constant__ for CUDA to place an array in constant memory, right? The const keyword alone shouldn't do that? So the const keyword must be applying some significant optimizations...
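The qualifier combinations being compared above, written out for one weight array; the variant names with suffixes are hypothetical, and the timings are the ones quoted in these comments.

// Fast (baseline): global device memory, const-qualified; the compiler knows
// the values at compile time and folds them into the FFMA instructions.
ALPAKA_STATIC_ACC_MEM_GLOBAL const float wgtT_layer1[23][32] = {/* trained values */};

// Also fast: plain CUDA __device__ combined with const behaves the same way.
__device__ const float wgtT_layer1_devconst[23][32] = {/* trained values */};

// Substantially worse (~2.4 ms single-stream T5 building): the same data
// placed in constant memory instead.
ALPAKA_STATIC_ACC_MEM_CONSTANT float wgtT_layer1_constmem[23][32] = {/* trained values */};

// Very slow (3.8 ms single-stream T5 building): mutable __device__ array
// without const; the values can no longer be folded at compile time.
__device__ float wgtT_layer1_device[23][32] = {/* trained values */};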

@GNiendorf (Member, Author)

Unless I'm misunderstanding something, I don't think this PR can proceed... if the optimizations from the const keyword on the DNN weights are responsible for the fast inference time, then the weights must be fixed at compile time rather than loaded at the start of the run. Also, since CUDA doesn't allow half-precision const arrays, it's currently impossible to get the benefit of both the const keyword and half-precision at the same time. It may be that the gains from lower precision outweigh the gains from the const keyword, but that would probably take some effort to figure out.

@slava77 commented Jan 8, 2025

@GNiendorf
I couldn't quite follow the implied code changes from your comments that correspond to the slowdown.
Perhaps you can prepare a list of (shortish) commits with corresponding timing values and discuss it in the GPU Mattermost.

@GNiendorf (Member, Author)

Sounds good, I'll prepare some commits like you suggested. Here is a table version of the different results.

(Screenshot: timing comparison table, 2025-01-08)

@slava77 commented Jan 8, 2025

> Sounds good, I'll prepare some commits like you suggested. Here is a table version of the different results.
> (Screenshot: timing comparison table, 2025-01-08)

Do I understand correctly that the first 4 lines refer to the code in the baseline/master, i.e. changing the array qualifiers in RecoTracker/LSTCore/src/alpaka/NeuralNetworkWeights.h (ALPAKA_STATIC_ACC_MEM_GLOBAL const float wgtT_layer1[23][32])?
Or is it really where you commented, in the argument qualifiers of the ALPAKA_FN_ACC ALPAKA_FN_INLINE void linear_layer( function?

@GNiendorf (Member, Author) commented Jan 8, 2025

@slava77 It's the first one, changing the keywords in NeuralNetworkWeights.h for the baseline/master, not the function arguments.

@slava77 commented Jan 11, 2025

> Sounds good, I'll prepare some commits like you suggested. Here is a table version of the different results.
> (Screenshot: timing comparison table, 2025-01-08)

I'm curious if a similar pattern is visible on an older GPU, like a V100 (on phi3). Those do not have const registers.
I guess the same idea can also be checked on the more recent GPUs by looking at the assembly, to see whether the inputs end up in the const registers or in regular ones.

@GNiendorf (Member, Author)

https://forums.developer.nvidia.com/t/performance-benefit-of-const-keyword-on-dnn-inference/319454

From looking at the SASS, I see that when the const keyword is used, the NN weights are embedded directly into the FFMA instructions, and this seems to be what is causing the speedup. I also linked a very closely related (but slightly outdated) Stack Overflow discussion below.

https://stackoverflow.com/questions/28041536/compile-constant-memory-array-to-immediate-value-in-cuda
