Load DNN from Binary #143

Draft · wants to merge 6 commits into master
Conversation

@GNiendorf (Member) commented Jan 7, 2025

Work in progress, the timing on GPU becomes much slower with these changes (see below). Timing on CPU is largely unchanged.

To summarize, it seems like the const keyword on the DNN weights enables optimizations that significantly improve inference time, more than even keeping the weights in constant memory. This is unfortunate because there is currently no way to use half-precision with const arrays in CUDA, since they have to be defined at compile time. So unless half-precision improves the inference time by some large factor, it makes sense to keep the weights in const arrays instead of loading them at the start of the run.
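For illustration, a minimal sketch of the two approaches being weighed here; the first declaration matches NeuralNetworkWeights.h in master, while the runtime-loaded variant and its name are hypothetical, not the PR's actual code.

// Option A (current master): weights baked in at compile time. The const
// qualifier lets the compiler treat every weight as a known constant and
// fold it straight into the generated FFMA instructions.
ALPAKA_STATIC_ACC_MEM_GLOBAL const float wgtT_layer1[23][32] = {
    /* trained values generated at build time */};

// Option B (this PR, sketched): weights live in mutable device memory and
// are filled from a binary file at the start of the run, so the compiler
// can no longer assume their values.
ALPAKA_STATIC_ACC_MEM_GLOBAL float wgtT_layer1_runtime[23][32];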

(Screenshot: timing comparison table, 2025-01-08)

From looking at the SASS, I see that when the const keyword is used, the NN weights are embedded directly into the FFMA instructions, and this seems to be what is causing the speedup.

https://forums.developer.nvidia.com/t/performance-benefit-of-const-keyword-on-dnn-inference/319454

* Layers: A parameter pack of layer types (e.g. DenseLayer<23,32>, DenseLayer<32,1>, etc.)
*/
template <class... Layers>
class Dnn {

Perhaps keep everything in the lst namespace for now.
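For context, a rough sketch of how the variadic Dnn wrapper could look with the namespace suggestion applied; the DenseLayer member function load() used here is an assumption, not the PR's actual interface.

#include <cstdio>
#include <stdexcept>
#include <string>
#include <tuple>

namespace lst {

  template <class... Layers>
  class Dnn {
  public:
    explicit Dnn(std::string const& path) {
      std::FILE* f = std::fopen(path.c_str(), "rb");
      if (!f)
        throw std::runtime_error("Cannot open weight file: " + path);
      // Read each layer's weights and biases in declaration order.
      // layer.load(f) is a hypothetical per-layer deserialization helper.
      std::apply([f](auto&... layer) { (layer.load(f), ...); }, layers_);
      std::fclose(f);
    }

    std::tuple<Layers...> const& getLayers() const { return layers_; }

  private:
    std::tuple<Layers...> layers_;
  };

}  // namespace lst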

Comment on lines +12 to +14
DenseLayer<23, 32> layer1;
DenseLayer<32, 32> layer2;
DenseLayer<32, 1> layer3;

Don't we have named constants for these 23 and 32?
... also, why DevData? This doesn't look like a device-specific type.
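A sketch of the named-constant suggestion, using hypothetical names (LSTCore may already define equivalents under different names):

// Hypothetical constant names for the DNN dimensions.
constexpr unsigned int kDnnInputFeatures = 23;   // input feature count
constexpr unsigned int kDnnHiddenFeatures = 32;  // hidden-layer width

DenseLayer<kDnnInputFeatures, kDnnHiddenFeatures> layer1;
DenseLayer<kDnnHiddenFeatures, kDnnHiddenFeatures> layer2;
DenseLayer<kDnnHiddenFeatures, 1> layer3;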

Comment on lines 118 to 129
auto model = Dnn<DenseLayer<23, 32>, DenseLayer<32, 32>, DenseLayer<32, 1>>(
"/mnt/data1/gsn27/cmssw-fresh/CMSSW_14_2_0_pre4/src/RecoTracker/LSTCore/standalone/analysis/DNN/"
"network_weights.bin");

// Copy the loaded model into a host DnnWeightsDevData struct
lst::DnnWeightsDevData hostDnn;
{
auto const& layers = model.getLayers();
hostDnn.layer1 = std::get<0>(layers);
hostDnn.layer2 = std::get<1>(layers);
hostDnn.layer3 = std::get<2>(layers);
}

I'm curious if there is a way to define the parameter pack and index relationship in DnnWeightsDevData to simplify this code.
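One possible direction, sketched with hypothetical names: the three per-member copies in the quoted code could be collapsed with std::tie(hostDnn.layer1, hostDnn.layer2, hostDnn.layer3) = model.getLayers(), or the struct could mirror Dnn's parameter pack directly so the index relationship is stated only once.

#include <tuple>

// Hypothetical struct mirroring Dnn's parameter pack; relies on the PR's Dnn
// class being visible in this translation unit.
template <class... Layers>
struct DnnWeightsData {
  std::tuple<Layers...> layers;
};

// Copy every layer from the loaded model in a single assignment.
template <class... Layers>
void copyLayers(DnnWeightsData<Layers...>& dst, Dnn<Layers...> const& src) {
  dst.layers = src.getLayers();
}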

@GNiendorf (Member, Author) commented Jan 7, 2025

Just committed the first working version, although it's much slower on GPU.
/run all

ALPAKA_FN_ACC ALPAKA_FN_INLINE void linear_layer(
const float (&input)[IN_FEATURES],
float (&output)[OUT_FEATURES],
const std::array<std::array<float, OUT_FEATURES>, IN_FEATURES>& weights,
@GNiendorf (Member, Author) Jan 7, 2025

Line 31 here, changing the weight arrays to these 2D std::arrays, is what causes most of the timing increase on GPU.

Without changing this line (but keeping all other changes, including the bias change):
(Screenshot: timing output, 2025-01-07 5:55 PM)

After changing this line (and passing in the 2D std::arrays from the dnnPtr below):
(Screenshot: timing output, 2025-01-07 4:51 PM)
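For reference, a sketch of the two argument forms being compared; the baseline signature is an assumption based on this discussion, and only the weights parameter differs.

#include <alpaka/alpaka.hpp>
#include <array>

// Baseline form (assumed): weights as a reference to a const 2D C array,
// which in master binds to the compile-time const globals.
template <int IN_FEATURES, int OUT_FEATURES>
ALPAKA_FN_ACC ALPAKA_FN_INLINE void linear_layer(
    const float (&input)[IN_FEATURES],
    float (&output)[OUT_FEATURES],
    const float (&weights)[IN_FEATURES][OUT_FEATURES]);

// This PR's form: weights as nested std::array filled at run time, so their
// values are no longer visible to the compiler.
template <int IN_FEATURES, int OUT_FEATURES>
ALPAKA_FN_ACC ALPAKA_FN_INLINE void linear_layer(
    const float (&input)[IN_FEATURES],
    float (&output)[OUT_FEATURES],
    const std::array<std::array<float, OUT_FEATURES>, IN_FEATURES>& weights);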

github-actions bot commented Jan 7, 2025

There was a problem while building and running in standalone mode. The logs can be found here.

@GNiendorf (Member, Author)

/run all

@GNiendorf (Member, Author)

Keeping this here until we find a better location for it.

github-actions bot commented Jan 7, 2025

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

(Comparison plots: efficiency, fake rate, and duplicate rate vs pT and eta)

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     45.4    397.1    187.3    149.1    145.6    546.1    124.2    232.1    150.0      3.6    1980.5    1389.0+/- 384.9     524.8   explicit[s=4] (target branch)
   avg     48.4    397.7    189.0    150.6    145.2    504.8    121.1    235.6    144.4      3.5    1940.2    1387.0+/- 380.8     514.9   explicit[s=4] (this PR)

github-actions bot commented Jan 8, 2025

There was a problem while building and running with CMSSW. The logs can be found here.

@GNiendorf (Member, Author) commented Jan 8, 2025

@slava77 It seems like the real issue with the weights is that we get a significant speedup from placing the current weights in constant memory. If I replace the existing weights' const qualifier with just __device__, the timing is exactly the same as what's shown above, 3.8 ms for T5 building with a single stream. I will have to declare the weight arrays in constant memory and then copy them into those constant arrays at the start of the run.

edit: I'm not sure if it's constant memory or just the const keyword. If I replace ALPAKA_STATIC_ACC_MEM_GLOBAL const with just ALPAKA_STATIC_ACC_MEM_CONSTANT, the timing gets substantially worse (~2.4 ms single stream for T5 building). If I replace ALPAKA_STATIC_ACC_MEM_GLOBAL const with __device__ const, the timing doesn't change (still fast), but if I remove the const and just have __device__, the timing becomes very slow. I would have to state __constant__ for CUDA to place an array in constant memory, right? The const keyword alone shouldn't do that? So the const keyword must be applying some significant optimizations...
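The qualifier combinations being compared above, written out for one weight array; the variant names with suffixes are hypothetical, and the timings are the ones quoted in these comments.

// Fast (baseline): global device memory, const-qualified; the compiler knows
// the values at compile time and folds them into the FFMA instructions.
ALPAKA_STATIC_ACC_MEM_GLOBAL const float wgtT_layer1[23][32] = {/* trained values */};

// Also fast: plain CUDA __device__ combined with const behaves the same way.
__device__ const float wgtT_layer1_devconst[23][32] = {/* trained values */};

// Substantially worse (~2.4 ms single-stream T5 building): the same data
// placed in constant memory instead.
ALPAKA_STATIC_ACC_MEM_CONSTANT float wgtT_layer1_constmem[23][32] = {/* trained values */};

// Very slow (3.8 ms single-stream T5 building): mutable __device__ array
// without const; the values can no longer be folded at compile time.
__device__ float wgtT_layer1_device[23][32] = {/* trained values */};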

@GNiendorf (Member, Author)

Unless I'm misunderstanding something, I don't think this PR can proceed... if the optimizations from the const keyword on the DNN weights are responsible for the fast inference time, then the weights must be fixed at compile time rather than loaded at the start of the run. Also, since CUDA doesn't allow half-precision const arrays, it's currently impossible to get the benefit of both the const keyword and half-precision at the same time. It may be that the gains from lower precision outweigh the gains from the const keyword, but that would probably take some effort to figure out.

@slava77 commented Jan 8, 2025

@GNiendorf
I couldn't quite follow the implied code changes from your comments that correspond to the slowdown.
Perhaps you can prepare a list of (shortish) commits with corresponding timing values and discuss it in the GPU Mattermost.

@GNiendorf (Member, Author)

Sounds good, I'll prepare some commits like you suggested. Here is a table version of the different results.

(Screenshot: timing comparison table, 2025-01-08)

@slava77 commented Jan 8, 2025

> Sounds good, I'll prepare some commits like you suggested. Here is a table version of the different results.
> (Screenshot: timing comparison table, 2025-01-08)

Do I understand correctly that the first 4 lines refer to the code in the baseline/master, i.e. changing the array qualifiers in RecoTracker/LSTCore/src/alpaka/NeuralNetworkWeights.h (ALPAKA_STATIC_ACC_MEM_GLOBAL const float wgtT_layer1[23][32])?
Or is it really where you commented, in the argument qualifiers of the ALPAKA_FN_ACC ALPAKA_FN_INLINE void linear_layer( function?

@GNiendorf (Member, Author) commented Jan 8, 2025

@slava77 It's the first one, changing the keywords in NeuralNetworkWeights.h for the baseline/master, not the function arguments.

@slava77 commented Jan 11, 2025

> Sounds good, I'll prepare some commits like you suggested. Here is a table version of the different results.
> (Screenshot: timing comparison table, 2025-01-08)

I'm curious if a similar pattern is visible on an older GPU, like a V100 (on phi3). Those do not have const registers.
I guess the same idea can also be checked on the more recent GPUs by looking at the assembly, to see whether the inputs end up in the const registers or in regular ones.

@GNiendorf (Member, Author)

https://forums.developer.nvidia.com/t/performance-benefit-of-const-keyword-on-dnn-inference/319454

From looking at the SASS, I see that when the const keyword is used, the NN weights are embedded directly into the FFMA instructions, and this seems to be what is causing the speedup. I also linked a very closely related (but slightly outdated) Stack Overflow discussion below.

https://stackoverflow.com/questions/28041536/compile-constant-memory-array-to-immediate-value-in-cuda
