[multibody] Experiment using M frame for Inverse Dynamics, for timings/sanity check #22253

Draft: wants to merge 4 commits into base: master


sherm1 (Member) commented Dec 3, 2024

WIP, not intended to merge, don't review

This branch will be an experiment to integrate Alejandro's M-frame inverse dynamics prototype into Drake to see what speedups we can get in real life switching from W to M.



sherm1 commented Dec 5, 2024

Preliminary performance results for M-frame Inverse Dynamics (ID). TL;DR: switching from the W frame to the M frame gave roughly a 25% speedup for ID. Combined with previous changes, ID is now nearly 2X faster than when we started. Details:

| | This PR (ID-M) | Master (ID-W) | Before speedup work |
|---|---|---|---|
| time (μs) | 11.6 | 15.2 | 21.4 |
| speedup | 24% | 26% | -- |
| total | 46% | 26% | -- |

Cassie benchmark times on my old Puget Xeon [email protected], g++ 11.4
(ID-W is the World frame version in master, ID-M is the new M-frame method)

I'm still studying this to see where we can squeeze out more speed.
cc @amcastro-tri

sherm1 commented Dec 6, 2024

Since we're trying to match Pinocchio timings (we think about 2 μs for Cassie-size ID), a few further considerations are needed for an apples-to-apples comparison. We've been including position and velocity kinematics in our ID timings, while Pinocchio may be holding kinematics fixed and measuring the ID time alone. Also, the timings above used gcc 11.4, which optimizes poorly compared to clang 14.0.0. And the Pinocchio timings were presumably run on a faster machine than my 7-year-old Puget. Let's see how those factors affect things. TL;DR: this gets us to about 2 μs for ID alone, and we're within 2X even with kinematics included.

| | P+V+ID-W | P+V+ID-M | Just ID-M |
|---|---|---|---|
| time (μs) | 4.95 | 3.74 | 2.14 |

Timed on my laptop: Xeon W-11855M CPU @ 3.20GHz, using clang 14.0.0
(ID-W is the World frame version in master, ID-M is the new M-frame method)

@sherm1 sherm1 force-pushed the better_inverse_dynamics branch 2 times, most recently from a1754dd to 5adfeef Compare December 16, 2024 20:02
@sherm1 sherm1 force-pushed the better_inverse_dynamics branch 2 times, most recently from fe6adf8 to 18faf6a Compare December 20, 2024 01:23
@sherm1 sherm1 force-pushed the better_inverse_dynamics branch 4 times, most recently from 8df8522 to 0cda315 Compare January 13, 2025 22:26
sherm1 commented Jan 13, 2025

Minor update: After profiling, I've been experimenting with SIMD implementations for operations that stand out:

  • Symmetric 3x3 matrix times 3-vector
  • Cross product w × r
  • Double cross product w × (w × r)
  • Re-express spatial vector

Although all of these can be done with only a few packed floating point operations, only the last one was better than optimized C++ (according to llvm-mca in Godbolt). That's because of the many instructions required to fill and reorder the 4-element ymm registers before executing the packed floating point operations. For short functions the loss of inlining is also likely a problem, though I couldn't analyze that in Godbolt.
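
For reference, the four operations listed above can be written as plain scalar C++, which is the sort of code the compiler is left to optimize on its own. This is a minimal sketch under stated assumptions, not Drake's code: the types `Vec3`, `Sym3`, `Rot3`, `Spatial` and all function names are hypothetical, and the double cross product is read as w × (w × r).

```cpp
#include <array>

// Hypothetical types for illustration only (not Drake's classes).
using Vec3 = std::array<double, 3>;
using Rot3 = std::array<std::array<double, 3>, 3>;  // row-major rotation

// Symmetric 3x3 matrix stored as its six unique elements.
struct Sym3 {
  double xx, yy, zz, xy, xz, yz;
};

// Spatial vector: rotational part and translational part.
struct Spatial {
  Vec3 rot;
  Vec3 trans;
};

// Symmetric 3x3 matrix times 3-vector.
inline Vec3 SymTimesVec(const Sym3& S, const Vec3& v) {
  return {S.xx * v[0] + S.xy * v[1] + S.xz * v[2],
          S.xy * v[0] + S.yy * v[1] + S.yz * v[2],
          S.xz * v[0] + S.yz * v[1] + S.zz * v[2]};
}

// Cross product w x r.
inline Vec3 Cross(const Vec3& w, const Vec3& r) {
  return {w[1] * r[2] - w[2] * r[1],
          w[2] * r[0] - w[0] * r[2],
          w[0] * r[1] - w[1] * r[0]};
}

// Double cross product w x (w x r).
inline Vec3 DoubleCross(const Vec3& w, const Vec3& r) {
  return Cross(w, Cross(w, r));
}

// Rotation matrix times 3-vector.
inline Vec3 Rotate(const Rot3& R, const Vec3& v) {
  return {R[0][0] * v[0] + R[0][1] * v[1] + R[0][2] * v[2],
          R[1][0] * v[0] + R[1][1] * v[1] + R[1][2] * v[2],
          R[2][0] * v[0] + R[2][1] * v[1] + R[2][2] * v[2]};
}

// Re-express a spatial vector: rotate both 3-vector halves by R.
inline Spatial ReExpress(const Rot3& R, const Spatial& V) {
  return {Rotate(R, V.rot), Rotate(R, V.trans)};
}
```

Re-expressing a spatial vector applies the same rotation to two 3-vectors, so it does roughly twice the arithmetic of the other kernels per call, which fits the observation that it was the only one where packing the wide registers paid off.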

CassieBench timing with the re-express spatial vector SIMD provided only a 2% speedup overall, so it's not worth the extra complexity. My conclusion is that we can only get real SIMD speedups with more substantial operations. I'm not seeing good candidates in kinematics and ID, but will revisit this when I get to forward dynamics.

Interestingly (to me anyway) the compilers (g++ 11.4, clang 14) managed to do a little 2-wide SIMD when working with 3-vectors, using a double wide xmm operation followed by a scalar operation. This required much less shuffling so the overall performance was better than I could get after the contortions required to pack the 4-wide registers. This suggests to me that it will be futile to attempt to exploit the 8-wide zmm SIMD instructions in AVX512 for small data structures -- they will certainly be useful for large operations though.
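
The "2-wide packed operation plus one scalar operation" pattern described above can be spelled out by hand with SSE2 intrinsics. The sketch below is only an illustration of that pattern for a 3-vector add; the compilers generated it automatically from ordinary C++, and `Vec3` and `Add3` are hypothetical names. It falls back to scalar code when SSE2 is not available.

```cpp
#include <array>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics operating on xmm registers
#endif

using Vec3 = std::array<double, 3>;  // hypothetical 3-vector type

// Adds two 3-vectors using one packed 2-wide add and one scalar add.
inline Vec3 Add3(const Vec3& a, const Vec3& b) {
  Vec3 out;
#if defined(__SSE2__)
  // Elements 0 and 1: a single packed add on two doubles.
  const __m128d lo =
      _mm_add_pd(_mm_loadu_pd(a.data()), _mm_loadu_pd(b.data()));
  _mm_storeu_pd(out.data(), lo);
  // Element 2: a scalar add in the low lane of an xmm register.
  const __m128d hi =
      _mm_add_sd(_mm_load_sd(a.data() + 2), _mm_load_sd(b.data() + 2));
  _mm_store_sd(out.data() + 2, hi);
#else
  // Portable fallback with the same result.
  for (int i = 0; i < 3; ++i) out[i] = a[i] + b[i];
#endif
  return out;
}
```

No shuffles are needed here because the first two elements are already contiguous in memory, which is the advantage over filling a 4-wide register from a 3-element vector.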

Moving on now ...

@sherm1 sherm1 force-pushed the better_inverse_dynamics branch 4 times, most recently from fcbae08 to bf578b2 Compare January 17, 2025 02:12
@sherm1 sherm1 force-pushed the better_inverse_dynamics branch from 0c3357a to 5058cde Compare January 22, 2025 23:47
@sherm1 sherm1 force-pushed the better_inverse_dynamics branch 6 times, most recently from 8087d9a to c407a69 Compare January 30, 2025 00:40
@sherm1 sherm1 force-pushed the better_inverse_dynamics branch from 2264103 to a4bec21 Compare February 6, 2025 03:05
@sherm1 sherm1 force-pushed the better_inverse_dynamics branch 5 times, most recently from 2200bb7 to f81b2fd Compare February 27, 2025 21:39
@sherm1 sherm1 force-pushed the better_inverse_dynamics branch from f81b2fd to 920f732 Compare March 7, 2025 00:10
@sherm1 sherm1 force-pushed the better_inverse_dynamics branch 4 times, most recently from 3784226 to 3a2f0e8 Compare March 19, 2025 00:10
@sherm1 sherm1 force-pushed the better_inverse_dynamics branch 7 times, most recently from 32e9cfe to 37d1435 Compare March 25, 2025 23:45
@sherm1 sherm1 force-pushed the better_inverse_dynamics branch from 37d1435 to 38d12e2 Compare March 28, 2025 00:34
sherm1 commented Mar 28, 2025

This PR also contains the WIP on optimal F&M frames for revolute joints. With that mostly in and using the specialized axial rotation functions in #22814, just to see where we are I timed Cassie Position Kinematics (on my old Puget with some other things running). This is the existing F-frame algorithm, not the M-frame kinematics which is also in this WIP PR.

| | start of project | master | this PR |
|---|---|---|---|
| time (μs) | 5.33 | 3.43 | 2.65 |
| vs. master | ---- | ---- | 23% |
| vs. start | ---- | 36% | 50% |

Considering there is not much going on in Position Kinematics, IMO a 2X speedup for the original algorithm is a good start! More to come ...
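
As a rough illustration of why a rotation about a single coordinate axis can be cheaper than a general rotation, the sketch below applies a rotation about z directly from its sine and cosine. This is an assumption about the general idea only; it is not the code in #22814, and `RotateAboutZ` is a hypothetical name.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;  // hypothetical 3-vector type

// Applies a rotation by `theta` radians about the z axis to v.
// Uses 4 multiplies and 2 adds; a general 3x3 rotation matrix times a
// vector needs 9 multiplies and 6 adds.
inline Vec3 RotateAboutZ(double theta, const Vec3& v) {
  const double c = std::cos(theta);
  const double s = std::sin(theta);
  return {c * v[0] - s * v[1],  // x' = c x - s y
          s * v[0] + c * v[1],  // y' = s x + c y
          v[2]};                // z is unchanged
}
```

In a kinematics pass the sine and cosine would normally be computed once per joint and reused, so the saving comes from skipping the structural zeros and ones of the rotation matrix.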

sherm1 commented Mar 31, 2025

For comparison (and apples-to-apples times vs published results), CassieBench Position Kinematics times for this PR on my laptop (Xeon W-11855M CPU @ 3.20GHz):

| | Old Puget (gcc) | laptop gcc | laptop clang |
|---|---|---|---|
| time (μs) | 2.65 | 1.06 | 0.975 |

Caveat: it is hard to get repeatable times on my Ubuntu VM. The above times are the best of many runs.
