
Thread safe slot maps without containers #1785

Conversation

aardvark179
Contributor

Thread safe slot maps without containers

This is a stacked PR on top of

To enable thread safe slot maps without containers we want two properties to hold true:

  1. For empty and single-entry maps, which do not require a lock, we must do a compare and exchange rather than a simple assignment when setting the map; if the exchange fails, we perform the operation on the newly installed map instead (see the sketch after this list). The number of times this can happen is strictly limited by the chain of slot map promotions.
  2. For larger maps that do require a lock we need to ensure the operation can be performed on the currently installed map at the time the lock is acquired.
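
To make the retry bound concrete, here is a minimal, hypothetical sketch of the "install or delegate" pattern. It uses AtomicReference purely for brevity; the mechanism actually used is described in the next section, and every name here is a stand-in rather than one of the real classes.

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative only: stand-ins for the owner/map types, not the PR's real classes.
class OwnerSketch {
    interface Map {
        Map withAddedSlot(Object key);               // immutable "copy plus one slot"
        void addSlot(OwnerSketch owner, Object key); // fallback path on a lost race
    }

    private final AtomicReference<Map> map;

    OwnerSketch(Map initial) {
        this.map = new AtomicReference<>(initial);
    }

    void addSlot(Object key) {
        Map current = map.get();
        Map replacement = current.withAddedSlot(key);
        // Compare-and-set rather than a plain write: if another thread promoted
        // the map first, perform the operation on whatever map it installed.
        // The number of retries is bounded by the finite promotion chain
        // (empty -> single entry -> larger maps), so this cannot loop forever.
        if (!map.compareAndSet(current, replacement)) {
            map.get().addSlot(this, key);
        }
    }
}
```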

Compare and exchange operation

We utilise the java.lang.invoke.VarHandle methods to perform this operation. These are not available on older versions of Android or on Java 8, but if thread-safe maps are required in those environments then SlotMapOwner could be refactored to use a java.util.concurrent.atomic.AtomicReference at the expense of one extra object (and one level of indirection).
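
As a rough illustration of the difference, a VarHandle over a plain field looks something like the sketch below (names are again stand-ins, not the real classes); the AtomicReference sketch above shows the shape the fallback refactoring would take.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Illustrative only: a VarHandle over a plain field avoids the extra
// AtomicReference object and its level of indirection.
class VarHandleOwnerSketch {
    interface Map {
        Map withAddedSlot(Object key);
        void addSlot(VarHandleOwnerSketch owner, Object key);
    }

    private static final VarHandle MAP;

    static {
        try {
            MAP = MethodHandles.lookup()
                    .findVarHandle(VarHandleOwnerSketch.class, "map", Map.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private volatile Map map;

    VarHandleOwnerSketch(Map initial) {
        this.map = initial;
    }

    void addSlot(Object key) {
        Map current = map;
        Map replacement = current.withAddedSlot(key);
        // compareAndExchange returns the witness value, so a failed exchange
        // hands us the map that the winning thread installed.
        Map witness = (Map) MAP.compareAndExchange(this, current, replacement);
        if (witness != current) {
            witness.addSlot(this, key);
        }
    }
}
```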

Promotion and lock sharing

Empty maps to single entry maps

Promoting from the empty map to a single-entry slot map is simple: both the empty map and single-entry maps are immutable, so we can construct the single-entry map and perform the compare and exchange; then we are either done, or we ask the currently installed map to perform the operation.
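
Purely for illustration, the two small map shapes might look something like the following (these are hypothetical stand-ins, not the PR's actual classes):

```java
// Illustrative only: immutable small-map shapes roughly corresponding to the
// empty and single-entry maps described above.
final class EmptyMapSketch {
    Object query(Object key) {
        return null; // nothing is ever stored in the empty map
    }

    // Promotion builds a brand-new single-entry map; the owner then tries to
    // install it with a compare and exchange as in the sketches above.
    SingleEntryMapSketch withEntry(Object key, Object value) {
        return new SingleEntryMapSketch(key, value);
    }
}

final class SingleEntryMapSketch {
    private final Object key;   // final fields: safe publication once installed
    private final Object value;

    SingleEntryMapSketch(Object key, Object value) {
        this.key = key;
        this.value = value;
    }

    Object query(Object k) {
        return key.equals(k) ? value : null;
    }
}
```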

Single entry maps to larger maps

Again, we can construct the larger ThreadSafeEmbeddedSlotMap and perform a compare and exchange to install it. The exception is the compute operation: because we know a mutation will occur, we install the new map first and then perform the compute on it. This avoids any side effects that might result from performing the compute operation twice.
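
A hedged sketch of that compute path, with placeholder names (casMap, computeAfterPromotion and the interfaces here are illustrative, not the real API):

```java
import java.util.function.BiFunction;

// Illustrative sketch of the compute special case when promoting past a
// single-entry map.
class ComputeSketch {
    interface Map {
        Object compute(Owner owner, Object key, BiFunction<Object, Object, Object> fn);
    }

    interface Owner {
        // Compare-and-exchange wrapper: returns true if 'replacement' was installed.
        boolean casMap(Map expected, Map replacement);
        Map currentMap();
    }

    // Install the larger map before evaluating 'fn', so the (possibly
    // side-effecting) function runs exactly once, never speculatively.
    static Object computeAfterPromotion(Owner owner, Map singleEntryMap,
                                        Map largerMap, Object key,
                                        BiFunction<Object, Object, Object> fn) {
        if (owner.casMap(singleEntryMap, largerMap)) {
            return largerMap.compute(owner, key, fn);
        }
        // Lost the race: another thread already promoted the map, so run the
        // compute on the map it installed instead.
        return owner.currentMap().compute(owner, key, fn);
    }
}
```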

Promotion between large maps

The promotion of new maps looks something like the following. Blue links represent references that exist up until the new map is installed, and red links represent objects and references that only exist in the new map. Since the old map cannot be mutated, nor the new map installed, without holding the lock, this operation does not need to be entirely atomic. Note that both the old and new maps share the same lock object.

```mermaid
graph LR
Object["ScriptableObject"]
Lock["StampedLock"]
Map["OldSlotMap"]
Array["Slot[]"]
Slot1["Slot"]
Slot2["Slot"]
Slot3["Slot"]
Object --> Map
Map --> Array
Map --> Lock
Array --> Slot1
Array --> Slot2
Array --> Slot3
Map1["NewSlotMap"]
Array1["Slot[]"]
Slot4["Slot"]
Slot5["Slot"]
Slot6["Slot"]
Object --> Map1
Map1 --> Array1
Map1 --> Lock
Map --> Map1
Array1 --> Slot4
Array1 --> Slot5
Array1 --> Slot6
style Map1 stroke:red
style Array1 stroke:red
style Slot4 stroke:red
style Slot5 stroke:red
style Slot6 stroke:red
linkStyle 0 stroke:blue
linkStyle 6 stroke:red
linkStyle 8 stroke:red
linkStyle 9 stroke:red
linkStyle 10 stroke:red
linkStyle 11 stroke:red
linkStyle 12 stroke:red
```
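
A hedged sketch of the promotion step in the diagram; LargeMapSketch, OwnerStub and newerMap are illustrative stand-ins, but the key point matches the diagram: the new map reuses the old map's StampedLock and keeps a forward reference from old to new.

```java
import java.util.concurrent.locks.StampedLock;

// Illustrative only: a large thread-safe map whose promotion reuses the same
// StampedLock, so threads parked on the old map's lock still serialize
// correctly against operations on the new map.
class LargeMapSketch {
    final StampedLock lock;
    Object[] slots;
    volatile LargeMapSketch newerMap; // the Map --> Map1 link in the diagram

    LargeMapSketch(StampedLock lock, int capacity) {
        this.lock = lock;
        this.slots = new Object[capacity];
    }

    // Called with the write lock already held.
    LargeMapSketch promote(OwnerStub owner) {
        LargeMapSketch larger = new LargeMapSketch(this.lock, slots.length * 2);
        System.arraycopy(slots, 0, larger.slots, 0, slots.length); // rehashed in reality
        // Neither the old map can be mutated nor the new map installed without
        // the shared lock, so these two steps need not be a single atomic action.
        this.newerMap = larger;
        owner.setMap(larger);
        return larger;
    }

    interface OwnerStub {
        void setMap(LargeMapSketch map);
    }
}
```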

This approach of sharing a single lock enables sequences like the following to work correctly:

```mermaid
sequenceDiagram
    Thread1 ->>+ OldMap: Acquires lock
    Thread2 ->>+ OldMap: Wait for lock
    Thread1 ->>+ OldMap: Promote map
    OldMap ->>+ NewMap: Replace map
    Thread1 ->>+ OldMap: Release lock
    Thread2 ->>+ OldMap: Acquire lock
    Thread2 ->>+ OldMap: Perform action
    Thread1 ->>+ NewMap: Wait for lock
    OldMap ->>+ NewMap: Delegate action
    Thread2 ->>+ OldMap: Release lock
    Thread1 ->>+ NewMap: Acquire lock
```
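
Building on the LargeMapSketch stand-in above, a sketch of why a thread that blocked on the old map's lock still behaves correctly: once it acquires the shared lock, it walks forward to the newest installed map and performs its operation there (doPut is a hypothetical placeholder for the real insertion logic).

```java
import java.util.concurrent.locks.StampedLock;

// Illustrative sketch of an operation that raced with a promotion.
class DelegatingOpSketch {
    static Object put(LargeMapSketch mapSeenByThread, Object key, Object value) {
        StampedLock lock = mapSeenByThread.lock;
        long stamp = lock.writeLock();          // may block while another thread promotes
        try {
            // Because old and new maps share one lock, acquiring it here also
            // excludes writers on the new map.  Walk forward to the map that is
            // actually installed now and perform the operation there.
            LargeMapSketch target = mapSeenByThread;
            while (target.newerMap != null) {
                target = target.newerMap;
            }
            return doPut(target, key, value);   // hypothetical helper
        } finally {
            lock.unlockWrite(stamp);
        }
    }

    private static Object doPut(LargeMapSketch map, Object key, Object value) {
        // Placeholder for the real insertion logic.
        return null;
    }
}
```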

@aardvark179 aardvark179 marked this pull request as ready for review December 30, 2024 16:31
@rbri
Collaborator

rbri commented Dec 30, 2024

Another really interesting observation; I guess the same is true for htmlunit. Will try it next year 😉

@aardvark179 aardvark179 force-pushed the aardvark179-remove-thread-safe-map-continaer branch from 2a70a9a to 5b1569b Compare January 9, 2025 20:14
@aardvark179
Contributor Author

Rebased after merge of #1782.

@aardvark179 aardvark179 force-pushed the aardvark179-remove-thread-safe-map-continaer branch 3 times, most recently from 8ed6992 to 405b38c Compare January 23, 2025 14:26
@aardvark179 aardvark179 force-pushed the aardvark179-remove-thread-safe-map-continaer branch from 405b38c to f4e894a Compare January 28, 2025 11:25
@gbrail
Collaborator

gbrail commented Jan 30, 2025

Just an update, because I am interested in this work continuing -- I can see a 15-20% performance regression for some of the V8 benchmarks when we introduce the "SingleEntrySlotMap." I'd like to spend more time trying to understand that first. If you have a chance I'd love to have your input too. Thanks!

@aardvark179
Contributor Author

aardvark179 commented Jan 30, 2025

I’ll take a look. Which benchmarks are you seeing the regression on, and can you give more details on JVM version and architecture?

V8 deltaBlue in interpreted mode? I figured that wasn't too important since it was in interpreted mode, but I think it should hopefully be recoverable.

@aardvark179
Contributor Author

Okay, I see part of the problem here. The increased set of types seen in the get methods is changing the inlining. On deltablue this can cause a pretty big slowdown. It’s pretty easy to recover the performance in the compiled case, but the interpreted version may take some more work.

@gbrail
Collaborator

gbrail commented Jan 31, 2025

Thanks -- I have been trying to improve performance bit by bit for code running in compiled mode for a long time, so that's why I usually check it. Specifically, I just re-ran and saw that "earley" and "deltaBlue," running in compiled mode, are both about 10% slower than we were with 1.8.0. I did a "git bisect" earlier in the week and it looked like the difference started to show up with "ab43fc90f4eb21beb050b4e77cbf0a6cd03a6010", which introduced the single-entry slot map.

This is on Intel, Windows, Java 21.0.5.

(I will experiment with making the "compute" method of that class work more like EmbeddedSlotMap and see if that changes anything.)

I appreciate you looking at this -- when you get a chance I'd love to see how you're diagnosing those problems with inlining and compiling in general, since they seem to have a big impact and I want to know what to look for.

@aardvark179
Contributor Author

Mostly I’m looking at async-profiler output on the benchmarks, which shows which methods have been inlined. You generally want to bump up the iterations to get a decent number of samples.

I might have to generate some compiler logs in this case because I can recover most of the perf pretty easily, but I’m failing to get back that last little bit, and it feels quite fragile.

The hottest area of execution in most cases is property access, so I think it’s worth talking about how this area can be optimised further and the trade offs in that.

I’m in the UK (so UTC +0), let me know what sort of times are good for you and we can talk about stuff.

@aardvark179
Contributor Author

Okay, I've taken a bit of a look at this, and I see 30% variation between different benchmark runs, on both the old and the new code. It seems to be caused by variation in compilation (I see a different set of methods getting inlined in the two cases). Adding -XX:-BackgroundCompilation to the JVM args that JMH passes appears to reduce this variation a lot. I'll try doing two full runs on master and the preceding commit and see what those look like.
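
(For reference, one way to pass that flag through JMH is the @Fork annotation's jvmArgsAppend, or -jvmArgsAppend on the runner's command line; the benchmark below is just a placeholder, not one of Rhino's benchmarks.)

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;

// Disables background compilation for this fork to reduce run-to-run
// variation in which methods get inlined.
@Fork(value = 1, jvmArgsAppend = {"-XX:-BackgroundCompilation"})
public class CompilationNoiseBenchmark {
    @Benchmark
    public int placeholder() {
        return 42; // stand-in workload
    }
}
```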

@gbrail
Collaborator

gbrail commented Feb 8, 2025

I have a separate, very idle, Linux machine now, so I have a pretty consistent environment, and I see what you see -- one out of three benchmark runs, regardless of commit, will have that variation that you suggest. I also tried -XX:-BackgroundCompilation and that didn't seem to help.

So given all that, and that I can't see any reason why this particular change would make things worse, I'm going to do one more test and merge this, and thanks for all of your hard work!

I may create a doc with ideas about all this because I have tried an unbelievable number of ways to make Rhino significantly faster and with very little success!

@gbrail gbrail merged commit a015709 into mozilla:master Feb 8, 2025
3 checks passed
@aardvark179
Contributor Author

I’m seeing some consistency in which tests regress, and I have the germ of an idea for a path to optimising this. I’ll try a little prototyping and write something up.
