Thread safe slot maps without containers #1785
Conversation
Another really interesting observation. I guess the same is true for htmlunit. Will try it next year 😉
Rebased after merge of #1782.
Just an update, because I am interested in this work continuing -- I can see a 15-20% performance regression for some of the V8 benchmarks when we introduce the SingleEntrySlotMap. I'd like to spend more time trying to understand that first. If you have a chance I'd love to have your input too. Thanks!
I’ll take a look. Which benchmarks are you seeing the regression on, and can you give more details on JVM version and architecture? V8 deltaBlue in interpreted mode? I figured that wasn't too important since it was in interpreted mode, but I think it should hopefully be recoverable.
Okay, I see part of the problem here. The increased set of types seen in the get methods is changing the inlining. On deltaBlue this can cause a pretty big slowdown. It’s pretty easy to recover the performance in the compiled case, but the interpreted version may take some more work.
Thanks -- I have been trying to improve performance bit by bit for code running in compiled mode for a long time, so that's why I usually check it. Specifically, I just re-ran and saw that "earley" and "deltaBlue", running in compiled mode, are both about 10% slower than we were with 1.8.0. I did a "git bisect" earlier in the week and it looked like the difference started to show up with "ab43fc90f4eb21beb050b4e77cbf0a6cd03a6010", which introduced the single-entry slot map. This is on Intel, Windows, Java 21.0.5. (I will experiment with making the "compute" method of that class work more like EmbeddedSlotMap and see if that changes anything.) I appreciate you looking at this -- when you get a chance I'd love to see how you're diagnosing those problems with inlining and compiling in general, since they seem to have a big impact and I want to know what to look for.
Mostly I’m looking at async-profiler output on the benchmarks, which will show which methods have been inlined. You generally want to bump up the iterations to get a decent number of samples. I might have to generate some compiler logs in this case, because I can recover most of the perf pretty easily, but I’m failing to get back that last little bit, and it feels quite fragile. The hottest area of execution in most cases is property access, so I think it’s worth talking about how this area can be optimised further and the trade-offs involved. I’m in the UK (so UTC+0), let me know what sort of times are good for you and we can talk about stuff.
Okay, I've taken a bit of a look at this, and I see 30% variation between different benchmark runs, on both the old and the new code. It seems to be caused by variations in compilation (I see a different set of methods getting inlined in the two cases). Adding
I have a separate, very idle, Linux machine now, so I have a pretty consistent environment, and I see what you see -- one out of three benchmark runs, regardless of commit, will show the variation you describe. I also tried -XX:-BackgroundCompilation and that didn't seem to help. So given all that, and that I can't see any reason why this particular change would make things worse, I'm going to do one more test and merge this. Thanks for all of your hard work! I may create a doc with ideas about all this, because I have tried an unbelievable number of ways to make Rhino significantly faster, with very little success!
I have some consistency in the tests I see regress, and the germ of an idea for a path to optimise this. I’ll try a little prototyping and write something up.
Thread safe slot maps without containers
This is a stacked PR on top of
To enable thread safe slot maps without containers we want two properties to hold true:
Compare and exchange operation
We utilise the java.lang.invoke.VarHandle methods to perform this operation. These are not available on older versions of Android, or Java 8, but if thread safe maps are required in those environments then SlotMapOwner could be refactored to use a java.util.concurrent.atomic.AtomicReference at the expense of one object (and one level of indirection).
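For illustration, here is a minimal sketch of how a VarHandle over the owner's map field can provide this compare and exchange. The Owner class, field name, and helper method are assumptions for the example, not the PR's actual code.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.Map;

// Illustrative sketch only; class and field names are assumptions, not the PR's classes.
class Owner {
    // The currently installed slot map, modelled here as a plain Map.
    private volatile Map<String, Object> map = Map.of();

    private static final VarHandle MAP;
    static {
        try {
            MAP = MethodHandles.lookup().findVarHandle(Owner.class, "map", Map.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Try to swap `expected` for `replacement`; returns the value the field held
    // at the moment of the exchange (== expected when the swap succeeded).
    @SuppressWarnings("unchecked")
    Map<String, Object> compareAndExchangeMap(Map<String, Object> expected,
                                              Map<String, Object> replacement) {
        return (Map<String, Object>) MAP.compareAndExchange(this, expected, replacement);
    }
}
```

On Java 8 or older Android, the same shape can be expressed with a java.util.concurrent.atomic.AtomicReference field and its compareAndSet/get methods, at the cost of the extra object mentioned above.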
Promotion and lock sharing
Empty maps to single entry maps
Promoting from the empty map to a single entry slot map is simple: the empty map and single entry maps are immutable, so we can construct the single entry map, perform the compare and exchange, and then we are either done or we ask the currently installed map to perform the operation.
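A minimal sketch of that path, assuming an AtomicReference owner field for brevity (the VarHandle variant behaves the same way) and hypothetical EmptyMap/SingleEntryMap stand-ins rather than the PR's real classes:

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-ins for the immutable empty and single entry maps.
interface MiniSlotMap {
    void put(AtomicReference<MiniSlotMap> owner, String name, Object value);
    Object get(String name);
}

final class EmptyMap implements MiniSlotMap {
    public void put(AtomicReference<MiniSlotMap> owner, String name, Object value) {
        // Both maps are immutable, so build the candidate and try to install it.
        MiniSlotMap candidate = new SingleEntryMap(name, value);
        if (!owner.compareAndSet(this, candidate)) {
            // Lost the race: ask whatever map is installed now to do the work.
            owner.get().put(owner, name, value);
        }
    }
    public Object get(String name) { return null; }
}

final class SingleEntryMap implements MiniSlotMap {
    private final String name;
    private final Object value;
    SingleEntryMap(String name, Object value) { this.name = name; this.value = value; }
    public void put(AtomicReference<MiniSlotMap> owner, String name, Object value) {
        // Promotion to a larger map happens here; see the next section.
        throw new UnsupportedOperationException("sketch: see promotion to larger maps");
    }
    public Object get(String name) { return this.name.equals(name) ? value : null; }
}
```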
Single entry maps to larger maps
Again, we can construct the larger ThreadSafeEmbeddedSlotMap and perform a compare and exchange operation to install it. The exception to this is the compute operation, where we install the map (as we know a mutation will occur) and then perform the compute operation on this new map. This is done to avoid any side effects that might result from performing the compute operation twice.
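A minimal sketch of the compute special case, with a plain HashMap standing in for the larger thread safe map and all names assumed for the example:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BiFunction;

// Hypothetical sketch: install the larger map first, then run the compute
// callback exactly once, so any side effects in the callback happen only once.
final class ComputePromotion {
    static Object compute(AtomicReference<Map<String, Object>> owner,
                          Map<String, Object> current,            // the single entry map
                          String name,
                          BiFunction<String, Object, Object> remapper) {
        Map<String, Object> larger = new HashMap<>(current);      // copy the single entry
        if (!owner.compareAndSet(current, larger)) {
            larger = owner.get();                                 // another thread promoted first
        }
        // The callback runs once, on the map that is actually installed.
        return larger.compute(name, remapper);
    }
}
```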
Promotion between large maps
The promotion of new maps looks something like the following. Blue links represent references that exist up until the new map is installed, and red links represent objects and references that only exist in the new map. Since the old map cannot be mutated, nor the new map installed, without the lock, this operation does not need to be entirely atomic. Note that both the old and new maps have the same lock object.
This approach of sharing a single lock enables sequences like the following to work correctly:
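A minimal, hypothetical sketch of a promotion under the shared lock; the class, the resize threshold, and the re-check of the owner are assumptions for illustration, not the PR's implementation:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of promotion between the larger maps. Both the old and
// the new map share the same lock object, and installation only happens while
// that lock is held, so the copy does not need to be atomic on its own.
final class LockedMap {
    final Object lock;                          // shared across promotions
    final Map<String, Object> slots = new HashMap<>();

    LockedMap(Object lock) { this.lock = lock; }

    void put(AtomicReference<LockedMap> owner, String name, Object value) {
        synchronized (lock) {
            LockedMap target = owner.get();     // re-read under the lock: a racing
            if (target != this) {               // promoter may already have installed a
                target.put(owner, name, value); // new map sharing this same lock object
                return;
            }
            if (slots.size() >= 8 && !slots.containsKey(name)) {   // illustrative threshold
                LockedMap bigger = new LockedMap(lock);            // same lock object
                bigger.slots.putAll(slots);     // old map cannot change: we hold the lock
                bigger.slots.put(name, value);
                owner.set(bigger);              // install before releasing the lock
                return;
            }
            slots.put(name, value);
        }
    }
}
```

Because every map in the chain synchronises on the same object, a thread that picked up a stale reference to the old map still serialises with the promoter and with threads already using the new map.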