-
-
Notifications
You must be signed in to change notification settings - Fork 457
Refactor websocket concurrency handling #5165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Ravi Nadahar <[email protected]>
|
@florian-h05 FYI |
|
This pull request has been mentioned on openHAB Community. There might be relevant details there: https://community.openhab.org/t/openhab-stop-working/166412/9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request refactors websocket concurrency handling to address issue #5162 by implementing a comprehensive synchronization strategy across websocket classes. The changes simplify lock management and protect all mutable fields with appropriate synchronization primitives.
- Reduced the number of locks from two to one in
LogWebSocketto minimize deadlock risk - Added explicit synchronization documentation comments indicating which fields require guarding
- Made
TopicEventFiltertruly immutable and thread-safe by using unmodifiable collections
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| TopicEventFilter.java | Refactored to use a single immutable topicsRegexes field, removing nullable pattern field and making the class properly thread-safe |
| LogWebSocket.java | Comprehensive concurrency refactor: added synchronization comments, changed lastSentTime and lastSequence to volatile, used CopyOnWriteArrayList for patterns, unified lock strategy to single monitor ("this") |
| EventWebSocket.java | Added synchronization to all mutable fields, fixed bug where source filter was logging typeFilter value, improved thread-safety by copying filter references in processEvent() |
| CommonWebSocketServlet.java | Replaced HashMap with ConcurrentHashMap for thread-safe access to connection handlers |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
....openhab.core.io.websocket/src/main/java/org/openhab/core/io/websocket/log/LogWebSocket.java
Outdated
Show resolved
Hide resolved
....openhab.core.io.websocket/src/main/java/org/openhab/core/io/websocket/log/LogWebSocket.java
Show resolved
Hide resolved
|
Input wanted: When considering the two Copilot comments, I think I see a problem that might lead to log messages being sent in a non-chronological order in edge-cases. It doesn't compromise thread-safety, but it might be unexpected, both by the receiving code and the user reading it, that entries aren't in chronological order. My goal has been to change as little as possible while preventing the concurrency issues, modifying the logic itself slightly increases the chance of introducing a bug, which is why I'm hesitant. At the same time, potentially sending events in non-chronological order isn't ideal. Should I try to modify it or not? |
|
I think you should try to modify so logs are sent in correct order. |
|
@florian-h05 I have done the change, but during my testing I have an odd problem that I can't fathom that is caused by my changes: After a while, 15 minutes-ish, the socket is disconnected (I don't know by whom), after which no more log events are received. The browser console keeps logging the following when trying to send a heartbeat: log-viewer.vue:659 WebSocket is already in CLOSING or CLOSED state.I just cannot understand how I could have made any changes that will close the socket, and I didn't see this problem when I tested this earlier. It could be some kind of coincidence, but I'm wondering if any changes have been done in MainUI in the meanwhile that might have to do with this? Also, it seems strange to me to have a heartbeat, but not catching the fact that it fails and try to do something about it. When the heartbeat fails, it should either indicate an error in some way, or try to reconnect - or maybe both? If I "stop" the logging from the UI and then "start" it again, entries will again be received, because it reconnects the socket. edit: I think it might just found the cause: 2025-12-05 15:41:55.410 [INFO ] [b.core.io.websocket.log.LogWebSocket] - WebSocket error: java.util.concurrent.TimeoutException: Idle timeout expired: 10001/10000 msWhy is there an idle timeout on the logging socket? Does that make sense? If no logs are sent for 10 seconds, it will disconnect..? Or is this what the heartbeats are supposed to prevent? Something isn't working as it should here... |
|
I think I found another piece of the puzzle of why logging stops, I found this exception: org.ops4j.pax.logging.pax-logging-log4j2 [pax-logging] ERROR : 'org.openhab.core.io.websocket.log.LogWebSocket@20e0be21' is removed as a LogListener since it threw an exception.
java.lang.IllegalStateException: Blocking message pending 10000 for BLOCKING
at org.eclipse.jetty.websocket.common.WebSocketRemoteEndpoint.lockMsg(WebSocketRemoteEndpoint.java:133)
at org.eclipse.jetty.websocket.common.WebSocketRemoteEndpoint.sendString(WebSocketRemoteEndpoint.java:392)
at org.openhab.core.io.websocket.log.LogWebSocket.sendMessage(LogWebSocket.java:273)
at org.openhab.core.io.websocket.log.LogWebSocket.logged(LogWebSocket.java:305)
at org.ops4j.pax.logging.spi.support.LogReaderServiceImpl.fire(LogReaderServiceImpl.java:105)
at org.ops4j.pax.logging.spi.support.LogReaderServiceImpl.fireEvent(LogReaderServiceImpl.java:83)
at org.ops4j.pax.logging.log4j2.internal.PaxLoggingServiceImpl.handleEvents(PaxLoggingServiceImpl.java:410)
at org.ops4j.pax.logging.log4j2.internal.PaxLoggerImpl.doLog0(PaxLoggerImpl.java:1145)
at org.ops4j.pax.logging.log4j2.internal.PaxLoggerImpl.doLog(PaxLoggerImpl.java:1110)
at org.ops4j.pax.logging.log4j2.internal.PaxLoggerImpl.debug(PaxLoggerImpl.java:254)
at org.ops4j.pax.logging.internal.TrackingLogger.debug(TrackingLogger.java:167)
at org.ops4j.pax.logging.slf4j.Slf4jLogger.debug(Slf4jLogger.java:479)
at org.openhab.core.common.registry.AbstractManagedProvider.update(AbstractManagedProvider.java:117)
at org.openhab.core.thing.internal.ThingManagerImpl.thingUpdated(ThingManagerImpl.java:250)
at org.openhab.core.thing.internal.ThingHandlerCallbackImpl.thingUpdated(ThingHandlerCallbackImpl.java:156)
at org.openhab.core.thing.binding.BaseThingHandler.updateConfiguration(BaseThingHandler.java:532)
at org.openhab.binding.milllan.internal.AbstractMillThingHandler.updateConfiguration(AbstractMillThingHandler.java:3572)
at org.openhab.binding.milllan.internal.AbstractMillThingHandler.pollOpenWindowParameters(AbstractMillThingHandler.java:1776)
at org.openhab.binding.milllan.internal.MillPanelHeaterHandler$PollInfrequent.run(MillPanelHeaterHandler.java:116)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
org.ops4j.pax.logging.pax-logging-log4j2 [pax-logging] ERROR : 'org.openhab.core.io.websocket.log.LogWebSocket@fe1fea5' is removed as a LogListener since it threw an exception.So, it seems to me that when the websocket times out, an edit: Actually, I think it might be worse. It seems like the logging thread is used to send the event through the websocket, and the logging framework probably have a timeout mechanism on its thread pool that kills "hanging" loggers. |
|
I found yet another problem: If I think this class must have its own, independent threads so that there is no logging whatsoever. The act of handling a log entry cannot create a new log entry, or you have the "circle of doom". The only way that I can see to make sure of this, is to completely control the thread pool. But, that might introduce challenges with managing the pool and shutting it down. I'm not sure what controls the lifecycle of a |
…he logger thread Signed-off-by: Ravi Nadahar <[email protected]>
Signed-off-by: Ravi Nadahar <[email protected]>
|
I have addressed the things that I believe can be addressed here:
The heartbeat/reconnect issues must be solved in MainUI AFAICU. I've given this a fair bit of testing, and the latest iteration has been dead stable for me. |
This is my attempt to fix #5162.
I've only redone the concurrency handling, that is, the strategy of locks/synchronization/volatile in addition to some minor "nullness tweaks".
It's quite complicated to explain exactly why I came to the design I did, it's been a lot of back and forth in my mind regarding some aspects. I'm not saying that this is the only way to solve it, but I think this is a solution. The exact choices depend on some things that must be guesstimated when it comes to performance.
What I have done is basically to design concurrency handling from scratch in the classes that I've modified. I've added some comments to document the strategy. I've not tried to "fix" what was there, because the old strategy wasn't documented, so it's impossible for me to guess where things were missing or where the strategy was flawed. Generally, I can say that I've protected all mutable fields, not just some, and I've reduced the number of locks from two to one in
LogWebSocket. The reason is that I think whatever performance benefit using two offered was very negligible, if any at all, while the chance for deadlocks increases exponentially when juggling several locks. To avoid deadlocks when using more than one lock, one must ensure that they are always locked in the same order, for any potential path, including during exceptions. This can be hard to do, requires a lot of dicipline, and can be hard to document so that any future modifications of the code is done accordingly.Feel free to ask about any concrete design choice, and I'll try to explain as best as I can why I did what I did.
I've given it aroudn 16 hours of testing with a lot of logging enabled, without any hiccups, but as always, it's impossible to test all circumstances that the code will see "on the battlefield".