
Conversation

@Nadahar Nadahar commented Dec 1, 2025

This is my attempt to fix #5162.

I've only redone the concurrency handling, that is, the strategy for locks/synchronization/volatile fields, in addition to some minor "nullness tweaks".

It's quite complicated to explain exactly why I arrived at this design; there has been a lot of back and forth in my mind regarding some aspects. I'm not saying that this is the only way to solve it, but I think this is a solution. The exact choices depend on some things that must be guesstimated when it comes to performance.

What I have done is basically to design the concurrency handling from scratch in the classes that I've modified, and I've added some comments to document the strategy. I've not tried to "fix" what was there, because the old strategy wasn't documented, so it's impossible for me to guess where things were missing or where the strategy was flawed. Generally, I can say that I've protected all mutable fields, not just some, and I've reduced the number of locks from two to one in LogWebSocket. The reason is that I think whatever performance benefit using two locks offered was negligible, if there was any at all, while the chance of deadlocks increases sharply when juggling several locks. To avoid deadlocks when using more than one lock, one must ensure that they are always acquired in the same order, for every potential path, including during exceptions. This can be hard to do, requires a lot of discipline, and can be hard to document so that any future modifications of the code are done accordingly.
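
To illustrate the point about lock ordering (with hypothetical classes and methods, not the actual LogWebSocket code): a two-lock design deadlocks as soon as two paths acquire the locks in different orders, while a single monitor needs no ordering rules at all.

    // Hypothetical two-lock design: every path must acquire the locks in the
    // same order, or two threads can each hold one lock and wait forever.
    class TwoLockSketch {
        private final Object stateLock = new Object();
        private final Object sendLock = new Object();

        void pathA() {
            synchronized (stateLock) {      // order: stateLock -> sendLock
                synchronized (sendLock) {
                    // mutate state, then send
                }
            }
        }

        void pathB() {
            synchronized (sendLock) {       // order: sendLock -> stateLock (reversed!)
                synchronized (stateLock) {
                    // if pathA and pathB run concurrently, this can deadlock
                }
            }
        }
    }

    // Single-monitor alternative: one lock guards all mutable state, so there is
    // no ordering to get wrong and the strategy is trivial to document.
    class SingleLockSketch {
        private final Object lock = new Object();

        void anyPath() {
            synchronized (lock) {
                // mutate state and prepare the message; a long blocking send can
                // be done outside the lock once the state is consistent
            }
        }
    }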

Feel free to ask about any concrete design choice, and I'll try to explain as best as I can why I did what I did.

I've given it around 16 hours of testing with a lot of logging enabled, without any hiccups, but as always, it's impossible to test all circumstances that the code will see "on the battlefield".

@Nadahar Nadahar requested a review from a team as a code owner December 1, 2025 13:17
Nadahar commented Dec 1, 2025

@florian-h05 FYI

@openhab-bot (Collaborator) commented:

This pull request has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/openhab-stop-working/166412/9

@wborn wborn requested a review from Copilot December 4, 2025 22:34
Copilot finished reviewing on behalf of wborn December 4, 2025 22:37
Copilot AI left a comment

Pull request overview

This pull request refactors websocket concurrency handling to address issue #5162 by implementing a comprehensive synchronization strategy across websocket classes. The changes simplify lock management and protect all mutable fields with appropriate synchronization primitives.

  • Reduced the number of locks from two to one in LogWebSocket to minimize deadlock risk
  • Added explicit synchronization documentation comments indicating which fields require guarding
  • Made TopicEventFilter truly immutable and thread-safe by using unmodifiable collections

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • TopicEventFilter.java: Refactored to use a single immutable topicsRegexes field, removing the nullable pattern field and making the class properly thread-safe
  • LogWebSocket.java: Comprehensive concurrency refactor: added synchronization comments, changed lastSentTime and lastSequence to volatile, used CopyOnWriteArrayList for patterns, unified the lock strategy to a single monitor ("this")
  • EventWebSocket.java: Added synchronization to all mutable fields, fixed a bug where the source filter was logging the typeFilter value, improved thread-safety by copying filter references in processEvent()
  • CommonWebSocketServlet.java: Replaced HashMap with ConcurrentHashMap for thread-safe access to connection handlers
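
In rough terms (illustrative class and field names, not the actual openHAB code), the mix of primitives described above looks like this:

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.regex.Pattern;

    class ConcurrencySketch {

        // volatile: single-word values written by one thread and read by others,
        // where visibility (not atomic compound updates) is what matters
        private volatile long lastSentTime;
        private volatile long lastSequence;

        // CopyOnWriteArrayList: iteration stays safe while another thread adds or removes entries
        private final List<Pattern> patterns = new CopyOnWriteArrayList<>();

        // ConcurrentHashMap: thread-safe handler lookup and updates without an external lock
        private final Map<String, Object> connectionHandlers = new ConcurrentHashMap<>();

        // remaining mutable state is guarded by a single monitor ("this")
        private boolean enabled; // guarded by "this"

        synchronized void setEnabled(boolean enabled) {
            this.enabled = enabled;
        }

        synchronized boolean isEnabled() {
            return enabled;
        }
    }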


Nadahar commented Dec 5, 2025

Input wanted: When considering the two Copilot comments, I think I see a problem that might lead to log messages being sent in non-chronological order in edge cases. It doesn't compromise thread-safety, but it might be unexpected, both for the receiving code and for the user reading it, that entries aren't in chronological order.

My goal has been to change as little as possible while preventing the concurrency issues; modifying the logic itself slightly increases the chance of introducing a bug, which is why I'm hesitant. At the same time, potentially sending events in non-chronological order isn't ideal. Should I try to modify it or not?
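
If I read the edge case correctly, it is of roughly this shape (hypothetical names, heavily simplified): the sequence number is taken while holding the lock, but the actual send happens outside it, so a thread can be preempted between the two steps and a later message can reach the socket first.

    class OutOfOrderSketch {
        private final Object lock = new Object();
        private long nextSequence;

        void logged(String message) {
            long sequence;
            synchronized (lock) {
                sequence = nextSequence++;   // thread A gets 1, thread B gets 2
            }
            // thread A can be descheduled here, so message 2 may be sent before message 1
            send(sequence + " " + message);
        }

        private void send(String payload) {
            // blocking websocket send
        }
    }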

@florian-h05 (Contributor) commented:

I think you should try to modify it so logs are sent in the correct order.

Nadahar commented Dec 5, 2025

@florian-h05 I have made the change, but during my testing I ran into an odd problem that I can't see how my changes could cause: after a while, 15 minutes-ish, the socket is disconnected (I don't know by whom), after which no more log events are received.

The browser console keeps logging the following when trying to send a heartbeat:

log-viewer.vue:659  WebSocket is already in CLOSING or CLOSED state.

I just cannot understand how I could have made any changes that would close the socket, and I didn't see this problem when I tested earlier. It could be some kind of coincidence, but I'm wondering if any changes have been made to MainUI in the meantime that might have to do with this?

Also, it seems strange to me to have a heartbeat but not catch the fact that it fails and try to do something about it. When the heartbeat fails, it should either indicate an error in some way, or try to reconnect - or maybe both?

If I "stop" the logging from the UI and then "start" it again, entries will again be received, because it reconnects the socket.

edit: I think I might just have found the cause:

2025-12-05 15:41:55.410 [INFO ] [b.core.io.websocket.log.LogWebSocket] - WebSocket error: java.util.concurrent.TimeoutException: Idle timeout expired: 10001/10000 ms

Why is there an idle timeout on the logging socket? Does that make sense?

If no logs are sent for 10 seconds, it will disconnect? Or is this what the heartbeats are supposed to prevent? Something isn't working as it should here...
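
For reference, the timeout in question is configurable per session in Jetty; a minimal sketch (Jetty 9-style annotated endpoint, names assumed, not the actual LogWebSocket wiring) of where such a value could be raised:

    import org.eclipse.jetty.websocket.api.Session;
    import org.eclipse.jetty.websocket.api.annotations.OnWebSocketConnect;
    import org.eclipse.jetty.websocket.api.annotations.WebSocket;

    @WebSocket
    public class IdleTimeoutSketch {

        @OnWebSocketConnect
        public void onConnect(Session session) {
            // 300000 ms is an arbitrary example; the point is to use something
            // comfortably larger than the UI heartbeat interval instead of 10 s
            session.setIdleTimeout(300_000);
        }
    }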

Nadahar commented Dec 5, 2025

I think I found another piece of the puzzle of why logging stops; I found this exception:

org.ops4j.pax.logging.pax-logging-log4j2 [pax-logging] ERROR : 'org.openhab.core.io.websocket.log.LogWebSocket@20e0be21' is removed as a LogListener since it threw an exception.
java.lang.IllegalStateException: Blocking message pending 10000 for BLOCKING
        at org.eclipse.jetty.websocket.common.WebSocketRemoteEndpoint.lockMsg(WebSocketRemoteEndpoint.java:133)
        at org.eclipse.jetty.websocket.common.WebSocketRemoteEndpoint.sendString(WebSocketRemoteEndpoint.java:392)
        at org.openhab.core.io.websocket.log.LogWebSocket.sendMessage(LogWebSocket.java:273)
        at org.openhab.core.io.websocket.log.LogWebSocket.logged(LogWebSocket.java:305)
        at org.ops4j.pax.logging.spi.support.LogReaderServiceImpl.fire(LogReaderServiceImpl.java:105)
        at org.ops4j.pax.logging.spi.support.LogReaderServiceImpl.fireEvent(LogReaderServiceImpl.java:83)
        at org.ops4j.pax.logging.log4j2.internal.PaxLoggingServiceImpl.handleEvents(PaxLoggingServiceImpl.java:410)
        at org.ops4j.pax.logging.log4j2.internal.PaxLoggerImpl.doLog0(PaxLoggerImpl.java:1145)
        at org.ops4j.pax.logging.log4j2.internal.PaxLoggerImpl.doLog(PaxLoggerImpl.java:1110)
        at org.ops4j.pax.logging.log4j2.internal.PaxLoggerImpl.debug(PaxLoggerImpl.java:254)
        at org.ops4j.pax.logging.internal.TrackingLogger.debug(TrackingLogger.java:167)
        at org.ops4j.pax.logging.slf4j.Slf4jLogger.debug(Slf4jLogger.java:479)
        at org.openhab.core.common.registry.AbstractManagedProvider.update(AbstractManagedProvider.java:117)
        at org.openhab.core.thing.internal.ThingManagerImpl.thingUpdated(ThingManagerImpl.java:250)
        at org.openhab.core.thing.internal.ThingHandlerCallbackImpl.thingUpdated(ThingHandlerCallbackImpl.java:156)
        at org.openhab.core.thing.binding.BaseThingHandler.updateConfiguration(BaseThingHandler.java:532)
        at org.openhab.binding.milllan.internal.AbstractMillThingHandler.updateConfiguration(AbstractMillThingHandler.java:3572)
        at org.openhab.binding.milllan.internal.AbstractMillThingHandler.pollOpenWindowParameters(AbstractMillThingHandler.java:1776)
        at org.openhab.binding.milllan.internal.MillPanelHeaterHandler$PollInfrequent.run(MillPanelHeaterHandler.java:116)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
org.ops4j.pax.logging.pax-logging-log4j2 [pax-logging] ERROR : 'org.openhab.core.io.websocket.log.LogWebSocket@fe1fea5' is removed as a LogListener since it threw an exception.

So, it seems to me that when the websocket times out, an IllegalStateException is thrown, but sendMessage() only expects IOExceptions, so it's not caught. As a result, it propagates up into the logging framework, which removes the websocket as a LogListener because it threw an exception. This websocket will thus never receive another log entry.

edit: Actually, I think it might be worse. It seems like the logging thread is used to send the event through the websocket, and the logging framework probably has a timeout mechanism on its thread pool that kills "hanging" loggers.
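
A sketch of the defensive shape this implies (method and field names assumed, not the actual sendMessage()): Jetty's RemoteEndpoint.sendString() can throw unchecked exceptions like the IllegalStateException above in addition to IOException, so catching both keeps the failure from escaping into pax-logging.

    import java.io.IOException;

    import org.eclipse.jetty.websocket.api.RemoteEndpoint;

    class SendSketch {
        private final RemoteEndpoint remote;

        SendSketch(RemoteEndpoint remote) {
            this.remote = remote;
        }

        void sendMessage(String message) {
            try {
                remote.sendString(message);
            } catch (IOException | RuntimeException e) {
                // handle locally (e.g. close the session) instead of letting the
                // exception propagate into the logging framework, which would
                // remove this instance as a LogListener
            }
        }
    }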

Nadahar commented Dec 5, 2025

I found yet another problem: If DEBUG logging is on, so that the logging in WrappedScheduledExecutorService is active, flushing the buffer using an executor might trigger a "circle of doom" where each call to the executor triggers a log entry, which then triggers another call to the executor, which basically goes on "forever" until the system crashes.

I think this class must have its own, independent threads so that there is no logging whatsoever. The act of handling a log entry cannot create a new log entry, or you have the "circle of doom". The only way that I can see to make sure of this is to completely control the thread pool. But that might introduce challenges with managing the pool and shutting it down. I'm not sure what controls the lifecycle of a LogWebSocket... is it created by jetty somehow?
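
The "completely control the thread pool" idea could look roughly like this (a minimal sketch with assumed names; lifecycle/shutdown handling is only hinted at): a plain single-threaded executor whose thread factory and tasks never log, so handling a log event cannot generate new log events.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class LogSenderExecutorSketch {

        private final ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "log-websocket-sender"); // name is an assumption
            thread.setDaemon(true);
            return thread;
        });

        void submitSend(Runnable sendTask) {
            executor.execute(sendTask); // the task itself must not log either
        }

        void dispose() {
            executor.shutdownNow(); // must be called by whatever owns the socket's lifecycle
        }
    }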

Nadahar commented Dec 6, 2025

I have addressed the things that I believe can be addressed here:

  • FIFO delivery of log events should be ensured.
  • The blocking calls to send data over the websocket are no longer made on the thread provided by the logging framework.
  • A dedicated (single-threaded) executor is used instead of one provided by ThreadPoolManager, to make sure that no logging is generated by the executor when performing logging tasks (see the sketch after this list).
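
Taken together, the shape of the change is roughly this (assumed names, heavily simplified): the logger thread only enqueues, while the dedicated single-threaded executor drains the queue in FIFO order and performs the blocking send.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class FifoLogDeliverySketch {

        private final Deque<String> queue = new ArrayDeque<>(); // guarded by "this"
        private final ExecutorService sender = Executors.newSingleThreadExecutor();

        /** Called on the logging framework's thread: must be fast and must not block. */
        void logged(String serializedEvent) {
            synchronized (this) {
                queue.addLast(serializedEvent);
            }
            sender.execute(this::drain);
        }

        private void drain() {
            while (true) {
                String next;
                synchronized (this) {
                    next = queue.pollFirst();
                }
                if (next == null) {
                    return;
                }
                send(next); // blocking send, but on the dedicated sender thread, in FIFO order
            }
        }

        private void send(String payload) {
            // websocket remote.sendString(...) with its own error handling
        }
    }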

The heartbeat/reconnect issues must be solved in MainUI AFAICU.

I've given this a fair bit of testing, and the latest iteration has been dead stable for me.


Development

Successfully merging this pull request may close these issues:

LogWebSocket causes thread deadlock which makes OH grind to a halt
