
Conversation

@Nadahar Nadahar commented Dec 1, 2025

This is my attempt to fix #5162.

I've only redone the concurrency handling, that is, the strategy for locks/synchronization/volatile fields, in addition to some minor "nullness tweaks".

It's quite complicated to explain exactly why I arrived at this design; there has been a lot of back and forth in my mind regarding some aspects. I'm not saying that this is the only way to solve it, but I think this is a solution. The exact choices depend on some things that must be guesstimated when it comes to performance.

What I have done is basically to design the concurrency handling from scratch in the classes that I've modified, and I've added some comments to document the strategy. I've not tried to "fix" what was there, because the old strategy wasn't documented, so it's impossible for me to guess where things were missing or where the strategy was flawed. Generally, I can say that I've protected all mutable fields, not just some, and I've reduced the number of locks from two to one in LogWebSocket. The reason is that I think whatever performance benefit using two locks offered was negligible, if there was any at all, while the chance of deadlocks increases sharply when juggling several locks. To avoid deadlocks when using more than one lock, one must ensure that they are always acquired in the same order, for every potential path, including during exceptions. This can be hard to do, requires a lot of discipline, and can be hard to document so that any future modifications of the code are done accordingly.
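
To illustrate the point about lock ordering (with hypothetical classes and methods, not the actual LogWebSocket code): a two-lock design deadlocks as soon as two paths acquire the locks in different orders, while a single monitor needs no ordering rules at all.

    // Hypothetical two-lock design: every path must acquire the locks in the
    // same order, or two threads can each hold one lock and wait forever.
    class TwoLockSketch {
        private final Object stateLock = new Object();
        private final Object sendLock = new Object();

        void pathA() {
            synchronized (stateLock) {      // order: stateLock -> sendLock
                synchronized (sendLock) {
                    // mutate state, then send
                }
            }
        }

        void pathB() {
            synchronized (sendLock) {       // order: sendLock -> stateLock (reversed!)
                synchronized (stateLock) {
                    // if pathA and pathB run concurrently, this can deadlock
                }
            }
        }
    }

    // Single-monitor alternative: one lock guards all mutable state, so there is
    // no ordering to get wrong and the strategy is trivial to document.
    class SingleLockSketch {
        private final Object lock = new Object();

        void anyPath() {
            synchronized (lock) {
                // mutate state and prepare the message; a long blocking send can
                // be done outside the lock once the state is consistent
            }
        }
    }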

Feel free to ask about any concrete design choice, and I'll try to explain as best as I can why I did what I did.

I've given it around 16 hours of testing with a lot of logging enabled, without any hiccups, but as always, it's impossible to test all circumstances that the code will see "on the battlefield".

@Nadahar Nadahar requested a review from a team as a code owner December 1, 2025 13:17
Nadahar commented Dec 1, 2025

@florian-h05 FYI

@openhab-bot (Collaborator) commented:

This pull request has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/openhab-stop-working/166412/9

@wborn wborn requested a review from Copilot December 4, 2025 22:34
Copilot finished reviewing on behalf of wborn December 4, 2025 22:37
Copilot AI left a comment

Pull request overview

This pull request refactors websocket concurrency handling to address issue #5162 by implementing a comprehensive synchronization strategy across websocket classes. The changes simplify lock management and protect all mutable fields with appropriate synchronization primitives.

  • Reduced the number of locks from two to one in LogWebSocket to minimize deadlock risk
  • Added explicit synchronization documentation comments indicating which fields require guarding
  • Made TopicEventFilter truly immutable and thread-safe by using unmodifiable collections

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • TopicEventFilter.java: Refactored to use a single immutable topicsRegexes field, removing the nullable pattern field and making the class properly thread-safe
  • LogWebSocket.java: Comprehensive concurrency refactor: added synchronization comments, changed lastSentTime and lastSequence to volatile, used CopyOnWriteArrayList for patterns, unified the lock strategy to a single monitor ("this")
  • EventWebSocket.java: Added synchronization to all mutable fields, fixed a bug where the source filter was logging the typeFilter value, improved thread-safety by copying filter references in processEvent()
  • CommonWebSocketServlet.java: Replaced HashMap with ConcurrentHashMap for thread-safe access to connection handlers
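
In rough terms (illustrative class and field names, not the actual openHAB code), the mix of primitives described above looks like this:

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.regex.Pattern;

    class ConcurrencySketch {

        // volatile: single-word values written by one thread and read by others,
        // where visibility (not atomic compound updates) is what matters
        private volatile long lastSentTime;
        private volatile long lastSequence;

        // CopyOnWriteArrayList: iteration stays safe while another thread adds or removes entries
        private final List<Pattern> patterns = new CopyOnWriteArrayList<>();

        // ConcurrentHashMap: thread-safe handler lookup and updates without an external lock
        private final Map<String, Object> connectionHandlers = new ConcurrentHashMap<>();

        // remaining mutable state is guarded by a single monitor ("this")
        private boolean enabled; // guarded by "this"

        synchronized void setEnabled(boolean enabled) {
            this.enabled = enabled;
        }

        synchronized boolean isEnabled() {
            return enabled;
        }
    }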


Nadahar commented Dec 5, 2025

Input wanted: When considering the two Copilot comments, I think I see a problem that might lead to log messages being sent in non-chronological order in edge cases. It doesn't compromise thread-safety, but it might be unexpected, both for the receiving code and for the user reading it, that entries aren't in chronological order.

My goal has been to change as little as possible while preventing the concurrency issues; modifying the logic itself slightly increases the chance of introducing a bug, which is why I'm hesitant. At the same time, potentially sending events in non-chronological order isn't ideal. Should I try to modify it or not?
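
If I read the edge case correctly, it is of roughly this shape (hypothetical names, heavily simplified): the sequence number is taken while holding the lock, but the actual send happens outside it, so a thread can be preempted between the two steps and a later message can reach the socket first.

    class OutOfOrderSketch {
        private final Object lock = new Object();
        private long nextSequence;

        void logged(String message) {
            long sequence;
            synchronized (lock) {
                sequence = nextSequence++;   // thread A gets 1, thread B gets 2
            }
            // thread A can be descheduled here, so message 2 may be sent before message 1
            send(sequence + " " + message);
        }

        private void send(String payload) {
            // blocking websocket send
        }
    }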

@florian-h05 (Contributor) commented:

I think you should try to modify it so logs are sent in the correct order.

Nadahar commented Dec 5, 2025

@florian-h05 I have made the change, but during my testing I ran into an odd problem that I can't see how my changes could cause: after a while, 15 minutes-ish, the socket is disconnected (I don't know by whom), after which no more log events are received.

The browser console keeps logging the following when trying to send a heartbeat:

log-viewer.vue:659  WebSocket is already in CLOSING or CLOSED state.

I just cannot understand how I could have made any changes that would close the socket, and I didn't see this problem when I tested earlier. It could be some kind of coincidence, but I'm wondering if any changes have been made to MainUI in the meantime that might have to do with this?

Also, it seems strange to me to have a heartbeat but not catch the fact that it fails and try to do something about it. When the heartbeat fails, it should either indicate an error in some way, or try to reconnect - or maybe both?

If I "stop" the logging from the UI and then "start" it again, entries will again be received, because it reconnects the socket.

edit: I think I might just have found the cause:

2025-12-05 15:41:55.410 [INFO ] [b.core.io.websocket.log.LogWebSocket] - WebSocket error: java.util.concurrent.TimeoutException: Idle timeout expired: 10001/10000 ms

Why is there an idle timeout on the logging socket? Does that make sense?

If no logs are sent for 10 seconds, it will disconnect? Or is this what the heartbeats are supposed to prevent? Something isn't working as it should here...
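
For reference, the timeout in question is configurable per session in Jetty; a minimal sketch (Jetty 9-style annotated endpoint, names assumed, not the actual LogWebSocket wiring) of where such a value could be raised:

    import org.eclipse.jetty.websocket.api.Session;
    import org.eclipse.jetty.websocket.api.annotations.OnWebSocketConnect;
    import org.eclipse.jetty.websocket.api.annotations.WebSocket;

    @WebSocket
    public class IdleTimeoutSketch {

        @OnWebSocketConnect
        public void onConnect(Session session) {
            // 300000 ms is an arbitrary example; the point is to use something
            // comfortably larger than the UI heartbeat interval instead of 10 s
            session.setIdleTimeout(300_000);
        }
    }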

Nadahar commented Dec 5, 2025

I think I found another piece of the puzzle of why logging stops; I found this exception:

org.ops4j.pax.logging.pax-logging-log4j2 [pax-logging] ERROR : 'org.openhab.core.io.websocket.log.LogWebSocket@20e0be21' is removed as a LogListener since it threw an exception.
java.lang.IllegalStateException: Blocking message pending 10000 for BLOCKING
        at org.eclipse.jetty.websocket.common.WebSocketRemoteEndpoint.lockMsg(WebSocketRemoteEndpoint.java:133)
        at org.eclipse.jetty.websocket.common.WebSocketRemoteEndpoint.sendString(WebSocketRemoteEndpoint.java:392)
        at org.openhab.core.io.websocket.log.LogWebSocket.sendMessage(LogWebSocket.java:273)
        at org.openhab.core.io.websocket.log.LogWebSocket.logged(LogWebSocket.java:305)
        at org.ops4j.pax.logging.spi.support.LogReaderServiceImpl.fire(LogReaderServiceImpl.java:105)
        at org.ops4j.pax.logging.spi.support.LogReaderServiceImpl.fireEvent(LogReaderServiceImpl.java:83)
        at org.ops4j.pax.logging.log4j2.internal.PaxLoggingServiceImpl.handleEvents(PaxLoggingServiceImpl.java:410)
        at org.ops4j.pax.logging.log4j2.internal.PaxLoggerImpl.doLog0(PaxLoggerImpl.java:1145)
        at org.ops4j.pax.logging.log4j2.internal.PaxLoggerImpl.doLog(PaxLoggerImpl.java:1110)
        at org.ops4j.pax.logging.log4j2.internal.PaxLoggerImpl.debug(PaxLoggerImpl.java:254)
        at org.ops4j.pax.logging.internal.TrackingLogger.debug(TrackingLogger.java:167)
        at org.ops4j.pax.logging.slf4j.Slf4jLogger.debug(Slf4jLogger.java:479)
        at org.openhab.core.common.registry.AbstractManagedProvider.update(AbstractManagedProvider.java:117)
        at org.openhab.core.thing.internal.ThingManagerImpl.thingUpdated(ThingManagerImpl.java:250)
        at org.openhab.core.thing.internal.ThingHandlerCallbackImpl.thingUpdated(ThingHandlerCallbackImpl.java:156)
        at org.openhab.core.thing.binding.BaseThingHandler.updateConfiguration(BaseThingHandler.java:532)
        at org.openhab.binding.milllan.internal.AbstractMillThingHandler.updateConfiguration(AbstractMillThingHandler.java:3572)
        at org.openhab.binding.milllan.internal.AbstractMillThingHandler.pollOpenWindowParameters(AbstractMillThingHandler.java:1776)
        at org.openhab.binding.milllan.internal.MillPanelHeaterHandler$PollInfrequent.run(MillPanelHeaterHandler.java:116)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
org.ops4j.pax.logging.pax-logging-log4j2 [pax-logging] ERROR : 'org.openhab.core.io.websocket.log.LogWebSocket@fe1fea5' is removed as a LogListener since it threw an exception.

So, it seems to me that when the websocket times out, an IllegalStateException is thrown, but sendMessage() only expects IOExceptions, so it's not caught. As a result, it propagates up into the logging framework, which removes the websocket as a LogListener because it threw an exception. This websocket will thus never receive another log entry.

edit: Actually, I think it might be worse. It seems like the logging thread is used to send the event through the websocket, and the logging framework probably has a timeout mechanism on its thread pool that kills "hanging" loggers.
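
A sketch of the defensive shape this implies (method and field names assumed, not the actual sendMessage()): Jetty's RemoteEndpoint.sendString() can throw unchecked exceptions like the IllegalStateException above in addition to IOException, so catching both keeps the failure from escaping into pax-logging.

    import java.io.IOException;

    import org.eclipse.jetty.websocket.api.RemoteEndpoint;

    class SendSketch {
        private final RemoteEndpoint remote;

        SendSketch(RemoteEndpoint remote) {
            this.remote = remote;
        }

        void sendMessage(String message) {
            try {
                remote.sendString(message);
            } catch (IOException | RuntimeException e) {
                // handle locally (e.g. close the session) instead of letting the
                // exception propagate into the logging framework, which would
                // remove this instance as a LogListener
            }
        }
    }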

Nadahar commented Dec 5, 2025

I found yet another problem: If DEBUG logging is on, so that the logging in WrappedScheduledExecutorService is active, flushing the buffer using an executor might trigger a "circle of doom" where each call to the executor triggers a log entry, which then triggers another call to the executor, which basically goes on "forever" until the system crashes.

I think this class must have its own, independent threads so that there is no logging whatsoever. The act of handling a log entry cannot create a new log entry, or you have the "circle of doom". The only way that I can see to make sure of this is to completely control the thread pool. But that might introduce challenges with managing the pool and shutting it down. I'm not sure what controls the lifecycle of a LogWebSocket... is it created by jetty somehow?
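
The "completely control the thread pool" idea could look roughly like this (a minimal sketch with assumed names; lifecycle/shutdown handling is only hinted at): a plain single-threaded executor whose thread factory and tasks never log, so handling a log event cannot generate new log events.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class LogSenderExecutorSketch {

        private final ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "log-websocket-sender"); // name is an assumption
            thread.setDaemon(true);
            return thread;
        });

        void submitSend(Runnable sendTask) {
            executor.execute(sendTask); // the task itself must not log either
        }

        void dispose() {
            executor.shutdownNow(); // must be called by whatever owns the socket's lifecycle
        }
    }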

Nadahar commented Dec 6, 2025

I have addressed the things that I believe can be addressed here:

  • FIFO delivery of log events should be ensured.
  • The blocking calls to send data over the websocket are no longer made on the thread provided by the logging framework.
  • A dedicated (single-threaded) executor is used instead of one provided by ThreadPoolManager, to make sure that no logging is generated by the executor when performing logging tasks (see the sketch after this list).
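
Taken together, the shape of the change is roughly this (assumed names, heavily simplified): the logger thread only enqueues, while the dedicated single-threaded executor drains the queue in FIFO order and performs the blocking send.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class FifoLogDeliverySketch {

        private final Deque<String> queue = new ArrayDeque<>(); // guarded by "this"
        private final ExecutorService sender = Executors.newSingleThreadExecutor();

        /** Called on the logging framework's thread: must be fast and must not block. */
        void logged(String serializedEvent) {
            synchronized (this) {
                queue.addLast(serializedEvent);
            }
            sender.execute(this::drain);
        }

        private void drain() {
            while (true) {
                String next;
                synchronized (this) {
                    next = queue.pollFirst();
                }
                if (next == null) {
                    return;
                }
                send(next); // blocking send, but on the dedicated sender thread, in FIFO order
            }
        }

        private void send(String payload) {
            // websocket remote.sendString(...) with its own error handling
        }
    }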

The heartbeat/reconnect issues must be solved in MainUI AFAICU.

I've given this a fair bit of testing, and the latest iteration has been dead stable for me.


Development

Successfully merging this pull request may close these issues:

LogWebSocket causes thread deadlock which makes OH grind to a halt
