
websocket disconnections #66

Merged

rgarcia merged 1 commit into main from raf/kernel-302-websocket-disconnections on Sep 10, 2025

Conversation

@rgarcia (Contributor) commented Sep 10, 2025

  • changed run-unikernel scripts to match resources used in prod
  • added an env var flag LOG_CDP_MESSAGES that logs a condensed form of each CDP message
  • removed the heartbeat ping. It turns out browser use sometimes asks a *lot* of the CDP connection when it first begins: it pulls the accessibility tree plus the full document, which can be many MBs. We were seeing this ping time out during this initial startup, causing us to close the connection.
  • adjusted the websocket dialer and connection to be very lenient
  • changed the uk-check-stats script to compute a real CPU % and a KB/s network rate

todo:

  • test with API in the middle
  • test keeping the ping but not killing the connection if it times out
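
A minimal sketch of what that second todo could look like with gorilla/websocket: keep pinging for liveness visibility, but only log an overdue pong instead of closing the connection. The helper name, interval, and 2x-overdue threshold are assumptions, not code from this PR:

package main

import (
    "log"
    "sync/atomic"
    "time"

    "github.com/gorilla/websocket"
)

// startLenientHeartbeat pings periodically but treats a slow or missing
// pong as a log line, not a reason to close the connection, since heavy
// CDP startup traffic can legitimately delay pongs. Note that gorilla only
// invokes the pong handler while the application is reading from the
// connection, which a proxy's read loop normally guarantees.
func startLenientHeartbeat(conn *websocket.Conn, pingInterval time.Duration) {
    var lastPong atomic.Int64
    lastPong.Store(time.Now().UnixNano())
    conn.SetPongHandler(func(string) error {
        lastPong.Store(time.Now().UnixNano())
        return nil
    })
    go func() {
        ticker := time.NewTicker(pingInterval)
        defer ticker.Stop()
        for range ticker.C {
            // A failed ping write is logged, never treated as fatal.
            if err := conn.WriteControl(websocket.PingMessage, nil, time.Now().Add(pingInterval)); err != nil {
                log.Printf("heartbeat ping failed (connection kept open): %v", err)
            }
            if time.Since(time.Unix(0, lastPong.Load())) > 2*pingInterval {
                log.Printf("heartbeat pong overdue (connection kept open)")
            }
        }
    }()
}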

TL;DR

Fixes websocket disconnections by removing the aggressive heartbeat ping that was causing timeouts during heavy initial browser load.

Why we made these changes

During initial startup, the browser's Chrome DevTools Protocol (CDP) connection can be saturated with large amounts of data (e.g., accessibility tree, full document), which can be many megabytes. The existing heartbeat ping was timing out during this heavy load, causing us to prematurely close valid connections. These changes make the connection more lenient and robust, especially during startup.

What changed?

  • Websocket Stability: Removed the custom heartbeat ping logic in server/lib/devtoolsproxy/proxy.go to prevent timeouts. Adjusted the websocket dialer settings to be more lenient, with larger buffers, compression, and no deadlines (see the sketch after this list).
  • Debugging & Observability:
    • Added a LOG_CDP_MESSAGES environment variable to log a condensed form of each CDP message for easier debugging.
    • Enhanced the shared/uk-check-stats.sh script to compute and display a real CPU percentage and network throughput in KB/s.
  • Configuration:
    • Updated run-unikernel.sh scripts for both chromium-headful (4 vCPUs) and chromium-headless (1 vCPU) to align resource usage with production environments.
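
As referenced above, a minimal sketch of the lenient dialer settings, assuming gorilla/websocket. dialCDP is a hypothetical helper, and the buffer sizes are an assumption based on the "16x bump from the default" figure in the review thread below, not the actual proxy.go values:

package main

import (
    "time"

    "github.com/gorilla/websocket"
)

// dialCDP dials a CDP endpoint with lenient settings: large buffers,
// compression, no read/write deadlines, and an effectively unbounded
// message size.
func dialCDP(cdpURL string) (*websocket.Conn, error) {
    dialer := &websocket.Dialer{
        ReadBufferSize:    64 * 1024, // 16x gorilla's 4096-byte default
        WriteBufferSize:   64 * 1024,
        EnableCompression: true,
    }
    conn, _, err := dialer.Dial(cdpURL, nil)
    if err != nil {
        return nil, err
    }
    conn.SetReadDeadline(time.Time{})    // no timeout--hold on to connections for dear life
    conn.SetWriteDeadline(time.Time{})   // no timeout
    conn.SetReadLimit(100 * 1024 * 1024) // 100 MB, effectively no maximum message size
    return conn, nil
}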

Description generated by Mesa.

@rgarcia rgarcia requested a review from Sayan- September 10, 2025 18:41
cursor[bot]'s comment was marked as outdated.


@mesa-dot-dev (bot) left a comment


Performed full review of 330685e...cd0123c

This review was generated by Mesa.

9 files reviewed | 3 comments

-e LOG_CDP_MESSAGES=true \
-p 9222:9222/tls \
-p 444:10001/tls \
--vcpus 2


Medium Logic

There's a conflicting --vcpus configuration. Line 18 sets --vcpus 1 but line 24 sets --vcpus 2. The last flag wins, so this instance will actually use 2 vCPUs, not 1 as the first setting intended.

clientConn.SetReadDeadline(time.Time{}) // No timeout--hold on to connections for dear life
clientConn.SetWriteDeadline(time.Time{}) // No timeout--hold on to connections for dear life
clientConn.SetReadLimit(100 * 1024 * 1024) // 100 MB. Effectively no maximum size of message from client
clientConn.EnableWriteCompression(true)


Medium Security

Setting ReadLimit to 100 MB could expose the system to memory-exhaustion attacks if a malicious client sends very large messages. Consider whether this limit is truly necessary or whether a more conservative limit (e.g., 10 MB) would still handle legitimate CDP use cases.
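
For reference, a sketch of gorilla/websocket's documented behavior at the limit: when a message exceeds SetReadLimit, the library sends a close frame to the peer and the read returns ErrReadLimit, so a conservative cap fails loudly rather than exhausting memory. The 10 MB figure is the reviewer's suggestion, and handle is a hypothetical consumer:

package main

import (
    "errors"
    "log"

    "github.com/gorilla/websocket"
)

// readWithCap reads messages under a conservative size cap.
func readWithCap(conn *websocket.Conn, handle func([]byte)) {
    conn.SetReadLimit(10 * 1024 * 1024) // 10 MB, the reviewer's suggested cap
    for {
        _, msg, err := conn.ReadMessage()
        if err != nil {
            // An oversized message surfaces here as ErrReadLimit after
            // gorilla has already initiated the close handshake.
            if errors.Is(err, websocket.ErrReadLimit) {
                log.Printf("client message exceeded read limit: %v", err)
            }
            return
        }
        handle(msg)
    }
}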


// Extract fields using regex from raw message
rawMsg := string(msg)



Medium Logic

The regex-based JSON parsing in logCDPMessage is fragile and could fail with escaped quotes, nested objects, or malformed JSON. Consider using json.Unmarshal with a simple struct that captures the fields you need, which would be more robust and performant.
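
A minimal sketch of the suggested json.Unmarshal approach; the envelope fields (id, method, sessionId) are an assumption about what a condensed CDP log line needs, not the actual logCDPMessage implementation:

package main

import (
    "encoding/json"
    "log"
)

// cdpEnvelope captures only the fields worth logging; json.Unmarshal
// ignores everything else in the message, including nested params.
type cdpEnvelope struct {
    ID        int    `json:"id"`
    Method    string `json:"method"`
    SessionID string `json:"sessionId"`
}

func logCDPMessage(msg []byte) {
    var env cdpEnvelope
    if err := json.Unmarshal(msg, &env); err != nil {
        log.Printf("cdp: unparseable message (%d bytes)", len(msg))
        return
    }
    log.Printf("cdp: id=%d method=%s session=%s len=%d", env.ID, env.Method, env.SessionID, len(msg))
}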

@Sayan- (Contributor) left a comment


Overall lgtm! A few questions I had prior to merging:

  1. How'd we decide the buffer sizes? What's the behavior when those buffers get full?
  2. Any concerns about additional resource usage from enabling compression? Are there any other changes we need to make to enable it? I'd expect it to just work with Chromium, but it's unclear with our intermediate hops. From the docs I see:

This function is a noop if compression was not negotiated with the peer.

@rgarcia (Contributor, Author) commented Sep 10, 2025

  1. From my reading, it seems the behavior when those buffers get filled is network lag: the buffer fills up and the connection stops reading data. Making them big hopefully accommodates big spikes in network activity with less lag.
  2. I think the overhead of compression is pretty minimal, but yes, it depends on the client asking for it (in our case, the API). I mainly enabled it to minimize stress on the network. I was not seeing the CPU limit (4 vCPUs) getting hit in my observations of uk-check-stats output.
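
A minimal sketch of that negotiation point: permessage-deflate only takes effect if both sides opt in during the handshake, so the client hop (in our case, the API) has to request it too. Both values here are illustrative, not the proxy's actual configuration:

package main

import "github.com/gorilla/websocket"

// The server side advertises permessage-deflate support during the upgrade...
var upgrader = websocket.Upgrader{
    EnableCompression: true,
}

// ...and the dialing side must request it as well; otherwise
// EnableWriteCompression on the resulting *websocket.Conn is a noop,
// exactly as the quoted docs say.
var dialer = websocket.Dialer{
    EnableCompression: true,
}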

@Sayan- (Contributor) left a comment


== to both, thanks for confirming! I haven't read the gorilla implementation deeply enough to confirm whether it would drop packets or eventually terminate the connection when the buffer gets filled. Cross-checking their current code, we're at a 16x bump from the default, which I think we can continue to bump without too many worries (https://github.com/gorilla/websocket/blob/main/conn.go#L36-L37)

@rgarcia rgarcia force-pushed the raf/kernel-302-websocket-disconnections branch from 7cdd3f8 to 9aa81ef on September 10, 2025 20:11
@rgarcia rgarcia merged commit 577a24a into main Sep 10, 2025
6 of 7 checks passed
@rgarcia rgarcia deleted the raf/kernel-302-websocket-disconnections branch September 10, 2025 22:55