Core issue:
Pinot is unable to safely ingest or serve queries from the remaining replicas for a prolonged period, because some sort of retry logic degrades controller functionality.
Background:
Recently we saw a controller struggling to process ZKEvents as fast as they were created. This began happening after a server failed to start due to a deadlock condition and was left in that state for a few days. Controller CPU was elevated during this period, and eventually the throughput of callbacks/events became too high for processing to keep up.
It looks like the slow event processing was due to resource starvation, with Helix's ZKEventThread presumably struggling to get scheduled. From our metrics, we see a huge increase in ZK transaction volume (the metric is transaction log size, which is flushed every hour).
Looking at a snapshot of the cluster during this time, it seems likely that the transactions were under the dead server's MESSAGES znode.
For reference, other servers in this cluster have cversions of roughly 200-300k. However, when looking at the messages themselves, I see that the message znodes were created long ago and have not been modified since; it is not yet clear to me which child znodes are actually being modified.
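For anyone who wants to check the same thing on their cluster, here is a minimal sketch of how I inspected this, using the plain ZooKeeper Java client. It assumes Helix's usual per-instance layout of /<cluster>/INSTANCES/<instance>/MESSAGES; the ZK address, cluster name, and instance name below are placeholders, not values from our environment:

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class MessagesZnodeInspector {
  public static void main(String[] args) throws Exception {
    // Placeholders: adjust to your ZK address, cluster name, and the dead server's instance name.
    String zkAddress = "localhost:2181";
    String messagesPath = "/PinotCluster/INSTANCES/Server_pinot-server-0_8098/MESSAGES";

    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper(zkAddress, 30_000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // cversion on the parent increments for every child create/delete ever made under it,
    // so a very large cversion here indicates heavy message churn for this instance.
    Stat parentStat = zk.exists(messagesPath, false);
    System.out.println("MESSAGES cversion = " + parentStat.getCversion()
        + ", numChildren = " + parentStat.getNumChildren());

    // For each pending message znode, print when it was created and last modified.
    List<String> children = zk.getChildren(messagesPath, false);
    long now = System.currentTimeMillis();
    for (String child : children) {
      Stat stat = new Stat();
      zk.getData(messagesPath + "/" + child, false, stat);
      System.out.printf("%s ctime=%d (age %d min) mtime=%d%n",
          child, stat.getCtime(), (now - stat.getCtime()) / 60_000, stat.getMtime());
    }
    zk.close();
  }
}
```

This is what showed me the pattern above: the parent's cversion kept climbing while the individual message znodes had old ctimes and unchanged mtimes.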
Another way to phrase the issue: failed messages continue to put load on the controller and ZK even after they have failed.
One note about the cluster/table setup: we use minion for upsert compaction, which generates many more messages than is typical for a realtime table of this size.
Has anyone seen something similar? I haven't yet walked through the relevant Helix code. The end goal of raising this issue is to understand how we can prevent a dead server from causing such a large load increase on the controller and ZK.