System CPU usage high #24866

Open
zzmg opened this issue Jan 21, 2025 · 36 comments
Labels
kind/bug Something isn't working

Comments

@zzmg

zzmg commented Jan 21, 2025

Version & Environment

Redpanda version: (use rpk version):
V24.3.2

What went wrong?

After the cluster has been running for a while, the system CPU usage on one machine suddenly becomes very high. Can you help me?

We use AWS EC2, with 5 × im4gn.4xlarge instances
49 topics
The topics have a replication factor of 1

The CPU metrics:
Image

This is our topic config. Different topics adjust log.retention and segment.bytes according to message volume; the remaining parameters are the same.

Image

The system CPU usage on the node is very high. At this time, the CPU of the process is not high, but redpanda_cpu_busy_seconds_total is already at 100%. In addition, producer and consumer throughput drops a lot during this period.

I use the open source version.

Is it related to the fact that the replica of the topic is 1?

Image Image

What should have happened instead?

How to reproduce the issue?

Additional information

Please attach any relevant logs, backtraces, or metric charts.

JIRA Link: CORE-8858

@piyushredpanda
Contributor

@zzmg Please share logs from the cluster to make this actionable.

@zzmg
Author

zzmg commented Jan 21, 2025

Please share logs from the cluster to make this actionable.

The incident occurred at approximately 19:55. During this period, apart from INFO logs, the entire cluster had only about a dozen WARN logs and no errors.

The WARN logs look like this: WARN 2025-01-20 12:06:39,733 [shard 3:main] storage - disk_log_appender.cc:91 - Segment roll lock contested for {kafka/topic_name/11}

This has happened several times. After it happened, our solution was to add a new machine to the cluster and decommission the problematic node.

Do you need the EC2 system log?

We only collect public metrics, not internal metrics.

@travisdowns
Member

travisdowns commented Jan 21, 2025

The WARN logs look like this: WARN 2025-01-20 12:06:39,733 [shard 3:main] storage - disk_log_appender.cc:91 - Segment roll lock contested for {kafka/topic_name/11}

This log message does not indicate a problem, no action needs to be taken.

Is it related to the fact that the replica of the topic is 1?

No.

The most likely explanation is a change in the incoming workload. The network traffic plot also indicates something changed around this time.

Can you look at the kafka_rpc metrics (request/bytes) and the scheduler metrics (runtime per scheduler group)?

If you can share a grafana snapshot of the full legacy dashboard it would be easiest.
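
Something along these lines against the internal metrics endpoint is what I have in mind; the metric names below are from memory and may differ slightly in your version, so treat them as assumptions:

    # runtime per scheduler group, per node (assumed internal metric name)
    sum by (instance, group) (rate(vectorized_scheduler_runtime_ms[5m]))

    # kafka rpc requests and bytes, per node (assumed internal metric names)
    sum by (instance) (rate(vectorized_kafka_rpc_requests_completed[5m]))
    sum by (instance) (rate(vectorized_kafka_rpc_received_bytes[5m]))
    sum by (instance) (rate(vectorized_kafka_rpc_sent_bytes[5m]))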

@zzmg
Author

zzmg commented Jan 21, 2025

runtime per scheduler group:
Image
From the runtime-per-scheduler-group monitoring, the point at which the main group rises coincides with the point at which system CPU usage spikes.

kafka_rpc request:
Image

Image

@zzmg
Author

zzmg commented Jan 21, 2025

Do you need the cluster config, redpanda.yaml, or the IO properties file?

@travisdowns
Member

Can you show the reactor utilization broken down by node (like the "rpc request" one you showed)?

@travisdowns
Member

Do you have any node-level metrics showing which processes had high CPU usage on the *.137 node?

@travisdowns
Member

travisdowns commented Jan 21, 2025

Can you show me the "steal time" reactor metric for node .137 and any one of the other 4 nodes?

@zzmg
Author

zzmg commented Jan 21, 2025

Can you show the reactor utilization broken down by node (like the "rpc request" one you showed)?

Image

@zzmg
Author

zzmg commented Jan 21, 2025

steal time

sum by(instance) (rate(node_cpu_seconds_total{instance="$node",job="$job", mode="steal"}[$__rate_interval])) / on(instance) group_left sum by (instance)((rate(node_cpu_seconds_total{instance="$node",job="$job"}[$__rate_interval])))
Is this the right metric?
This is the system CPU for *.137:
Image

steal time:

Image

For the other nodes, steal time is always 0.

@travisdowns
Member

travisdowns commented Jan 21, 2025

No, sorry, I mean the Redpanda metric vectorized_reactor_cpu_steal_time_ms, maybe plotted alongside vectorized_reactor_cpu_busy_ms, using rate.

This is not related to VM steal (which is what the node metric reports).
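
As a sketch of what I mean (both metrics are millisecond counters, so rate() gives ms per second and dividing by 1000 gives an approximate fraction of a core; the instance/shard labels are assumed to be the usual ones on the internal endpoint):

    # wall-clock fraction "stolen" from the reactor, per node and shard
    sum by (instance, shard) (rate(vectorized_reactor_cpu_steal_time_ms[5m])) / 1000

    # reactor busy fraction, for comparison
    sum by (instance, shard) (rate(vectorized_reactor_cpu_busy_ms[5m])) / 1000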

@zzmg
Author

zzmg commented Jan 21, 2025

vectorized_reactor_cpu_steal_time_ms

Sorry, vectorized_reactor_cpu_steal_time_ms is an internal metric; we only collect public metrics.

@zzmg
Author

zzmg commented Jan 22, 2025

Are there any other metrics I can provide? Also, I am now collecting the internal metrics.

@travisdowns
Member

Also, I am now collecting the internal metrics.

Steal time, as mentioned (since I understand you are now collecting the internal metrics).

@zzmg
Author

zzmg commented Jan 24, 2025

But steal time from when the problem occurred was not collected; I can only provide the current values.

Image

@travisdowns
Member

Does it occur repeatedly? We can wait until it occurs again.

@zzmg
Author

zzmg commented Jan 24, 2025

OK, we'll wait. Do I need to add or check any other config?

@travisdowns
Member

I would prefer to wait. I think one possibility is that something other than Redpanda is stealing CPU from that one broker during that period.

@zzmg
Author

zzmg commented Jan 24, 2025

The processes running on the machine include node-exporter, vector, system-level processes, and redpanda.

@travisdowns
Member

The system CPU usage on the node is very high. At this time, the CPU of the process is not high, but redpanda_cpu_busy_seconds_total is already at 100%. In addition, producer and consumer throughput drops a lot during this period.

This really indicates that something starts running using system CPU, which is not Redpanda. Just to double check, when you say "the CPU of the process is not high", does that include both user and sys CPU time?

@zzmg
Author

zzmg commented Jan 24, 2025

User CPU is not high, but sys CPU is high. When it happened, we used top, and most of the CPU was used by the redpanda process.

@travisdowns
Member

I am confused: what did "the CPU of the process is not high" mean, then?

but sys CPU is high. When it happened, we used top, and most of the CPU was used by the redpanda process

Do you mean that the high overall (node-wide) system CPU value was roughly equal to the redpanda sys CPU value? Or just that Redpanda was the top process (but not necessarily equal in numbers)?

A dump of top or other tool output could be useful here. Maybe capture kernel stacktraces.

@zzmg
Author

zzmg commented Jan 24, 2025

Redpanda was the top process at that time.

From the node-level metrics, system CPU usage is high.
expr : sum by(instance) (irate(node_cpu_seconds_total{instance="$node",job="$job", mode="system"}[$__rate_interval])) / on(instance) group_left sum by (instance)((irate(node_cpu_seconds_total{instance="$node",job="$job"}[$__rate_interval])))

Image

From the node-level metrics, user CPU usage is not high.
expr: sum by(instance) (irate(node_cpu_seconds_total{instance="$node",job="$job", mode="user"}[$__rate_interval])) / on(instance) group_left sum by (instance)((irate(node_cpu_seconds_total{instance="$node",job="$job"}[$__rate_interval])))

Image

From the Redpanda-level metrics:
expr:
avg by([[aggr_criteria]]) (deriv(redpanda_cpu_busy_seconds_total{instance="[[node]]",shard="[[node_shard]]", job=~"$redpanda_cluster.*"}[3m]))

Image

@travisdowns
Member

travisdowns commented Jan 24, 2025

Yes, but what I'm saying is that I would like to know whether the OS-provided overall system CPU utilization (say 40% over 10 cores, so 400% if we treat each core as 100%) lines up with the OS-provided CPU utilization for the Redpanda process (it should be close to 400%). That is, the overall user + system CPU utilization is just a sum of user + system utilization across all processes, plus some pure kernel thread work which may not be associated with any user process.

Looking at reactor utilization can be deceptive in this case because it just shows the % of time redpanda "wanted" to run measured in wall clock time, not CPU utilization. As an example, if Redpanda was running a load that would normally use 30% reactor utilization and nothing else was running on that core, it would be at 30% utilization. If something then started to run on that core and the OS split the CPU resources 50/50 between the two processes then reactor utilization would jump to 60% since 1 out of 2 cycles are stolen, and it's making progress at half the rate in wall-clock time.

So first we need to assess whether Redpanda is using the system CPU: to do this we should have both the OS/process-level reports of CPU use and the internal "steal time" metric reported by Redpanda.
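
As a rough sketch of that comparison (reusing the node_cpu_seconds_total queries you already have plus the internal steal metric; node_exporter has no per-process CPU breakdown, so the Redpanda-process side still needs top or pidstat output):

    # node-wide system CPU, as a fraction of all cores
    sum by (instance) (rate(node_cpu_seconds_total{mode="system"}[5m]))
      / sum by (instance) (rate(node_cpu_seconds_total[5m]))

    # cores' worth of CPU "stolen" from the Redpanda reactors, per node
    sum by (instance) (rate(vectorized_reactor_cpu_steal_time_ms[5m])) / 1000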

@zzmg
Author

zzmg commented Jan 25, 2025

Yes, @travisdowns you're absolutely right. We need to know whether the Redpanda process is using the system CPU. When the problem occurred, we used the top command, and it showed that Redpanda was the process using the most resources, occupying around ten CPU cores. However, we didn't take a screenshot at that time, nor did we enable process CPU monitoring.
Currently, I've collected the internal metrics. Next time the problem arises, we'll save the top information on the machine. Besides this, what else do I need to do?

@zzmg
Author

zzmg commented Jan 27, 2025

@travisdowns are you there?

@zzmg
Author

zzmg commented Jan 27, 2025

It happened again:

Image

@zzmg
Author

zzmg commented Jan 27, 2025

top info:
Image

@zzmg
Author

zzmg commented Jan 27, 2025

Image Image

@travisdowns
Member

Thanks! Can you show the per-shard data by isolating the query to just the problematic instance (IP *.179) and changing the aggregation condition to (instance, shard)?
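
For concreteness, a sketch of the modified query; the instance selector is a placeholder, so substitute whatever label value your scrape uses for the *.179 broker:

    avg by (instance, shard) (deriv(redpanda_cpu_busy_seconds_total{instance="<address-of-.179>", job=~"$redpanda_cluster.*"}[3m]))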

@travisdowns
Member

Is it possible to provide the logs for that instance from at least 17:40 to 17:45?

@zzmg
Author

zzmg commented Jan 27, 2025

Thanks! Can you show the per-shard data by isolating the query to just the problematic instance (IP *.179) and changing the aggregation condition to (instance, shard)?

Image Image Image

@travisdowns
Member

Thanks, so it looks like it occurs across all shards (although to varying extents).

From above:

Is it possible to provide the logs for that instance from at least 17:40 to 17:45?

@zzmg
Author

zzmg commented Feb 3, 2025

@travisdowns Sorry for the long delay in replying; here is the log.
redpanda.log
Thank you very much for taking the time to help us troubleshoot the issue.

@zzmg
Author

zzmg commented Feb 7, 2025

@travisdowns Is it related to the configuration parameters of the __consumer_offsets topic? We haven't optimized this topic because we have abundant disk space. However, it seems that the number of segments has been increasing. The node that experienced abnormal CPU usage last time was the one with the largest number of segments.
Image

@zzmg
Author

zzmg commented Feb 12, 2025

@travisdowns do you have any ideas? Thanks.
