System CPU usage high #24866

Open
zzmg opened this issue Jan 21, 2025 · 36 comments
Labels
kind/bug Something isn't working

Comments

@zzmg

zzmg commented Jan 21, 2025

Version & Environment

Redpanda version: (use rpk version):
V24.3.2

What went wrong?

After the cluster has been running for a while, the system CPU usage on one machine suddenly becomes very high. Can you help me?

We use AWS EC2, with 5 × im4gn.4xlarge instances
49 topics
The topics have a replication factor of 1

The CPU metrics:
Image

This is our topic config. Different topics adjust log.retention and segment.bytes according to message volume; the remaining parameters are the same.

Image

The system CPU usage on the node is very high. At this time, the CPU of the process is not high, but redpanda_cpu_busy_seconds_total is already at 100%. In addition, producer and consumer throughput drops a lot during this period.

I use the open source version.

Is it related to the fact that the replica of the topic is 1?

Image Image

What should have happened instead?

How to reproduce the issue?

Additional information

Please attach any relevant logs, backtraces, or metric charts.

JIRA Link: CORE-8858

@piyushredpanda
Contributor

@zzmg Please share logs from the cluster to make this actionable.

@zzmg
Author

zzmg commented Jan 21, 2025

Please share logs from the cluster to make this actionable.

The incident occurred at approximately 19:55. During this period, apart from INFO logs, the entire cluster had only about a dozen WARN logs and no errors.

The WARN logs look like this: WARN 2025-01-20 12:06:39,733 [shard 3:main] storage - disk_log_appender.cc:91 - Segment roll lock contested for {kafka/topic_name/11}

This has happened several times. After it happened, our solution was to add a new machine to the cluster and decommission the problematic node.

Do you need the EC2 system log?

We only collect public metrics, not internal metrics.

@travisdowns
Member

travisdowns commented Jan 21, 2025

The WARN logs look like this: WARN 2025-01-20 12:06:39,733 [shard 3:main] storage - disk_log_appender.cc:91 - Segment roll lock contested for {kafka/topic_name/11}

This log message does not indicate a problem, no action needs to be taken.

Is it related to the fact that the replica of the topic is 1?

No.

The most likely explanation is a change in the incoming workload. The network traffic plot also indicates something changed around this time.

Can you look at the kafka_rpc metrics (request/bytes) and the scheduler metrics (runtime per scheduler group)?

If you can share a grafana snapshot of the full legacy dashboard it would be easiest.
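
Something along these lines against the internal metrics endpoint is what I have in mind; the metric names below are from memory and may differ slightly in your version, so treat them as assumptions:

    # runtime per scheduler group, per node (assumed internal metric name)
    sum by (instance, group) (rate(vectorized_scheduler_runtime_ms[5m]))

    # kafka rpc requests and bytes, per node (assumed internal metric names)
    sum by (instance) (rate(vectorized_kafka_rpc_requests_completed[5m]))
    sum by (instance) (rate(vectorized_kafka_rpc_received_bytes[5m]))
    sum by (instance) (rate(vectorized_kafka_rpc_sent_bytes[5m]))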

@zzmg
Author

zzmg commented Jan 21, 2025

runtime per scheduler group:
Image
From the runtime-per-scheduler-group monitoring, the point at which the main group rises coincides with the point at which system CPU usage spikes.

kafka_rpc request:
Image

Image

@zzmg
Author

zzmg commented Jan 21, 2025

Do you need the cluster config, redpanda.yaml, or the IO properties file?

@travisdowns
Member

Can you show the reactor utilization broken down by node (like the "rpc request" one you showed)?

@travisdowns
Member

Do you have any node-level metrics showing which processes had high CPU usage on the *.137 node?

@travisdowns
Member

travisdowns commented Jan 21, 2025

Can you show me the "steal time" reactor metric for node .137 and any one of the other 4 nodes?

@zzmg
Author

zzmg commented Jan 21, 2025

Can you show the reactor utilization broken down by node (like the "rpc request" one you showed)?

Image

@zzmg
Author

zzmg commented Jan 21, 2025

steal time

sum by(instance) (rate(node_cpu_seconds_total{instance="$node",job="$job", mode="steal"}[$__rate_interval])) / on(instance) group_left sum by (instance)((rate(node_cpu_seconds_total{instance="$node",job="$job"}[$__rate_interval])))
Is this the right metric?
This is the system CPU for *.137:
Image

steal time:

Image

For the other nodes, steal time is always 0.

@travisdowns
Member

travisdowns commented Jan 21, 2025

No, sorry, I mean the Redpanda metric vectorized_reactor_cpu_steal_time_ms, maybe plotted alongside vectorized_reactor_cpu_busy_ms, using rate.

This is not related to VM steal (which is what the node metric reports).
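
As a sketch of what I mean (both metrics are millisecond counters, so rate() gives ms per second and dividing by 1000 gives an approximate fraction of a core; the instance/shard labels are assumed to be the usual ones on the internal endpoint):

    # wall-clock fraction "stolen" from the reactor, per node and shard
    sum by (instance, shard) (rate(vectorized_reactor_cpu_steal_time_ms[5m])) / 1000

    # reactor busy fraction, for comparison
    sum by (instance, shard) (rate(vectorized_reactor_cpu_busy_ms[5m])) / 1000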

@zzmg
Author

zzmg commented Jan 21, 2025

vectorized_reactor_cpu_steal_time_ms

Sorry, vectorized_reactor_cpu_steal_time_ms is an internal metric; we only collect public metrics.

@zzmg
Author

zzmg commented Jan 22, 2025

Are there any other metrics I can provide? Also, I am now collecting the internal metrics.

@travisdowns
Member

Also, I am now collecting the internal metrics.

Steal time, as mentioned (since I understand you are now collecting the internal metrics).

@zzmg
Author

zzmg commented Jan 24, 2025

But steal time from when the problem occurred was not collected; I can only provide the current values.

Image

@travisdowns
Member

Does it occur repeatedly? We can wait until it occurs again.

@zzmg
Author

zzmg commented Jan 24, 2025

OK, we'll wait. Do I need to add or check any other config?

@travisdowns
Member

I would prefer to wait. I think one possibility is that something other than Redpanda is stealing CPU from that one broker during that period.

@zzmg
Author

zzmg commented Jan 24, 2025

The processes running on the machine include node-exporter, vector, system-level processes, and redpanda.

@travisdowns
Member

The system CPU usage on the node is very high. At this time, the CPU of the process is not high, but redpanda_cpu_busy_seconds_total is already at 100%. In addition, producer and consumer throughput drops a lot during this period.

This really indicates that something starts running using system CPU, which is not Redpanda. Just to double check, when you say "the CPU of the process is not high", does that include both user and sys CPU time?

@zzmg
Author

zzmg commented Jan 24, 2025

User CPU is not high, but sys CPU is high. When it happened, we used top, and most of the CPU was used by the redpanda process.

@travisdowns
Member

I am confused: what did "the CPU of the process is not high" mean, then?

but sys CPU is high. When it happened, we used top, and most of the CPU was used by the redpanda process

Do you mean that the high overall (node-wide) system CPU value was roughly equal to the redpanda sys CPU value? Or just that Redpanda was the top process (but not necessarily equal in numbers)?

A dump of top or other tool output could be useful here. Maybe capture kernel stacktraces.

@zzmg
Author

zzmg commented Jan 24, 2025

Redpanda was the top process at that time.

From the node-level metrics, system CPU usage is high.
expr : sum by(instance) (irate(node_cpu_seconds_total{instance="$node",job="$job", mode="system"}[$__rate_interval])) / on(instance) group_left sum by (instance)((irate(node_cpu_seconds_total{instance="$node",job="$job"}[$__rate_interval])))

Image

From the node-level metrics, user CPU usage is not high.
expr: sum by(instance) (irate(node_cpu_seconds_total{instance="$node",job="$job", mode="user"}[$__rate_interval])) / on(instance) group_left sum by (instance)((irate(node_cpu_seconds_total{instance="$node",job="$job"}[$__rate_interval])))

Image

From the Redpanda-level metrics:
expr:
avg by([[aggr_criteria]]) (deriv(redpanda_cpu_busy_seconds_total{instance="[[node]]",shard="[[node_shard]]", job=~"$redpanda_cluster.*"}[3m]))

Image

@travisdowns
Member

travisdowns commented Jan 24, 2025

Yes, but what I'm saying is that I would like to know whether the OS-provided overall system CPU utilization (say 40% over 10 cores, so 400% if we treat each core as 100%) lines up with the OS-provided CPU utilization for the Redpanda process (it should be close to 400%). That is, the overall user + system CPU utilization is just a sum of user + system utilization across all processes, plus some pure kernel thread work which may not be associated with any user process.

Looking at reactor utilization can be deceptive in this case because it just shows the % of time redpanda "wanted" to run measured in wall clock time, not CPU utilization. As an example, if Redpanda was running a load that would normally use 30% reactor utilization and nothing else was running on that core, it would be at 30% utilization. If something then started to run on that core and the OS split the CPU resources 50/50 between the two processes then reactor utilization would jump to 60% since 1 out of 2 cycles are stolen, and it's making progress at half the rate in wall-clock time.

So first we need to assess whether Redpanda is using the system CPU: to do this we should have both the OS/process-level reports of CPU use and the internal "steal time" metric reported by Redpanda.
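
As a rough sketch of that comparison (reusing the node_cpu_seconds_total queries you already have plus the internal steal metric; node_exporter has no per-process CPU breakdown, so the Redpanda-process side still needs top or pidstat output):

    # node-wide system CPU, as a fraction of all cores
    sum by (instance) (rate(node_cpu_seconds_total{mode="system"}[5m]))
      / sum by (instance) (rate(node_cpu_seconds_total[5m]))

    # cores' worth of CPU "stolen" from the Redpanda reactors, per node
    sum by (instance) (rate(vectorized_reactor_cpu_steal_time_ms[5m])) / 1000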

@zzmg
Author

zzmg commented Jan 25, 2025

Yes, @travisdowns you're absolutely right. We need to know whether the Redpanda process is using the system CPU. When the problem occurred, we used the top command, and it showed that Redpanda was the process using the most resources, occupying around ten CPU cores. However, we didn't take a screenshot at that time, nor did we enable process CPU monitoring.
Currently, I've collected the internal metrics. Next time the problem arises, we'll save the top information on the machine. Besides this, what else do I need to do?

@zzmg
Author

zzmg commented Jan 27, 2025

@travisdowns are you there?

@zzmg
Author

zzmg commented Jan 27, 2025

It happened again:

Image

@zzmg
Author

zzmg commented Jan 27, 2025

top info:
Image

@zzmg
Author

zzmg commented Jan 27, 2025

Image Image

@travisdowns
Member

Thanks! Can you show the per-shard data by isolating the query to just the problematic instance (IP *.179) and changing the aggregation condition to (instance, shard)?
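
For concreteness, a sketch of the modified query; the instance selector is a placeholder, so substitute whatever label value your scrape uses for the *.179 broker:

    avg by (instance, shard) (deriv(redpanda_cpu_busy_seconds_total{instance="<address-of-.179>", job=~"$redpanda_cluster.*"}[3m]))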

@travisdowns
Member

Is it possible to provide the logs for that instance from at least 17:40 to 17:45?

@zzmg
Author

zzmg commented Jan 27, 2025

Thanks! Can you show the per-shard data by isolating the query to just the problematic instance (IP *.179) and changing the aggregation condition to (instance, shard)?

Image Image Image

@travisdowns
Member

Thanks, so it looks like it occurs across all shards (although to varying extents).

From above:

Is it possible to provide the logs for that instance from at least 17:40 to 17:45?

@zzmg
Author

zzmg commented Feb 3, 2025

@travisdowns Sorry for the long delay in replying; here is the log.
redpanda.log
Thank you very much for taking the time to help us troubleshoot the issue.

@zzmg
Author

zzmg commented Feb 7, 2025

@travisdowns Is it related to the configuration parameters of the __consumer_offsets topic? We haven't optimized this topic because we have abundant disk space. However, it seems that the number of segments has been increasing. The node that experienced abnormal CPU usage last time was the one with the largest number of segments.
Image

@zzmg
Author

zzmg commented Feb 12, 2025

@travisdowns do you have any ideas? Thanks.
