System CPU usage high #24866
Comments
@zzmg Please share logs from the cluster to make this actionable.
The time is approximately 19:55. During this period, aside from INFO logs, the entire cluster only had a dozen WARN logs and no errors. A typical WARN log looks like this:
WARN 2025-01-20 12:06:39,733 [shard 3:main] storage - disk_log_appender.cc:91 - Segment roll lock contested for {kafka/topic_name/11}
This has happened several times. After it happened, our solution was to add a new machine to the cluster and decommission the problematic node. Do you need the EC2 system log? We only collect public metrics, not internal metrics.
This log message does not indicate a problem; no action needs to be taken.
No. The most likely explanation is a change in the incoming workload. The network traffic plot also indicates something changed around this time. Can you look at the kafka_rpc metrics (requests/bytes) and the scheduler metrics (runtime per scheduler group)? If you can share a Grafana snapshot of the full legacy dashboard, that would be easiest.
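If it helps, here is a rough way to pull those two groups of series straight from one broker. This is only a sketch: it assumes the default metrics port 9644 and the internal metrics path /metrics, and it only filters by substring because exact metric names vary between versions.

```sh
# Dump internal metrics from this broker and keep only the kafka_rpc and scheduler series.
# Assumptions: default port 9644, internal metrics served at /metrics.
curl -s http://localhost:9644/metrics | grep -E 'kafka_rpc|scheduler' > broker-metrics.txt

# Quick look at the request/bytes counters and the per-scheduler-group runtime samples.
grep -E 'requests|bytes' broker-metrics.txt | head
grep -i 'runtime' broker-metrics.txt | head
```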
Do you need the cluster config, redpanda.yaml, or the io-config file?
Can you show the reactor utilization broken down by node (like the "rpc request" one you showed)?
Do you have any node-level metrics showing which processes had high CPU usage on the *.137 node?
Can you show me the "steal time" reactor metric for node .137 and any one of the other 4 nodes?
No, sorry, I mean the Redpanda metric. This is not related to VM steal (which is what the node metric reports).
Sorry, vectorized_reactor_cpu_steal_time_ms is an internal metric; we only collect public metrics.
Are there any other metrics I can provide? Also, I have now started collecting the internal metrics.
Steal time, as mentioned (since I understand you have now started collecting internal metrics).
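For reference, one way to spot-check that metric directly on the broker. This is only a sketch; it assumes the default port 9644 and that internal metrics are served at /metrics, and the PromQL label handling is an assumption.

```sh
# Dump the reactor steal-time counter for every shard on this broker.
curl -s http://localhost:9644/metrics | grep vectorized_reactor_cpu_steal_time_ms

# If this series is scraped into Prometheus, a rate such as
#   rate(vectorized_reactor_cpu_steal_time_ms[1m]) / 1000
# would approximate the fraction of wall-clock time stolen per shard.
```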
Does it occur repeatedly? We can wait until it occurs again.
OK, we will wait. Do I need to add or check any other config?
I would prefer to wait. I think one possibility is that something other than Redpanda is stealing CPU from that one broker during that period.
The processes running on the machine include node-exporter, vector, system-level processes, and redpanda.
This really indicates that something that is not Redpanda started running and using system CPU. Just to double-check: when you say "the CPU of the process is not high", that includes both the user + sys CPU times, right?
User CPU is not high, but sys CPU is high. When it happened, we used top, and most of the CPU was used by the redpanda process.
I am confused: what did "the CPU of the process is not high" mean, then?
Do you mean that the high overall (node-wide) system CPU value was roughly equal to the redpanda sys CPU value? Or just that Redpanda was the top process (but not necessarily an equivalence in the numbers)? A dump of the top (or other tool) output could be useful here. Maybe capture kernel stack traces.
Yes, but what I'm saying is that I would like to know whether the OS-provided overall system CPU utilization (say 40% over 10 cores, so 400% if we treat each core as 100%) lines up with the OS-provided CPU utilization for the Redpanda process (it should be close to 400%). That is, the overall user + system CPU utilization is just the sum of user + system utilization across all processes, plus some pure kernel-thread work which may not be associated with any user process.

Looking at reactor utilization can be deceptive in this case because it just shows the % of time Redpanda "wanted" to run, measured in wall-clock time, not CPU utilization. As an example, if Redpanda was running a load that would normally use 30% reactor utilization and nothing else was running on that core, it would show 30% utilization. If something then started to run on that core and the OS split the CPU resources 50/50 between the two processes, reactor utilization would jump to 60%, since 1 out of every 2 cycles is stolen and Redpanda makes progress at half the rate in wall-clock time.

So first we need to assess whether Redpanda itself is using the system CPU: to do this we should have both OS/process-level reports of CPU use and the internal "steal time" metric reported by Redpanda.
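One way to capture the OS-side numbers the next time this happens (a sketch only; it assumes the sysstat and perf packages are installed and that the process is named redpanda):

```sh
# Overall and per-process CPU (user + sys), sampled a few times while the problem is occurring.
top -b -d 5 -n 3 > top-snapshot.txt                 # node-wide view, batch mode
pidstat -u -p "$(pgrep -x redpanda)" 5 3            # user%/system% for the redpanda process
mpstat -P ALL 5 3                                   # per-core usr/sys/steal breakdown

# Sample kernel stacks for ~30s to see where the system CPU time is being spent (needs root).
sudo perf record -a -g -- sleep 30
sudo perf report --stdio > perf-report.txt
```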
Yes, @travisdowns you're absolutely right. We need to know whether the Redpanda process is using the system CPU. When the problem occurred, we used the top command, and it showed that Redpanda was the process using the most resources, occupying around ten CPU cores. However, we didn't take a screenshot at that time, nor did we enable process CPU monitoring.
@travisdowns are you there?
Thanks! Can you show the per-shard data by isolating the query to just the problematic instance (IP *.179) and changing the aggregation condition to
Is it possible to provide the logs for that instance from at least 17:40 to 17:45?
Thanks, so it looks like it occurs across all shards (although to varying extents). From above:
@travisdowns sorry for the long delay in replying; here is the log.
@travisdowns Is it related to the configuration parameters of the __consumer_offsets topic? We haven't optimized this topic because we have abundant disk space. However, it seems that the number of segments has been increasing. The node that experienced abnormal CPU usage last time was the one with the largest number of segments.
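In case it is useful, here is a rough way to check that topic's configuration and count its on-disk segments on the affected broker. This is only a sketch: the rpk flag and the data directory /var/lib/redpanda/data are assumptions based on the defaults and may differ in your deployment.

```sh
# Show the topic's configuration (retention, segment.bytes, cleanup.policy, ...).
rpk topic describe __consumer_offsets --print-configs

# Rough segment count for the topic on this broker (default data directory assumed).
sudo find /var/lib/redpanda/data/kafka/__consumer_offsets -type f -name '*.log' | wc -l
```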
@travisdowns do you have any ideas? Thanks.
Version & Environment
Redpanda version (use `rpk version`): v24.3.2
What went wrong?
After the cluster has been running for a period of time, one machine's system CPU usage suddenly becomes very high. Can you help me?
We use AWS EC2, im4gn.4xlarge × 5.
49 topics.
The topics have a replication factor of 1 (no additional replicas).
The CPU metrics:
[Image: CPU usage metrics screenshot (expired GitHub attachment link)]
This is our topic config. Different topics adjust log.retention and segment.bytes according to their message volume; the remaining parameters are the same.
The node's system CPU usage is very high. At this time, the process CPU is not high, but redpanda_cpu_busy_seconds_total is already at 100%. In addition, producer and consumer throughput drops a lot at this time.
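(A quick way to double-check that reading from the public metrics endpoint; this is only a sketch and assumes the default port 9644, the /public_metrics path, and the label names used in the PromQL comment.)

```sh
# Dump the CPU-busy counter that backs the 100% reading above.
curl -s http://localhost:9644/public_metrics | grep redpanda_cpu_busy_seconds_total

# In Prometheus/Grafana, something like
#   rate(redpanda_cpu_busy_seconds_total[1m])
# broken down by instance/shard would show how the utilization is distributed.
```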
I use the open source version.
Is it related to the fact that the topic replication factor is 1?
What should have happened instead?
How to reproduce the issue?
Additional information
Please attach any relevant logs, backtraces, or metric charts.
JIRA Link: CORE-8858