Commit 304ad28

docs: add documentation about setting up alarms
Signed-off-by: Guillaume <[email protected]>
1 parent 9f64a10 commit 304ad28

1 file changed

+218
-0
lines changed

doc/content/xapi/alarms/index.md

Lines changed: 218 additions & 0 deletions
@@ -0,0 +1,218 @@
+++
title = "How to set up alarms"
linkTitle = "Alarms"
+++

# Introduction

In XAPI, alarms are triggered by a Python daemon located at `/opt/xensource/bin/perfmon`.
The daemon is managed as a systemd service and can be configured by setting parameters in `/etc/sysconfig/perfmon`.

It listens on an internal Unix socket to receive commands. Otherwise, it runs in a loop, periodically requesting metrics from XAPI. It can then be configured to generate events based on these metrics. It can monitor various types of XAPI objects, including `VMs`, `SRs`, and `Hosts`. The configuration for each object is defined by writing an XML string into the `perfmon` key of the object's `other-config` map.

The metrics used by `perfmon` are collected by the `xcp-rrdd` daemon. The `xcp-rrdd` daemon is a component of XAPI responsible for collecting metrics and storing them as Round-Robin Databases (RRDs).

A XAPI plugin also exists, providing the functions `refresh` and `debug_mem`, which send commands through the Unix socket. The `refresh` function is used when an `other-config` key is added or updated; it triggers the daemon to reread the monitored objects so that new alerts are taken into account. The `debug_mem` function logs the objects currently being monitored into `/var/log/user.log` as a dictionary.

# Monitoring and alarms

## Overview

- To get the metrics, `perfmon` queries XAPI by calling: `http://localhost/rrd_updates?session_id=<ref>&start=1759912021&host=true&sr_uuid=all&cf=AVERAGE&interval=60`
- Different consolidation functions can be used, such as **AVERAGE**, **MIN**, **MAX** or **LAST**. See the following sections for details on specific objects and how to set it.
- Once the metrics are retrieved, `perfmon` checks all its triggers and generates alarms if needed.
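
To make the query parameters in the URL above easier to read, here is a minimal sketch that assembles the same kind of URL in Python. The helper name `build_rrd_updates_url` and the example session reference are hypothetical, and this is not how `perfmon` itself builds the request; it only reuses the parameter names visible in the URL above.

```python
from urllib.parse import urlencode


def build_rrd_updates_url(session_ref, start, cf="AVERAGE", interval=60,
                          base="http://localhost/rrd_updates"):
    """Assemble an rrd_updates URL with the query parameters shown above."""
    params = {
        "session_id": session_ref,   # session reference (placeholder value below)
        "start": int(start),         # epoch seconds of the first sample wanted
        "host": "true",              # also return host metrics
        "sr_uuid": "all",            # also return metrics for all SRs
        "cf": cf,                    # consolidation function: AVERAGE, MIN, MAX or LAST
        "interval": int(interval),   # sampling interval in seconds
    }
    return f"{base}?{urlencode(params)}"


# Hypothetical session reference, for illustration only.
url = build_rrd_updates_url("OpaqueRef:example", 1759912021)
print(url)
```

Note that `urlencode` percent-encodes reserved characters in the session reference, which is usually what you want in a query string.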

## Specific XAPI objects
### VMs

- To set an alarm on a VM, you need to write an XML string into the `perfmon` key of the VM's `other-config` map. For example, to trigger an alarm when the CPU usage is higher than 50%, run:
```sh
xe vm-param-set uuid=<UUID> other-config:perfmon='<config> <variable> <name value="cpu_usage"/> <alarm_trigger_level value="0.5"/> </variable> </config>'
```

- Then, you can either wait until the new configuration is read by the `perfmon` daemon or force a refresh by running:
```sh
xe host-call-plugin host-uuid=<UUID> plugin=perfmon fn=refresh
```

- Now, if you generate some load inside the VM and the CPU usage goes above 50%, the `perfmon` daemon will create a message (a XAPI object) with the name **ALARM**. This message will include a _priority_, a _timestamp_, an _obj-uuid_ and a _body_. To list all messages that are alarms, run:
```sh
xe message-list name=ALARM
```

- You will see, for example:
```sh
uuid ( RO) : dadd7cbc-cb4e-5a56-eb0b-0bb31c102c94
name ( RO): ALARM
priority ( RO): 3
class ( RO): VM
obj-uuid ( RO): ea9efde2-d0f2-34bb-74cb-78c303f65d89
timestamp ( RO): 20251007T11:30:26Z
body ( RO): value: 0.986414
config:
<variable>

<name value="cpu_usage"/>

<alarm_trigger_level value="0.5"/>

</variable>
```
- where the _body_ contains all the relevant information: the value that triggered the alarm and the configuration of your alarm.
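
If you want to process alarms in a script, the body can be split into its two parts. The following is a minimal sketch that assumes the body always has the `value: ...` line followed by `config:` and a single `<variable>` element, as in the example output above; the function name `parse_alarm_body` is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_alarm_body(body):
    """Split an ALARM message body into (value, {child_node: value})."""
    # Everything before "config:" holds the value, everything after is XML.
    head, _, config_xml = body.partition("config:")
    value = float(head.split("value:", 1)[1].strip())
    variable = ET.fromstring(config_xml.strip())
    # Each child node carries its setting in a "value" attribute.
    settings = {child.tag: child.get("value") for child in variable}
    return value, settings


body = """value: 0.986414
config:
<variable>
    <name value="cpu_usage"/>
    <alarm_trigger_level value="0.5"/>
</variable>"""

value, settings = parse_alarm_body(body)
print(value, settings)
```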

- When configuring your alarm, your XML string can:
  - have multiple `<variable>` nodes
  - use the following values for child nodes:
    * **name**: what to call the variable (no default)
    * **alarm_priority**: the priority of the messages generated (default '3')
    * **alarm_trigger_level**: level of value that triggers an alarm (no default)
    * **alarm_trigger_sense**: 'high' if alarm_trigger_level is a max, otherwise 'low'. (default 'high')
    * **alarm_trigger_period**: number of seconds of 'bad' values before an alarm is sent (default '60')
    * **alarm_auto_inhibit_period**: number of seconds this alarm is disabled after an alarm is sent (default '3600')
    * **consolidation_fn**: how to combine variables from rrd_updates into one value (default is 'average' for 'cpu_usage', 'get_percent_fs_usage' for 'fs_usage', 'get_percent_log_fs_usage' for 'log_fs_usage', 'get_percent_mem_usage' for 'mem_usage', and 'sum' for everything else)
    * **rrd_regex**: matches the names of variables from `xe vm-data-source-list uuid=$vmuuid` used to compute the value (only has defaults for "cpu_usage", "network_usage", and "disk_usage")

- Note that `alarm_priority` will be the priority of the generated `message`, 0 being low priority.
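
Writing the XML by hand gets error-prone once several child nodes are involved. As an illustration, here is a minimal sketch that generates the `<config>` string from Python dictionaries and prints a matching `xe` command. The helper `build_perfmon_config` is hypothetical and not part of XAPI; the child node names are the ones listed above.

```python
import shlex
import xml.etree.ElementTree as ET


def build_perfmon_config(variables):
    """Return a <config> XML string from a list of {child_node: value} dicts."""
    config = ET.Element("config")
    for settings in variables:
        variable = ET.SubElement(config, "variable")
        for node, value in settings.items():
            # Every setting is stored in the "value" attribute of its node.
            ET.SubElement(variable, node, {"value": str(value)})
    return ET.tostring(config, encoding="unicode")


xml_config = build_perfmon_config([
    {"name": "cpu_usage", "alarm_trigger_level": 0.5, "alarm_trigger_period": 120},
])
# shlex.quote protects the XML from the shell; <UUID> is left as a placeholder.
print("xe vm-param-set uuid=<UUID> other-config:perfmon=" + shlex.quote(xml_config))
```

Passing several dictionaries produces several `<variable>` nodes in the same `<config>`.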

### SRs

- To set an alarm on an SR object, as with VMs, you need to write an XML string into the `other-config` key of the SR. For example, you can run:
```sh
xe sr-param-set uuid=<UUID> other-config:perfmon='<config><variable><name value="physical_utilisation"/><alarm_trigger_level value="0.8"/></variable></config>'
```
- When configuring your alarm, the XML string supports the same child elements as for VMs.

### Hosts

- As with VMs and SRs, alarms can be configured by writing an XML string into an `other-config` key. For example, you can run:
```sh
xe host-param-set uuid=<UUID> other-config:perfmon=\
'<config><variable><name value="cpu_usage"/><alarm_trigger_level value="0.5"/></variable></config>'
```

- The XML string can include multiple `<variable>` nodes.
- The full list of supported child nodes is:
  * **name**: what to call the variable (no default)
  * **alarm_priority**: the priority of the messages generated (default '3')
  * **alarm_trigger_level**: level of value that triggers an alarm (no default)
  * **alarm_trigger_sense**: 'high' if alarm_trigger_level is a max, otherwise 'low'. (default 'high')
  * **alarm_trigger_period**: number of seconds of 'bad' values before an alarm is sent (default '60')
  * **alarm_auto_inhibit_period**: number of seconds this alarm is disabled after an alarm is sent (default '3600')
  * **consolidation_fn**: how to combine variables from rrd_updates into one value (default is 'average' for 'cpu_usage' and 'sum' for everything else)
  * **rrd_regex**: matches the names of variables from `xe host-data-source-list uuid=<UUID>` used to compute the value (only has defaults for "cpu_usage", "network_usage", "memory_free_kib" and "sr_io_throughput_total_xxxxxxxx", where the last one ends with the first eight characters of the SR UUID)

- As a special case for SR throughput, it is also possible to configure a Host by writing XML into the `other-config` key of an SR connected to it. For example:
```sh
xe sr-param-set uuid=$sruuid other-config:perfmon=\
'<config><variable><name value="sr_io_throughput_total_per_host"/><alarm_trigger_level value="0.01"/></variable></config>'
```
- This only works for that specific variable name, and `rrd_regex` must not be specified.
- Configuration done directly on the host (variable name `sr_io_throughput_total_xxxxxxxx`) takes priority.
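
For the host-level variant, the variable name ends with the first eight characters of the SR UUID. A minimal sketch of how that name can be derived is shown below; the helper name is hypothetical and the UUID is an arbitrary example value.

```python
def sr_throughput_variable(sr_uuid):
    """Return the host-level variable name for an SR's I/O throughput."""
    # The suffix is the first eight characters of the SR UUID.
    return "sr_io_throughput_total_" + sr_uuid[:8]


# Arbitrary example UUID, for illustration only.
name = sr_throughput_variable("5a445deb-0a8e-c6fe-24c8-09a0508bbe21")
print(name)
```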

## Which metrics are available?

- Accepted names for metrics are:
  - **cpu_usage**: matches RRD metrics with the pattern `cpu[0-9]+`
  - **network_usage**: matches RRD metrics with the pattern `vif_[0-9]+_[rt]x`
  - **disk_usage**: matches RRD metrics with the pattern `vbd_(xvd|hd)[a-z]+_(read|write)`
  - **fs_usage**, **log_fs_usage**, **mem_usage** and **memory_internal_free** do not match anything by default.
- By using `rrd_regex`, you can add your own expressions. To get a list of available metrics with their descriptions, you can call the `get_data_sources` method for [VM](https://xapi-project.github.io/new-docs/xen-api/classes/vm/), for [SR](https://xapi-project.github.io/new-docs/xen-api/classes/sr/) and also for [Host](https://xapi-project.github.io/new-docs/xen-api/classes/host/).
- A Python script is provided at the end of this page to get data sources. Using the script, we can see, for example:
```sh
# ./get_data_sources.py --vm 5a445deb-0a8e-c6fe-24c8-09a0508bbe21

List of data sources related to VM 5a445deb-0a8e-c6fe-24c8-09a0508bbe21
cpu0 | CPU0 usage
cpu_usage | Domain CPU usage
memory | Memory currently allocated to VM
memory_internal_free | Memory used as reported by the guest agent
memory_target | Target of VM balloon driver
...
vbd_xvda_io_throughput_read | Data read from the VDI, in MiB/s
...
```
- You can then set up an alarm when the data read from a VDI exceeds a certain level by doing:
```sh
xe vm-param-set uuid=5a445deb-0a8e-c6fe-24c8-09a0508bbe21 \
other-config:perfmon='<config><variable> \
<name value="disk_usage"/> \
<alarm_trigger_level value="10"/> \
<rrd_regex value="vbd_xvda_io_throughput_read"/> \
</variable> </config>'
```
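
Before setting an `rrd_regex`, it can help to check which data source names a pattern would select. The following is a minimal sketch using the default patterns listed above. It assumes a pattern must match the whole data source name, which may differ from how `perfmon` applies the expression, and the helper name `matching_sources` is hypothetical.

```python
import re

# Default patterns as listed in this section.
DEFAULT_PATTERNS = {
    "cpu_usage": r"cpu[0-9]+",
    "network_usage": r"vif_[0-9]+_[rt]x",
    "disk_usage": r"vbd_(xvd|hd)[a-z]+_(read|write)",
}


def matching_sources(pattern, names):
    """Return the data source names matched in full by the pattern (assumption)."""
    return [name for name in names if re.fullmatch(pattern, name)]


# Sample names, partly taken from the script output shown earlier.
names = ["cpu0", "cpu_usage", "memory", "vif_0_rx",
         "vbd_xvda_read", "vbd_xvda_io_throughput_read"]

for metric, pattern in DEFAULT_PATTERNS.items():
    print(metric, matching_sources(pattern, names))

# A custom rrd_regex selecting a single data source.
print(matching_sources(r"vbd_xvda_io_throughput_read", names))
```

Under this assumption the default `disk_usage` pattern does not select `vbd_xvda_io_throughput_read`, which is why the example above sets `rrd_regex` explicitly.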
- Here is the script that allows you to get data sources:
```python
#!/usr/bin/env python3

import argparse
import sys
import XenAPI


def pretty_print(data_sources):
    if not data_sources:
        print("No data sources.")
        return

    # Compute alignment for something nice
    max_label_len = max(len(data["name_label"]) for data in data_sources)

    for data in data_sources:
        label = data["name_label"]
        desc = data["name_description"]
        print(f"{label:<{max_label_len}} | {desc}")


def list_vm_data(session, uuid):
    vm_ref = session.xenapi.VM.get_by_uuid(uuid)
    data_sources = session.xenapi.VM.get_data_sources(vm_ref)
    print(f"\nList of data sources related to VM {uuid}")
    pretty_print(data_sources)


def list_host_data(session, uuid):
    host_ref = session.xenapi.host.get_by_uuid(uuid)
    data_sources = session.xenapi.host.get_data_sources(host_ref)
    print(f"\nList of data sources related to Host {uuid}")
    pretty_print(data_sources)


def list_sr_data(session, uuid):
    sr_ref = session.xenapi.SR.get_by_uuid(uuid)
    data_sources = session.xenapi.SR.get_data_sources(sr_ref)
    print(f"\nList of data sources related to SR {uuid}")
    pretty_print(data_sources)


def main():
    parser = argparse.ArgumentParser(
        description="List data sources related to VM, host or SR"
    )
    parser.add_argument("--vm", help="VM UUID")
    parser.add_argument("--host", help="Host UUID")
    parser.add_argument("--sr", help="SR UUID")

    args = parser.parse_args()

    # Connect to local XAPI: no identification required to access local socket
    session = XenAPI.xapi_local()

    try:
        session.xenapi.login_with_password("", "")
        if args.vm:
            list_vm_data(session, args.vm)
        if args.host:
            list_host_data(session, args.host)
        if args.sr:
            list_sr_data(session, args.sr)
    except XenAPI.Failure as e:
        print(f"XenAPI call failed: {e.details}")
        sys.exit(1)
    finally:
        session.xenapi.session.logout()


if __name__ == "__main__":
    main()
```
