|
| 1 | ++++ |
| 2 | +title = "How to set up alarms" |
| 3 | +linkTitle = "Alarms" |
| 4 | ++++ |
| 5 | + |
| 6 | +# Introduction |
| 7 | + |
| 8 | +In XAPI, alarms are triggered by a Python daemon located at `/opt/xensource/bin/perfmon`. |
| 9 | +The daemon is managed as a systemd service and can be configured by setting parameters in `/etc/sysconfig/perfmon`. |
| 10 | + |
| 11 | +It listens on an internal Unix socket to receive commands. Otherwise, it runs in a loop, periodically requesting metrics from XAPI. It can then be configured to generate events based on these metrics. It can monitor various types of XAPI objects, including `VMs`, `SRs`, and `Hosts`. The configuration for each object is defined by writing an XML string into the `perfmon` key of the object's `other-config` map. |
| 12 | + |
| 13 | +The metrics used by `perfmon` are collected by the `xcp-rrdd` daemon. The `xcp-rrdd` daemon is a component of XAPI responsible for collecting metrics and storing them as Round-Robin Databases (RRDs). |
| 14 | + |
| 15 | +A XAPI plugin also exists, providing the functions `refresh` and `debug_mem`, which send commands through the Unix socket. The `refresh` function is used when an `other-config` key is added or updated; it triggers the daemon to reread the monitored objects so that new alerts are taken into account. The `debug_mem` function logs the objects currently being monitored into `/var/log/user.log` as a dictionary. |
| 16 | + |
| 17 | +# Monitoring and alarms |
| 18 | + |
| 19 | +## Overview |
| 20 | + |
| 21 | +- To get the metrics, `perfmon` queries XAPI by calling: `http://localhost/rrd_updates?session_id=<ref>&start=1759912021&host=true&sr_uuid=all&cf=AVERAGE&interval=60` |
| 22 | +- Several consolidation functions can be used, such as **AVERAGE**, **MIN**, **MAX** or **LAST**. See the next sections for details on specific objects and how to set them. |
| 23 | +- Once the metrics are retrieved, `perfmon` checks all its triggers and generates alarms if needed. |
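
As an illustration of the request described above, here is a minimal Python sketch that assembles the same query string. The helper name `build_rrd_updates_url` and the placeholder session reference are hypothetical; only the parameter names and values come from the URL shown above, and this is not how `perfmon` itself is implemented.

```python
from urllib.parse import urlencode


def build_rrd_updates_url(session_ref, start, cf="AVERAGE", interval=60,
                          base="http://localhost"):
    """Build an rrd_updates URL using the query parameters shown above."""
    params = {
        "session_id": session_ref,   # reference of an open XAPI session
        "start": int(start),         # Unix timestamp of the first sample wanted
        "host": "true",              # fixed to the value used in the example URL
        "sr_uuid": "all",            # fixed to the value used in the example URL
        "cf": cf,                    # consolidation function, e.g. AVERAGE
        "interval": int(interval),   # sampling interval in seconds
    }
    return f"{base}/rrd_updates?{urlencode(params)}"


if __name__ == "__main__":
    # "OpaqueRef:example" is a placeholder, not a real session reference.
    print(build_rrd_updates_url("OpaqueRef:example", 1759912021))
```

Changing `cf` to `MIN`, `MAX` or `LAST` selects another consolidation function for the same request.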
| 24 | + |
| 25 | +## Specific XAPI objects |
| 26 | +### VMs |
| 27 | + |
| 28 | +- To set an alarm on a VM, you need to write an XML string into the `other-config` key of the object. For example, to trigger an alarm when the CPU usage is higher than 50%, run: |
| 29 | +```sh |
| 30 | +xe vm-param-set uuid=<UUID> other-config:perfmon='<config> <variable> <name value="cpu_usage"/> <alarm_trigger_level value="0.5"/> </variable> </config>' |
| 31 | +``` |
| 32 | + |
| 33 | +- Then, you can either wait until the new configuration is read by the `perfmon` daemon or force a refresh by running: |
| 34 | +```sh |
| 35 | +xe host-call-plugin host-uuid=<UUID> plugin=perfmon fn=refresh |
| 36 | +``` |
| 37 | + |
| 38 | +- Now, if you generate some load inside the VM and the CPU usage goes above 50%, the `perfmon` daemon will create a message (a XAPI object) with the name **ALARM**. This message will include a _priority_, a _timestamp_, an _obj-uuid_ and a _body_. To list all messages that are alarms, run: |
| 39 | +```sh |
| 40 | +xe message-list name=ALARM |
| 41 | +``` |
| 42 | + |
| 43 | +- You will see, for example: |
| 44 | +```sh |
| 45 | +uuid ( RO) : dadd7cbc-cb4e-5a56-eb0b-0bb31c102c94 |
| 46 | + name ( RO): ALARM |
| 47 | + priority ( RO): 3 |
| 48 | + class ( RO): VM |
| 49 | + obj-uuid ( RO): ea9efde2-d0f2-34bb-74cb-78c303f65d89 |
| 50 | + timestamp ( RO): 20251007T11:30:26Z |
| 51 | + body ( RO): value: 0.986414 |
| 52 | +config: |
| 53 | +<variable> |
| 54 | + |
| 55 | + <name value="cpu_usage"/> |
| 56 | + |
| 57 | + <alarm_trigger_level value="0.5"/> |
| 58 | + |
| 59 | +</variable> |
| 60 | +``` |
| 61 | +- The _body_ contains all the relevant information: the value that triggered the alarm and the configuration of your alarm. |
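
If you need to process such messages in a script, the body can be split into its two parts. The sketch below assumes the body always follows the layout of the example above (a `value:` line, then `config:` followed by a single `<variable>` element); the function name `parse_alarm_body` is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_alarm_body(body):
    """Split an ALARM message body into (value, {child node name: value attribute})."""
    head, _, config_xml = body.partition("config:")
    # The text before "config:" is expected to look like "value: 0.986414"
    value = float(head.split("value:", 1)[1].strip())
    variable = ET.fromstring(config_xml.strip())
    settings = {child.tag: child.get("value") for child in variable}
    return value, settings


# Body copied from the example message above
example_body = """value: 0.986414
config:
<variable>

<name value="cpu_usage"/>

<alarm_trigger_level value="0.5"/>

</variable>
"""

value, settings = parse_alarm_body(example_body)
print(value, settings)
```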
| 62 | + |
| 63 | +- When configuring your alarm, your XML string can: |
| 64 | +  - have multiple `<variable>` nodes |
| 65 | +  - use the following values for child nodes: |
| 66 | +    * **name**: what to call the variable (no default) |
| 67 | +    * **alarm_priority**: the priority of the messages generated (default '3') |
| 68 | +    * **alarm_trigger_level**: level of value that triggers an alarm (no default) |
| 69 | +    * **alarm_trigger_sense**: 'high' if alarm_trigger_level is a max, otherwise 'low' (default 'high') |
| 70 | +    * **alarm_trigger_period**: number of seconds of 'bad' values before an alarm is sent (default '60') |
| 71 | +    * **alarm_auto_inhibit_period**: number of seconds this alarm is disabled after an alarm is sent (default '3600') |
| 72 | +    * **consolidation_fn**: how to combine variables from rrd_updates into one value (default is 'average' for 'cpu_usage', 'get_percent_fs_usage' for 'fs_usage', 'get_percent_log_fs_usage' for 'log_fs_usage', 'get_percent_mem_usage' for 'mem_usage', and 'sum' for everything else) |
| 73 | +    * **rrd_regex**: regular expression matching the names of variables from `xe vm-data-source-list uuid=<UUID>` used to compute the value (only has defaults for "cpu_usage", "network_usage", and "disk_usage") |
| 74 | + |
| 75 | +- Notice that `alarm_priority` will be the priority of the generated `message`, 0 being low priority. |
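
Writing the XML string by hand is error-prone once several `<variable>` nodes are involved. The following sketch generates it from plain dictionaries; the helper `build_perfmon_config` is hypothetical, and rejecting a variable that lacks `name` or `alarm_trigger_level` is a choice made here because those two nodes have no default, not a documented `perfmon` behaviour.

```python
import xml.etree.ElementTree as ET


def build_perfmon_config(*variables):
    """Return a <config> XML string with one <variable> node per dict given."""
    config = ET.Element("config")
    for settings in variables:
        # These two child nodes have no default, so require them up front
        if "name" not in settings or "alarm_trigger_level" not in settings:
            raise ValueError("name and alarm_trigger_level have no default")
        variable = ET.SubElement(config, "variable")
        for key, val in settings.items():
            # Each setting becomes a child node such as <name value="cpu_usage"/>
            ET.SubElement(variable, key, value=str(val))
    return ET.tostring(config, encoding="unicode")


if __name__ == "__main__":
    print(build_perfmon_config(
        {"name": "cpu_usage", "alarm_trigger_level": 0.5},
        {"name": "network_usage", "alarm_trigger_level": 2, "alarm_priority": 4},
    ))
```

The resulting string can then be passed as the value of `other-config:perfmon` in the `xe vm-param-set` command shown earlier.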
| 76 | + |
| 77 | +### SRs |
| 78 | + |
| 79 | +- To set an alarm on an SR object, as with VMs, you need to write an XML string into the `other-config` key of the SR. For example, you can run: |
| 80 | +```sh |
| 81 | +xe sr-param-set uuid=<UUID> other-config:perfmon='<config><variable><name value="physical_utilisation"/><alarm_trigger_level value="0.8"/></variable></config>' |
| 82 | +``` |
| 83 | +- When configuring your alarm, the XML string supports the same child elements as for VMs. |
| 84 | + |
| 85 | +### Hosts |
| 86 | + |
| 87 | +- As with VMs and SRs, alarms can be configured by writing an XML string into an `other-config` key. For example, you can run: |
| 88 | +```sh |
| 89 | +xe host-param-set uuid=<UUID> \ |
| 90 | +  other-config:perfmon='<config><variable><name value="cpu_usage"/><alarm_trigger_level value="0.5"/></variable></config>' |
| 91 | +``` |
| 92 | + |
| 93 | +- The XML string can include multiple `<variable>` nodes. |
| 94 | +- The full list of supported child nodes is: |
| 95 | +  * **name**: what to call the variable (no default) |
| 96 | +  * **alarm_priority**: the priority of the messages generated (default '3') |
| 97 | +  * **alarm_trigger_level**: level of value that triggers an alarm (no default) |
| 98 | +  * **alarm_trigger_sense**: 'high' if alarm_trigger_level is a max, otherwise 'low' (default 'high') |
| 99 | +  * **alarm_trigger_period**: number of seconds of 'bad' values before an alarm is sent (default '60') |
| 100 | +  * **alarm_auto_inhibit_period**: number of seconds this alarm is disabled after an alarm is sent (default '3600') |
| 101 | +  * **consolidation_fn**: how to combine variables from rrd_updates into one value (default is 'average' for 'cpu_usage' and 'sum' for everything else) |
| 102 | +  * **rrd_regex**: regular expression matching the names of variables from `xe host-data-source-list uuid=<UUID>` used to compute the value (only has defaults for "cpu_usage", "network_usage", "memory_free_kib" and "sr_io_throughput_total_xxxxxxxx", where the last one ends with the first eight characters of the SR UUID) |
| 103 | + |
| 104 | +- As a special case for SR throughput, it is also possible to configure a Host by writing XML into the `other-config` key of an SR connected to it. For example: |
| 105 | +```sh |
| 106 | +xe sr-param-set uuid=$sruuid \ |
| 107 | +  other-config:perfmon='<config><variable><name value="sr_io_throughput_total_per_host"/><alarm_trigger_level value="0.01"/></variable></config>' |
| 108 | +``` |
| 109 | +- This only works for that specific variable name, and `rrd_regex` must not be specified. |
| 110 | +- Configuration done directly on the host (with the variable name `sr_io_throughput_total_xxxxxxxx`) takes priority. |
| 111 | + |
| 112 | +## Which metrics are available? |
| 113 | + |
| 114 | +- Accepted names for metrics are: |
| 115 | +  - **cpu_usage**: matches RRD metrics with the pattern `cpu[0-9]+` |
| 116 | +  - **network_usage**: matches RRD metrics with the pattern `vif_[0-9]+_[rt]x` |
| 117 | +  - **disk_usage**: matches RRD metrics with the pattern `vbd_(xvd|hd)[a-z]+_(read|write)` |
| 118 | +  - **fs_usage**, **log_fs_usage**, **mem_usage** and **memory_internal_free** do not match anything by default. |
| 119 | +- By using `rrd_regex`, you can add your own expressions. To get a list of available metrics with their descriptions, you can call the `get_data_sources` method for [VM](https://xapi-project.github.io/new-docs/xen-api/classes/vm/), for [SR](https://xapi-project.github.io/new-docs/xen-api/classes/sr/) and also for [Host](https://xapi-project.github.io/new-docs/xen-api/classes/host/). |
| 120 | +- A Python script that lists data sources is provided at the end of this section. Using it, we can see, for example: |
| 121 | +```sh |
| 122 | +# ./get_data_sources.py --vm 5a445deb-0a8e-c6fe-24c8-09a0508bbe21 |
| 123 | + |
| 124 | +List of data sources related to VM 5a445deb-0a8e-c6fe-24c8-09a0508bbe21 |
| 125 | +cpu0 | CPU0 usage |
| 126 | +cpu_usage | Domain CPU usage |
| 127 | +memory | Memory currently allocated to VM |
| 128 | +memory_internal_free | Memory used as reported by the guest agent |
| 129 | +memory_target | Target of VM balloon driver |
| 130 | +... |
| 131 | +vbd_xvda_io_throughput_read | Data read from the VDI, in MiB/s |
| 132 | +... |
| 133 | +``` |
| 134 | +- You can then set up an alarm when the data read from a VDI exceeds a certain level by doing: |
| 135 | +```sh |
| 136 | +xe vm-param-set uuid=5a445deb-0a8e-c6fe-24c8-09a0508bbe21 \ |
| 137 | +  other-config:perfmon='<config><variable> |
| 138 | +    <name value="disk_usage"/> |
| 139 | +    <alarm_trigger_level value="10"/> |
| 140 | +    <rrd_regex value="vbd_xvda_io_throughput_read"/> |
| 141 | +  </variable> </config>' |
| 142 | +``` |
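
To see why a custom `rrd_regex` is needed in the command above, the sketch below applies the default patterns listed earlier to a few data source names. It assumes a pattern must match the whole name (`re.fullmatch`), which may differ from how `perfmon` anchors its expressions, and the sample names other than those in the script output above are illustrative.

```python
import re

# Default patterns as listed in this section
DEFAULT_PATTERNS = {
    "cpu_usage": r"cpu[0-9]+",
    "network_usage": r"vif_[0-9]+_[rt]x",
    "disk_usage": r"vbd_(xvd|hd)[a-z]+_(read|write)",
}


def matching_sources(pattern, names):
    """Return the data source names matched in full by the given pattern."""
    return [name for name in names if re.fullmatch(pattern, name)]


# "vif_0_rx" and "vbd_xvda_read" are illustrative names, not taken from real output
names = ["cpu0", "cpu_usage", "memory", "vif_0_rx",
         "vbd_xvda_read", "vbd_xvda_io_throughput_read"]

for variable, pattern in DEFAULT_PATTERNS.items():
    print(variable, matching_sources(pattern, names))

# The throughput source is only picked up with an explicit rrd_regex
print(matching_sources(r"vbd_xvda_io_throughput_read", names))
```

Under that assumption, the default `disk_usage` pattern does not cover `vbd_xvda_io_throughput_read`, so the alarm has to name it explicitly through `rrd_regex`.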
| 143 | +- Here is the script that allows you to get data sources: |
| 144 | +```python |
| 145 | +#!/usr/bin/env python3 |
| 146 | + |
| 147 | +import argparse |
| 148 | +import sys |
| 149 | +import XenAPI |
| 150 | + |
| 151 | + |
| 151 | +def pretty_print(data_sources): |
| 152 | +    if not data_sources: |
| 153 | +        print("No data sources.") |
| 154 | +        return |
| 155 | + |
| 156 | +    # Compute alignment for something nice |
| 157 | +    max_label_len = max(len(data["name_label"]) for data in data_sources) |
| 158 | + |
| 159 | +    for data in data_sources: |
| 160 | +        label = data["name_label"] |
| 161 | +        desc = data["name_description"] |
| 162 | +        print(f"{label:<{max_label_len}} | {desc}") |
| 163 | + |
| 164 | + |
| 165 | +def list_vm_data(session, uuid): |
| 166 | +    vm_ref = session.xenapi.VM.get_by_uuid(uuid) |
| 167 | +    data_sources = session.xenapi.VM.get_data_sources(vm_ref) |
| 168 | +    print(f"\nList of data sources related to VM {uuid}") |
| 169 | +    pretty_print(data_sources) |
| 170 | + |
| 171 | + |
| 172 | +def list_host_data(session, uuid): |
| 173 | +    host_ref = session.xenapi.host.get_by_uuid(uuid) |
| 174 | +    data_sources = session.xenapi.host.get_data_sources(host_ref) |
| 175 | +    print(f"\nList of data sources related to Host {uuid}") |
| 176 | +    pretty_print(data_sources) |
| 177 | + |
| 178 | + |
| 179 | +def list_sr_data(session, uuid): |
| 180 | +    sr_ref = session.xenapi.SR.get_by_uuid(uuid) |
| 181 | +    data_sources = session.xenapi.SR.get_data_sources(sr_ref) |
| 182 | +    print(f"\nList of data sources related to SR {uuid}") |
| 183 | +    pretty_print(data_sources) |
| 184 | + |
| 185 | + |
| 186 | +def main(): |
| 187 | +    parser = argparse.ArgumentParser( |
| 188 | +        description="List data sources related to VM, host or SR" |
| 189 | +    ) |
| 190 | +    parser.add_argument("--vm", help="VM UUID") |
| 191 | +    parser.add_argument("--host", help="Host UUID") |
| 192 | +    parser.add_argument("--sr", help="SR UUID") |
| 193 | + |
| 194 | +    args = parser.parse_args() |
| 195 | + |
| 196 | +    # Connect to local XAPI: no identification required to access local socket |
| 197 | +    session = XenAPI.xapi_local() |
| 198 | + |
| 199 | +    try: |
| 200 | +        session.xenapi.login_with_password("", "") |
| 201 | +        if args.vm: |
| 202 | +            list_vm_data(session, args.vm) |
| 203 | +        if args.host: |
| 204 | +            list_host_data(session, args.host) |
| 205 | +        if args.sr: |
| 206 | +            list_sr_data(session, args.sr) |
| 207 | +    except XenAPI.Failure as e: |
| 208 | +        print(f"XenAPI call failed: {e.details}") |
| 209 | +        sys.exit(1) |
| 210 | +    finally: |
| 211 | +        session.xenapi.session.logout() |
| 212 | + |
| 213 | + |
| 214 | +if __name__ == "__main__": |
| 215 | +    main() |
| 217 | +``` |
| 218 | + |