Commit f260d28

Merge pull request #2962 from sebastiankuttnig/failover
failover.c - UPS Failover Driver
2 parents 0eb2045 + fb3cac7 commit f260d28

File tree

11 files changed

+3009
-28
lines changed

NEWS.adoc

Lines changed: 10 additions & 0 deletions
@@ -169,6 +169,16 @@ https://github.com/networkupstools/nut/milestone/9
169169
This seems to be a protocol developed by Cyber Energy for serial-port
170170
devices, subsequently used by different vendors in their own products
171171
or re-branded Cyber Energy creations. [#2940]
172+
* Introduced a `failover` driver that monitors multiple UPS driver sockets
173+
and seamlessly switches the presented UPS data in a failover situation. It
174+
supports end-to-end tracked instant commands and variable updates.
175+
[#2962]
176+
177+
- The `nut-driver-enumerator.sh` script (NDE) now internally tracks dependency
178+
of one driver on another one that should be locally running to serve the
179+
"original" data points (`clone`, `clone-outlet`, `dummy-ups`, `failover`).
180+
It should create soft dependencies between respective service instances
181+
to order their start-up sequence. [#2962]
172182

173183
- NUT Monitor GUI:
174184
* Ported Python 3 version to Qt6, now shipped alongside Qt5 for systems

docs/man/Makefile.am

Lines changed: 3 additions & 0 deletions
@@ -857,6 +857,7 @@ SRC_SERIAL_PAGES = \
857857
clone.txt \
858858
clone-outlet.txt \
859859
dummy-ups.txt \
860+
failover.txt \
860861
etapro.txt \
861862
everups.txt \
862863
gamatronic.txt \
@@ -909,6 +910,7 @@ INST_MAN_SERIAL_PAGES = \
909910
dummy-ups.$(MAN_SECTION_CMD_SYS) \
910911
etapro.$(MAN_SECTION_CMD_SYS) \
911912
everups.$(MAN_SECTION_CMD_SYS) \
913+
failover.$(MAN_SECTION_CMD_SYS) \
912914
gamatronic.$(MAN_SECTION_CMD_SYS) \
913915
genericups.$(MAN_SECTION_CMD_SYS) \
914916
isbmex.$(MAN_SECTION_CMD_SYS) \
@@ -976,6 +978,7 @@ INST_HTML_SERIAL_MANS = \
976978
dummy-ups.html \
977979
etapro.html \
978980
everups.html \
981+
failover.html \
979982
gamatronic.html \
980983
genericups.html \
981984
isbmex.html \

docs/man/failover.txt

Lines changed: 296 additions & 0 deletions
@@ -0,0 +1,296 @@
1+
FAILOVER(8)
2+
==========
3+
4+
NAME
5+
----
6+
7+
failover - UPS Failover Driver
8+
9+
SYNOPSIS
10+
--------
11+
12+
*failover* -h
13+
14+
*failover* -a 'UPS_NAME' ['OPTIONS']
15+
16+
NOTE: This man page only documents the specific features of the failover driver.
17+
For information about the core driver, see linkman:nutupsdrv[8].
18+
19+
DESCRIPTION
20+
-----------
21+
22+
The `failover` driver acts as a smart proxy for multiple "real" UPS drivers. It
23+
connects to and monitors these underlying UPS drivers through their local UNIX
24+
sockets (or Windows named pipes), continuously evaluating health and suitability
25+
for "primary" duty according to a set of user-configurable rules and priorities.
26+
27+
At any given time, `failover` designates one UPS driver as the *primary*, and
28+
presents its commands, variables and status to the outside world as if it were
29+
directly talking to that UPS. From the perspective of the clients (such as
30+
linkman:upsmon[8] or linkman:upsc[8]), the `failover` driver behaves like any
31+
single UPS, abstracting away the underlying redundancy, and allowing for
32+
seamless transitioning between all monitored UPS drivers and their datasets.
33+
34+
The driver dynamically promotes or demotes the primary UPS driver based on:
35+
36+
- Socket availability and communication status
37+
- Data freshness and UPS online/offline indicators
38+
- User-defined status filters (e.g., presence or absence of `OL`, `LB`, ...)
39+
- Administrative override via control commands (`force.primary`, `force.ignore`)
40+
41+
If the current primary becomes unavailable or no longer meets the criteria, the
42+
driver automatically fails over to a more suitable driver. During transitions,
43+
it ensures that any data is switched out instantly, without the linkman:upsd[8]
44+
considering it as stale or the clients acting on any previously degraded status.
45+
46+
When no suitable primary is available, a configurable fallback state is entered:
47+
48+
- Keep last primary and declare the data as stale
49+
- Raise `ALARM` and declare the data as stale
50+
- Raise `ALARM` and set forced shutdown (`FSD`)
51+
52+
Different communication media can be used to connect to individual UPS drivers
53+
(e.g., USB, Serial, Ethernet). `failover` communicates directly at the socket
54+
level and therefore does not rely on linkman:upsd[8] being active.
55+
56+
EXTRA ARGUMENTS
57+
---------------
58+
59+
This driver supports the following settings:
60+
61+
*port*='drivername-devicename,drivername2-devicename2,...'::
62+
Required. Specifies the local sockets (or Windows named pipes) of the underlying
63+
UPS drivers to be tracked. Entries must either be a path or follow the format
64+
`drivername-devicename`, as used by NUT's internal socket naming convention
65+
(e.g. `usbhid-ups-myups`). Multiple entries are comma-separated with no spaces.
66+
67+
*inittime*='seconds'::
68+
Optional. Sets a grace period after driver startup during which the absence of a
69+
primary is tolerated. This allows time for underlying drivers to initialize. For
70+
networked connections or drivers that require "lock-picking" their communication
71+
protocol, consider increasing this value to accommodate potential longer delays.
72+
Defaults to 30 seconds.
73+
74+
*deadtime*='seconds'::
75+
Optional. Sets a grace period in seconds after which a non-responsive UPS driver
76+
is considered dead. Defaults to 30 seconds.
77+
78+
*relogtime*='seconds'::
79+
Optional. Time interval in which repeated connection failure logs are emitted
80+
for a UPS, reducing log spam during unstable conditions. Defaults to 5 seconds.
81+
82+
*noprimarytime*='seconds'::
83+
Optional. Duration to wait without a suitable primary UPS driver before entering
84+
the configured fallback mode (`fsdmode`). Defaults to 15 seconds.
85+
86+
*maxconnfails*='count'::
87+
Optional. Number of consecutive connection failures allowed per UPS driver
88+
before entering the cooldown period (`coolofftime`). Defaults to 5.
89+
90+
*coolofftime*='seconds'::
91+
Optional. Cooldown period during which the driver pauses reconnect attempts
92+
after exceeding `maxconnfails`. Defaults to 15 seconds.
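
The interplay of `maxconnfails` and `coolofftime` can be sketched in Python as follows. This is only an illustration under assumptions, not the driver's C code: the class name `ReconnectGate`, its methods, and the choice to start the cooldown once the failure count reaches `maxconnfails` (rather than exceeds it) are hypothetical.

```python
class ReconnectGate:
    """Hypothetical per-socket reconnect throttle (illustration only)."""

    def __init__(self, maxconnfails=5, coolofftime=15):
        self.maxconnfails = maxconnfails
        self.coolofftime = coolofftime
        self.fails = 0             # consecutive connection failures so far
        self.cooloff_until = None  # timestamp at which the cooldown ends

    def may_connect(self, now):
        """Return True if a reconnect attempt is allowed at time `now`."""
        if self.cooloff_until is not None:
            if now < self.cooloff_until:
                return False
            # Cooldown is over: start counting failures from scratch.
            self.cooloff_until = None
            self.fails = 0
        return True

    def record(self, ok, now):
        """Record the outcome of one connection attempt."""
        if ok:
            self.fails = 0
            return
        self.fails += 1
        # Assumption: the cooldown starts when the limit is reached.
        if self.fails >= self.maxconnfails:
            self.cooloff_until = now + self.coolofftime
```

With the documented defaults, five consecutive failures would pause reconnect attempts for 15 seconds in this model.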
93+
94+
*fsdmode*='0|1|2'::
95+
Optional. Defines the behavior when no suitable primary UPS driver is found
96+
after `noprimarytime` has elapsed. Defaults to 0.
97+
98+
- `0`: *Do not demote the last primary, but mark its data as stale.* This is
99+
similar to how a regular UPS driver would behave when it loses its connection to
100+
the target UPS device. linkman:upsmon[8] will act on the last known (online or
101+
not) status, and decide itself whether that UPS should be considered critical.
102+
103+
- `1`: *Demote the primary, raise `ALARM`, and mark the data as stale after an
104+
additional few seconds have elapsed (ensuring full propagation).* This will
105+
cause linkman:upsmon[8] to detect that a device previously in an alarm state has
106+
lost its connection, consider the UPS driver critical, and possibly trigger a
107+
forced shutdown (`FSD`) due to depletion of `MINSUPPLIES`.
108+
109+
- `2`: *Demote the primary, raise `ALARM`, and immediately set `FSD`.* This will
110+
set `FSD` from the driver side and preempt linkman:upsmon[8] from raising it
111+
itself. This mode is for setups where an immediate shutdown is warranted
112+
regardless of anything else, and `FSD` must reach the clients as fast as
113+
possible.
114+
115+
*checkruntime*='0|1|2|3'::
116+
Optional. Controls how `battery.runtime` values are used to break ties between
117+
non-fully-online UPS devices **at priority 3 or lower**. Has no effect on
118+
initial priority selection or when `strictfiltering` is enabled. Defaults to 1.
119+
120+
- `0`: *Disabled.* No runtime comparison is done. The first candidate with the
121+
best priority is selected according to the order of the port argument.
122+
123+
- `1`: *Compare `battery.runtime`.* The UPS with the higher value is preferred.
124+
If the value is missing or invalid, the UPS cannot win the tie-break.
125+
126+
- `2`: *Compare `battery.runtime.low`.* The UPS with the higher value is
127+
preferred. If the value is missing or invalid, the UPS cannot win the tie-break.
128+
129+
- `3`: *Compare both variables strictly.* The UPS is preferred only if it has
130+
both a higher `battery.runtime` and `battery.runtime.low` value. If either is
131+
missing or invalid, the UPS cannot win the tie-break.
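
The four `checkruntime` modes can be read as one small comparison rule. The Python sketch below is an illustration under assumptions, not the driver's implementation: the names `wins_tiebreak` and `_runtime`, the dict-based data model, the treatment of non-numeric or negative values as "invalid", and letting a candidate with a valid value win against a competitor whose value is missing are all hypothetical.

```python
def _runtime(ups, key):
    """Return the value as float, or None if missing or invalid.

    Assumption: "invalid" means non-numeric, NaN, or negative.
    """
    try:
        value = float(ups.get(key))
    except (TypeError, ValueError):
        return None
    return value if value >= 0 else None


def wins_tiebreak(candidate, current, mode):
    """Return True if `candidate` should be preferred over `current`."""
    if mode == 0:
        return False  # disabled: the order of the port argument decides
    keys = {
        1: ["battery.runtime"],
        2: ["battery.runtime.low"],
        3: ["battery.runtime", "battery.runtime.low"],
    }[mode]
    for key in keys:
        cand = _runtime(candidate, key)
        cur = _runtime(current, key)
        if cand is None:
            return False  # missing or invalid: cannot win the tie-break
        if cur is not None and cand <= cur:
            return False  # must be strictly higher on every compared key
    return True
```

For example, `wins_tiebreak({"battery.runtime": "600"}, {"battery.runtime": "300"}, 1)` prefers the candidate, while mode `3` additionally requires a higher `battery.runtime.low`.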
132+
133+
*strictfiltering*='0|1':: Optional. If set to 1, only UPS drivers matching the
134+
configured status filters are considered for promotion to primary. If set to 0,
135+
the hard-coded default logic is also considered when no status filters match
136+
(read more about this in the section `PRIORITIES`). Defaults to 0.
137+
138+
*status_have_any*='OL,CHRG,...'::
139+
Optional. If any of these comma-separated tokens are present in a UPS driver's
140+
`ups.status`, it passes this status filtering criterion. Defaults to unset.
141+
142+
*status_have_all*='OL,CHRG,...'::
143+
Optional. All listed comma-separated tokens must be present in `ups.status` for
144+
the UPS driver to pass this status filtering criterion. Defaults to unset.
145+
146+
*status_nothave_any*='OB,OFF,...'::
147+
Optional. If any of these comma-separated tokens are present in `ups.status`,
148+
the UPS driver does not pass this status filtering criterion. Defaults to unset.
149+
150+
*status_nothave_all*='OB,LB,...'::
151+
Optional. If all of these comma-separated tokens are present in `ups.status`,
152+
the UPS driver does not pass this status filtering criterion. Defaults to unset.
153+
154+
NOTE: The `status_*` arguments are primarily intended to adjust the weighting of
155+
UPS drivers, allowing some to be prioritized over others based on their status.
156+
For example, a driver reporting `OL` might be preferred over one reporting
157+
`ALARM OL`. While `strictfiltering` can be enabled, status filters are most
158+
effective when used in combination with the default set of connectivity-based
159+
`PRIORITIES`. For more details, see the respective section further below.
160+
161+
IMPLEMENTATION
162+
--------------
163+
164+
The `port` argument in linkman:ups.conf[5] should reference the local driver
165+
sockets (or Windows named pipes) that the "real" UPS drivers are using. A basic
166+
default setup with multiple drivers could look like this:
167+
168+
------
169+
[realups]
170+
driver = usbhid-ups
171+
port = auto
172+
173+
[realups2]
174+
driver = usbhid-ups
175+
port = auto
176+
177+
[failover]
178+
driver = failover
179+
port = usbhid-ups-realups,usbhid-ups-realups2
180+
------
181+
182+
Any linkman:upsmon[8] clients would be set to monitor the `failover` UPS.
183+
184+
The driver fully supports setting variables and performing instant commands on
185+
the currently elected primary UPS driver. These requests are proxied, and
186+
end-to-end tracking is possible (linkman:upscmd[1] and linkman:upsrw[1] `-w`).
187+
Some variables and commands are prefixed with `upstream.` in order to clearly
188+
separate the upstream commands from those of `failover` itself.
189+
190+
For your convenience, additional administrative commands are exposed to directly
191+
influence and override the primary election process, e.g. for maintenance:
192+
193+
- `<socketname>.force.ignore [seconds]` prevents the specified UPS driver from
194+
being selected as primary for the given duration, or permanently if a negative
195+
value is used. A value of `0` resets this override and re-enables selection.
196+
197+
- `<socketname>.force.primary [seconds]` forces the specified UPS driver to be
198+
treated with the highest priority for the given duration, or permanently if a
199+
negative value is used. A value of `0` resets this override.
200+
201+
Calling either command without an argument has the same effect as passing `0`,
202+
but only for that specific override - it does not affect the other.
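
The duration semantics shared by `force.ignore` and `force.primary` (positive value, negative value, `0` or no argument) can be modelled as a tiny timer. The Python sketch below is an illustration under assumptions: the class `Override` and its methods are hypothetical and do not mirror the driver's internal data structures.

```python
import time


class Override:
    """One override (ignore or primary) for one socket; illustration only."""

    def __init__(self):
        self.until = None  # None = inactive, float("inf") = permanent

    def set(self, seconds=0, now=None):
        """Apply the command argument; omitting it behaves like passing 0."""
        now = time.monotonic() if now is None else now
        if seconds == 0:
            self.until = None          # reset the override
        elif seconds < 0:
            self.until = float("inf")  # permanent
        else:
            self.until = now + seconds

    def active(self, now=None):
        """Return True while the override is in effect."""
        now = time.monotonic() if now is None else now
        return self.until is not None and now < self.until
```

Keeping one such object per override kind and per socket matches the statement that resetting one override does not affect the other.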
203+
204+
PRIORITIES
205+
----------
206+
207+
As outlined above, primaries are dynamically elected based on their current
208+
state and according to a strict set of user-influenceable priorities, which are:
209+
210+
- `0` (highest): UPS driver was forced to the top by administrative command.
211+
- `1`: UPS driver has passed the user-defined status filters.
212+
- `2`: UPS driver has fresh data and is online (in status `OL`).
213+
- `3`: UPS driver has fresh data, but may not be fully online.
214+
- `4` (lowest): UPS driver is alive, but may not have fresh data.
215+
216+
The UPS driver with the highest calculated priority is chosen as primary; ties
217+
are resolved through order of the socket names given within the `port` argument.
218+
219+
For the user-defined status filters, the following internal order is respected:
220+
221+
1. `status_nothave_any` (first)
222+
2. `status_have_all`
223+
3. `status_nothave_all`
224+
4. `status_have_any` (last)
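
The evaluation order above can be sketched as a single Python function. This is a simplified illustration under assumptions: the function name `passes_filters` is hypothetical, unset filters are assumed to be skipped, and a UPS driver is assumed to pass only if every configured filter passes.

```python
def passes_filters(status, have_any=None, have_all=None,
                   nothave_any=None, nothave_all=None):
    """Check a `ups.status` string against the configured token lists."""
    tokens = set(status.split())
    # 1. status_nothave_any: fail if any listed token is present.
    if nothave_any and tokens & set(nothave_any):
        return False
    # 2. status_have_all: fail unless every listed token is present.
    if have_all and not set(have_all) <= tokens:
        return False
    # 3. status_nothave_all: fail if all listed tokens are present.
    if nothave_all and set(nothave_all) <= tokens:
        return False
    # 4. status_have_any: fail unless at least one listed token is present.
    if have_any and not tokens & set(have_any):
        return False
    return True
```

For instance, with `have_any=["OL"]` and `nothave_any=["ALARM"]`, a status of `OL` passes while `ALARM OL` does not.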
225+
226+
If `strictfiltering` is enabled, priorities 2 to 4 are not applicable.
227+
228+
If no user-defined status filters are set, priority 1 is not applicable.
229+
230+
NOTE: The base requirement for any election is the UPS socket being connectable
231+
and the UPS driver having published at least one full batch of data during its
232+
lifetime. UPS drivers not fulfilling that requirement are always disqualified.
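
Taken together, the priority rules of this section could be modelled roughly as shown below. The Python sketch is an illustration under assumptions and not the driver's actual election code: the function names, the boolean keys (`eligible`, `forced`, `passes_filters`, `fresh`, `online`), and the omission of the `checkruntime` tie-break are all hypothetical simplifications.

```python
def priority_of(ups, filters_set=False, strict=False):
    """Return the priority 0..4 of one UPS driver, or None if disqualified."""
    if not ups.get("eligible"):
        return None  # base requirement: connectable, one full batch of data
    if ups.get("forced"):
        return 0     # forced to the top by administrative command
    if filters_set and ups.get("passes_filters"):
        return 1     # only applicable if status filters are configured
    if strict:
        return None  # strictfiltering: priorities 2 to 4 do not apply
    if ups.get("fresh"):
        return 2 if ups.get("online") else 3
    return 4         # alive, but may not have fresh data


def elect_primary(upses, filters_set=False, strict=False):
    """Pick the best socket name from (name, state) pairs in `port` order."""
    best_name, best_prio = None, None
    for name, ups in upses:
        prio = priority_of(ups, filters_set, strict)
        if prio is None:
            continue
        # Strict "<" keeps the earlier entry on ties (order of `port`).
        if best_prio is None or prio < best_prio:
            best_name, best_prio = name, prio
    return best_name
```

Because a lower number means a higher priority, the loop keeps the first entry it sees for any given priority, which reproduces the documented tie resolution by `port` order.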
233+
234+
RATIONALE
235+
---------
236+
237+
In complex power environments, presenting a single, consistent source of UPS
238+
information to linkman:upsmon[8] is sometimes preferable to monitoring multiple
239+
independent drivers directly. The `failover` driver serves as a bridge, allowing
240+
linkman:upsmon[8] to make decisions based on the most suitable available data,
241+
without having to interpret conflicting inputs or degraded sources.
242+
243+
Originally designed for use cases such as dual-PSU systems or redundant
244+
communication paths to a single UPS, `failover` also supports more advanced
245+
setups - for example, when multiple UPSes feed a shared downstream load (via
246+
STS/ATS switches), or when drivers vary in reliability. In these cases, the
247+
driver can be combined with external logic or scripting to dynamically adjust
248+
primary selection and facilitate graceful degradation. Such setups may also
249+
benefit from further integration with the `clone` family of drivers, such as
250+
linkman:clone[8] or linkman:clone-outlet[8], for greater granularity and
251+
monitoring control down to the outlet level.
252+
253+
Additionally, in more niche scenarios, some third-party NUT integrations or
254+
graphical interfaces may be limited to monitoring a single UPS device. In such
255+
cases, `failover` can help by exposing only the most relevant or
256+
highest-priority data source, allowing those tools to operate within their
257+
constraints without missing critical information.
258+
259+
Ultimately, this driver enables more nuanced power monitoring and control than
260+
binary online/offline logic alone, allowing administrators to respond to
261+
degraded conditions early - before they escalate into critical events or require
262+
linkman:upsmon[8] to take action.
263+
264+
LIMITATIONS
265+
-----------
266+
267+
When using `failover` for redundancy between multiple UPS drivers connected to
268+
the same underlying UPS device, data is not multiplexed between the drivers. As
269+
a result, some data points may be available in some drivers but not in others.
270+
271+
For `checkruntime` considerations, the unit of both `battery.runtime` and
272+
`battery.runtime.low` is assumed to be **seconds**. UPS drivers that report
273+
these values using different units are considered non-compliant with the NUT
274+
variable standards and should be reported to the NUT developers as faulty.
275+
276+
AUTHOR
277+
------
278+
279+
Sebastian Kuttnig <[email protected]>
280+
281+
SEE ALSO
282+
--------
283+
284+
linkman:upscmd[1],
285+
linkman:upsrw[1],
286+
linkman:ups.conf[5],
287+
linkman:upsc[8],
288+
linkman:upsmon[8],
289+
linkman:nutupsdrv[8],
290+
linkman:clone[8],
291+
linkman:clone-outlet[8]
292+
293+
Internet Resources:
294+
~~~~~~~~~~~~~~~~~~~
295+
296+
The NUT (Network UPS Tools) home page: https://www.networkupstools.org/

docs/man/nut-driver-enumerator.txt

Lines changed: 27 additions & 10 deletions
@@ -19,16 +19,24 @@ SYNOPSIS
1919
DESCRIPTION
2020
-----------
2121

22-
*nut-driver-enumerator.sh* implements the set-up and querying of the
23-
mapping between NUT driver configuration sections for each individual
24-
monitored device, and the operating system service management framework
25-
service instances into which such drivers are wrapped for independent
26-
execution and management (on platforms where NUT currently supports
27-
this integration -- currently this covers Linux distributions with
28-
systemd and systems derived from Solaris 10 codebase, including
29-
proprietary Sun/Oracle Solaris and numerous open-source illumos
30-
distributions with SMF). It may be not installed in packaging for
31-
other operating systems.
22+
The *nut-driver-enumerator.sh* (also known as "NDE") script implements the
23+
set-up and querying of the mapping between NUT driver configuration sections
24+
for each individual monitored device, and the service instances of an
25+
operating system service management framework (on platforms where NUT already
26+
supports this integration -- currently this covers Linux distributions with
27+
systemd and systems derived from Solaris 10 codebase, including proprietary
28+
Sun/Oracle Solaris and numerous open-source illumos distributions with SMF),
29+
into which such drivers are wrapped for independent execution and management.
30+
It may not be installed in packaging for other operating systems.
31+
32+
With each NUT driver represented as a separate service instance, dependencies
33+
can be defined (e.g. networked drivers must start after the network ability
34+
appears in the OS, but USB/Serial drivers should not wait for that), and they
35+
can fail or be brought into maintenance independently (unlike a monolithic
36+
service based on linkman:upsdrvctl[8] requiring everything configured to be
37+
started). For a few special drivers like linkman:dummy-ups[8], linkman:clone[8],
38+
linkman:clone-outlet[8], and linkman:failover[8] this may also involve a
39+
dependency between service instances of different NUT drivers themselves.
3240

3341
This script provides a uniform interface for further NUT tools
3442
such as linkman:upsdrvsvcctl[8] to implement their logic as
@@ -42,6 +50,15 @@ hides is the difference of rules for valid service instance names
4250
in various frameworks, as well as system tools and naming patterns
4351
involved.
4452

53+
Depending on the platform, the script may also be wrapped by different service
54+
unit types to run automatically (e.g. upon system start-up, or regularly to
55+
pick up changes of linkman:ups.conf[5] soon after it is edited, or integrated
56+
with a file system monitor to be triggered when the configuration is changed).
57+
Some of these modes make sense for use-cases with a rarely (if ever) changing
58+
population of power devices, e.g. a home or small-office UPS monitored the same
59+
way for years at a time; others can help automate a data-center monitoring
60+
system where device deployments (or discovery) can be much more dynamic.
61+
4562
COMMANDS
4663
--------
4764
