failover.c - UPS Failover Driver #2962

Merged 44 commits on May 30, 2025
Commits
8dae3b5
drivers/, docs/: introduce failover driver
sebastiankuttnig May 19, 2025
5855e6e
server/netget.c: rewrite upstream prefix for proxying drivers
sebastiankuttnig May 19, 2025
fe08563
NEWS.adoc: introduce failover driver
sebastiankuttnig May 19, 2025
bf3b9e2
drivers/failover.c, NEWS.adoc: fixes for compiler warnings, add PR nu…
sebastiankuttnig May 19, 2025
961fbc2
drivers/failover.c: add shutdown non-handling
sebastiankuttnig May 19, 2025
6b33157
drivers/failover.c: free parse_port_argument() tmp on premature exit …
sebastiankuttnig May 19, 2025
e5f3631
drivers/failover.c: clean dstate after fsdmode 0, remove now freed va…
sebastiankuttnig May 19, 2025
2fac5e1
drivers/failover.c: preserve which port value failed the argument par…
sebastiankuttnig May 19, 2025
f4d8ab7
drivers/failover.c: make defensive freeing consistent throughout the …
sebastiankuttnig May 19, 2025
2c74214
drivers/failover.c: reword shutdown to be more clear
sebastiankuttnig May 22, 2025
1ad9b26
drivers/failover.c: use NUT_STRARG helper for null checks in various …
sebastiankuttnig May 23, 2025
683559c
drivers/failover.c: use enum for priorities
sebastiankuttnig May 23, 2025
fb3d251
drivers/failover.{c,h}: introduce failover.h for defines, typedefs
sebastiankuttnig May 23, 2025
18ac174
drivers/failover.c: do not fatalx on no connectable drivers, keep try…
sebastiankuttnig May 23, 2025
be52d0b
drivers/failover.c: remove redundant _init() calls for status/alarm
sebastiankuttnig May 23, 2025
f3b1541
drivers/failover.c: show truncation content at end of log message
sebastiankuttnig May 23, 2025
53359b7
drivers/failover.c: safeguard ups_promote_primary against NULL or dou…
sebastiankuttnig May 23, 2025
9f78e4c
docs/man/failover.txt: polish documentation and add rationale
sebastiankuttnig May 23, 2025
022342c
drivers/failover.c: remove progname from non fatal log message
sebastiankuttnig May 23, 2025
a6fda90
docs/man/failover.txt: make hyphens consistent
sebastiankuttnig May 23, 2025
22c8465
docs/man/failover.txt: add note to factor in network or lock-picking …
sebastiankuttnig May 23, 2025
44cabf3
docs/man/failover.txt: add limitations
sebastiankuttnig May 23, 2025
b799a9d
Merge branch 'master' into failover
jimklimov May 23, 2025
5321a55
docs/man/failover.txt: add 3rd party tool use case for rationale
sebastiankuttnig May 23, 2025
953d368
drivers/Makefile.am: add failover.h for dists
sebastiankuttnig May 24, 2025
12e7da4
docs/man/failover.txt: fix incompatible characters
sebastiankuttnig May 24, 2025
825edd2
drivers/failover.c: safer string to numeric conversions, improved arg…
sebastiankuttnig May 26, 2025
2c68099
drivers/failover.c: remove magic -1 from str_arg_to_int(), use INT_MI…
sebastiankuttnig May 26, 2025
95ea397
scripts/upsdrvsvcctl/nut-driver-enumerator.sh.in: add support for "dr…
jimklimov May 22, 2025
f001318
Merge branch 'failover' of github.com:sebastiankuttnig/nut into failover
sebastiankuttnig May 26, 2025
3625b15
drivers/failover.c: make csv_arg_to_array() more reusable
sebastiankuttnig May 26, 2025
36204e8
scripts/upsdrvsvcctl/nut-driver-enumerator.sh.in: report if other dev…
jimklimov May 26, 2025
962bac4
drivers/failover.c: use str_to_int() also in instcmd()
sebastiankuttnig May 26, 2025
505853f
Merge branch 'master' into failover
sebastiankuttnig May 28, 2025
784ce12
drivers/failover.{c,h}, docs/man/failover.txt: use _sockfn() for one-…
sebastiankuttnig May 28, 2025
ff08659
tests/nut-driver-enumerator-test.sh: reflect recent enumerator change…
sebastiankuttnig May 28, 2025
54c604b
NEWS.adoc: mention NDE change to track inter-driver dependency [#2962]
jimklimov May 28, 2025
89a0db0
docs/man/nut-driver-enumerator.txt: update intro, mention driver-on-d…
jimklimov May 28, 2025
ccaac91
drivers/failover.{c,h}: introduce checkruntime argument
sebastiankuttnig May 28, 2025
c498dd3
docs/man/failover.txt, docs/nut.dict: introduce checkruntime argument
sebastiankuttnig May 28, 2025
d8f171c
drivers/failover.c: minor improvements to order and debug levels
sebastiankuttnig May 28, 2025
9a6f56c
drivers/failover.{c,h}: store runtimes in UPS struct
sebastiankuttnig May 28, 2025
0937f25
drivers/failover.c: improve guarding of ups->status against NULL dere…
sebastiankuttnig May 28, 2025
fb3cac7
drivers/failover.{c,h}: make UPS priorities more readable in code
sebastiankuttnig May 28, 2025
10 changes: 10 additions & 0 deletions NEWS.adoc
@@ -160,6 +160,16 @@ https://github.com/networkupstools/nut/milestone/9
This seems to be a protocol developed by Cyber Energy for serial-port
devices, subsequently used by different vendors in their own products
or re-branded Cyber Energy creations. [#2940]
* Introduced a `failover` driver that monitors multiple UPS driver sockets
and seamlessly switches the presented UPS data in a failover situation.
It supports end-to-end tracked instant commands and variable updates.
[#2962]

- The `nut-driver-enumerator.sh` script (NDE) now internally tracks dependency
of one driver on another one that should be locally running to serve the
"original" data points (`clone`, `clone-outlet`, `dummy-ups`, `failover`).
It should create soft dependencies between respective service instances
to order their start-up sequence. [#2962]

- NUT Monitor GUI:
* Ported Python 3 version to Qt6, now shipped alongside Qt5 for systems
3 changes: 3 additions & 0 deletions docs/man/Makefile.am
@@ -857,6 +857,7 @@ SRC_SERIAL_PAGES = \
clone.txt \
clone-outlet.txt \
dummy-ups.txt \
failover.txt \
etapro.txt \
everups.txt \
gamatronic.txt \
@@ -909,6 +910,7 @@ INST_MAN_SERIAL_PAGES = \
dummy-ups.$(MAN_SECTION_CMD_SYS) \
etapro.$(MAN_SECTION_CMD_SYS) \
everups.$(MAN_SECTION_CMD_SYS) \
failover.$(MAN_SECTION_CMD_SYS) \
gamatronic.$(MAN_SECTION_CMD_SYS) \
genericups.$(MAN_SECTION_CMD_SYS) \
isbmex.$(MAN_SECTION_CMD_SYS) \
@@ -976,6 +978,7 @@ INST_HTML_SERIAL_MANS = \
dummy-ups.html \
etapro.html \
everups.html \
failover.html \
gamatronic.html \
genericups.html \
isbmex.html \
296 changes: 296 additions & 0 deletions docs/man/failover.txt
@@ -0,0 +1,296 @@
FAILOVER(8)
==========

NAME
----

failover - UPS Failover Driver

SYNOPSIS
--------

*failover* -h

*failover* -a 'UPS_NAME' ['OPTIONS']

NOTE: This man page only documents the specific features of the failover driver.
For information about the core driver, see linkman:nutupsdrv[8].

DESCRIPTION
-----------

The `failover` driver acts as a smart proxy for multiple "real" UPS drivers. It
connects to and monitors these underlying UPS drivers through their local UNIX
sockets (or Windows named pipes), continuously evaluating health and suitability
for "primary" duty according to a set of user configurable rules and priorities.

At any given time, `failover` designates one UPS driver as the *primary*, and
presents its commands, variables and status to the outside world as if it were
directly talking to that UPS. From the perspective of the clients (such as
linkman:upsmon[8] or linkman:upsc[8]), the `failover` driver behaves like any
single UPS, abstracting away the underlying redundancy, and allowing for
seamless transitioning between all monitored UPS drivers and their datasets.

The driver dynamically promotes or demotes the primary UPS driver based on:

- Socket availability and communication status
- Data freshness and UPS online/offline indicators
- User-defined status filters (e.g., presence or absence of `OL`, `LB`, ...)
- Administrative override via control commands (`force.primary`, `force.ignore`)

If the current primary becomes unavailable or no longer meets the criteria, the
driver automatically fails over to a more suitable driver. During transitions,
it ensures that the data is switched out instantly, without linkman:upsd[8]
considering it stale or the clients acting on any previously degraded status.

When no suitable primary is available, a configurable fallback state is entered:

- Keep last primary and declare the data as stale
- Raise `ALARM` and declare the data as stale
- Raise `ALARM` and set forced shutdown (`FSD`)

Different communication media can be used to connect to individual UPS drivers
(e.g., USB, Serial, Ethernet). `failover` communicates directly at the socket
level and therefore does not rely on linkman:upsd[8] being active.

EXTRA ARGUMENTS
---------------

This driver supports the following settings:

*port*='drivername-devicename,drivername2-devicename2,...'::
Required. Specifies the local sockets (or Windows named pipes) of the underlying
UPS drivers to be tracked. Entries must either be a path or follow the format
`drivername-devicename`, as used by NUT's internal socket naming convention
(e.g. `usbhid-ups-myups`). Multiple entries are comma-separated with no spaces.
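
As an illustration only, the comma-separated format described above could be
split and checked as in the following Python sketch. The helper name
`split_port_entries` is hypothetical and is not part of the driver; the sketch
only assumes that an entry with leading or trailing whitespace, or an empty
entry, is invalid.

```python
def split_port_entries(port: str) -> list[str]:
    """Split a comma-separated `port` value into individual socket entries."""
    if not port:
        raise ValueError("port must name at least one driver socket")
    entries = port.split(",")
    for entry in entries:
        # Entries are comma-separated with no spaces; empty entries
        # (e.g. from a doubled comma) are treated as errors here.
        if not entry or entry != entry.strip():
            raise ValueError(f"invalid port entry: {entry!r}")
    return entries
```

For example, `split_port_entries("usbhid-ups-realups,usbhid-ups-realups2")`
yields the two socket names in their configured order, which matters later
because ties between candidates are resolved by that order.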

*inittime*='seconds'::
Optional. Sets a grace period after driver startup during which the absence of a
primary is tolerated. This allows time for underlying drivers to initialize. For
networked connections or drivers that require "lock-picking" their communication
protocol, consider increasing this value to accommodate potential longer delays.
Defaults to 30 seconds.

*deadtime*='seconds'::
Optional. Sets a grace period in seconds after which a non-responsive UPS driver
is considered dead. Defaults to 30 seconds.

*relogtime*='seconds'::
Optional. Interval at which repeated connection failure logs are emitted for a
UPS, reducing log spam during unstable conditions. Defaults to 5 seconds.

*noprimarytime*='seconds'::
Optional. Duration to wait without a suitable primary UPS driver before entering
the configured fallback mode (`fsdmode`). Defaults to 15 seconds.

*maxconnfails*='count'::
Optional. Number of consecutive connection failures allowed per UPS driver
before entering the cooldown period (`coolofftime`). Defaults to 5.

*coolofftime*='seconds'::
Optional. Cooldown period during which the driver pauses reconnect attempts
after exceeding `maxconnfails`. Defaults to 15 seconds.
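
The interplay of `maxconnfails` and `coolofftime` can be pictured with the
following Python sketch. It is not the driver's implementation: the class
`ConnTracker` and its methods are hypothetical, and it assumes that the
cooldown starts once the failure count reaches `maxconnfails` and that the
counter restarts afterwards.

```python
from dataclasses import dataclass


@dataclass
class ConnTracker:
    maxconnfails: int = 5        # documented default
    coolofftime: float = 15.0    # documented default, in seconds
    fails: int = 0               # consecutive failures seen so far
    cooloff_until: float = 0.0   # time before which no reconnect is tried

    def may_connect(self, now: float) -> bool:
        """Return True if a reconnect attempt is allowed at time `now`."""
        return now >= self.cooloff_until

    def record_failure(self, now: float) -> None:
        self.fails += 1
        if self.fails >= self.maxconnfails:
            # Assumption: the counter restarts when the cooldown begins.
            self.fails = 0
            self.cooloff_until = now + self.coolofftime

    def record_success(self) -> None:
        # A successful connection clears both the counter and the cooldown.
        self.fails = 0
        self.cooloff_until = 0.0
```

Passing the current time in as an argument keeps the sketch deterministic; a
real loop would supply a monotonic clock value instead.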

*fsdmode*='0|1|2'::
Optional. Defines the behavior when no suitable primary UPS driver is found
after `noprimarytime` has elapsed. Defaults to 0.

- `0`: *Do not demote the last primary, but mark its data as stale.* This is
similar to how a regular UPS driver would behave when it loses its connection to
the target UPS device. linkman:upsmon[8] will act on the last known (online or
not) status, and decide itself whether that UPS should be considered critical.

- `1`: *Demote the primary, raise `ALARM`, and mark the data as stale after an
additional few seconds have elapsed (ensuring full propagation).* This will
cause linkman:upsmon[8] to detect that a device previously in an alarm state has
lost its connection, consider the UPS driver critical, and possibly trigger a
forced shutdown (`FSD`) due to depletion of `MINSUPPLIES`.

- `2`: *Demote the primary, raise `ALARM`, and immediately set `FSD`.* This will
set `FSD` from the driver side and preempt linkman:upsmon[8] from raising it
itself. This mode is for setups where immediate shutdown is warranted,
regardless of anything else, and `FSD` must reach the clients as fast as
possible.
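
The three modes can be summarized as a small decision function. The Python
sketch below is only a restatement of the list above under stated assumptions:
the function name and the action labels are hypothetical, and "after
`noprimarytime` has elapsed" is read as "at or beyond `noprimarytime` seconds
without a primary".

```python
def fallback_actions(fsdmode: int, secs_without_primary: float,
                     noprimarytime: float = 15.0) -> set[str]:
    """Return the (illustrative) actions taken once no primary is available."""
    if secs_without_primary < noprimarytime:
        return set()  # still within the grace period, nothing to do yet
    if fsdmode == 0:
        return {"keep_primary", "stale"}
    if fsdmode == 1:
        # Staleness is declared a few seconds later so that ALARM
        # has time to propagate to the clients first.
        return {"demote", "alarm", "stale_after_delay"}
    if fsdmode == 2:
        return {"demote", "alarm", "fsd"}
    raise ValueError(f"unsupported fsdmode: {fsdmode}")
```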

*checkruntime*='0|1|2|3'::
Optional. Controls how `battery.runtime` values are used to break ties between
non-fully-online UPS devices **at priority 3 or lower**. Has no effect on
initial priority selection or when `strictfiltering` is enabled. Defaults to 1.

- `0`: *Disabled.* No runtime comparison is done. The first candidate with the
best priority is selected according to the order of the port argument.

- `1`: *Compare `battery.runtime`.* The UPS with the higher value is preferred.
If the value is missing or invalid, the UPS cannot win the tie-break.

- `2`: *Compare `battery.runtime.low`.* The UPS with the higher value is
preferred. If the value is missing or invalid, the UPS cannot win the tie-break.
Review comment on lines +126 to +127:

jimklimov (Member), May 29, 2025:
Not sure how useful this one is for such comparisons, being a fixed value that may be (derived from) a user setting:

> Remaining battery runtime when UPS switches to LB (seconds)

It may also be irrelevant if e.g. upssched is used with upsmon for a custom shutdown strategy like "if OB took longer than 5 min to recover from" regardless of battery charge/runtime remaining.

But for some users their (own or device's) setting may be a measure of UPS reliability, so why not.

sebastiankuttnig (Contributor Author), May 29, 2025:
Oh, I understood this as: this is the seconds left (timer) before the UPS switches to LB. Want me to remove it?

jimklimov (Member), May 29, 2025:
I guess it can stay, just needed a second glance at usefulness. This can be a factor for some shutdown scenarios - "this UPS will give me a 5-minute FSD window to shut down gracefully (because that's when it becomes OB+LB)", although for specifically shutdowns with upsmon - more likely the "real" driver would be consulted as one of several supplies, than the failover one. But as you said, for single-UPS clients this may still be relevant.

Contributor Author:
Makes sense, thanks for the double check. 👍

- `3`: *Compare both variables strictly.* The UPS is preferred only if it has
both a higher `battery.runtime` and `battery.runtime.low` value. If either is
missing or invalid, the UPS cannot win the tie-break.
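
To make the four `checkruntime` modes concrete, here is a Python sketch of a
tie-break between a candidate and the currently preferred UPS of equal
priority. The function names are hypothetical and the driver itself is written
in C; the sketch additionally assumes that a valid value beats a missing one,
which the text above does not state explicitly.

```python
from typing import Optional


def _higher(cand: Optional[float], inc: Optional[float]) -> bool:
    if cand is None:
        return False  # documented: a missing/invalid value cannot win
    if inc is None:
        return True   # assumption: a valid value beats a missing one
    return cand > inc


def wins_tiebreak(mode: int, cand: dict, inc: dict) -> bool:
    """Return True if `cand` should replace the incumbent `inc`."""
    rt = _higher(cand.get("battery.runtime"), inc.get("battery.runtime"))
    low = _higher(cand.get("battery.runtime.low"),
                  inc.get("battery.runtime.low"))
    if mode == 0:
        return False          # disabled: port order decides
    if mode == 1:
        return rt
    if mode == 2:
        return low
    if mode == 3:
        return rt and low     # both values must be strictly higher
    raise ValueError(f"unsupported checkruntime: {mode}")
```

Both values are expected in seconds, with `None` (or an absent key) standing
in for a missing or invalid reading.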

*strictfiltering*='0|1'::
Optional. If set to 1, only UPS drivers matching the configured status filters
are considered for promotion to primary. If set to 0, the hard-coded default
logic is also considered when no status filters match (read more about this in
the section `PRIORITIES`). Defaults to 0.

*status_have_any*='OL,CHRG,...'::
Optional. If any of these comma-separated tokens are present in a UPS driver's
`ups.status`, it passes this status filtering criterion. Defaults to unset.

*status_have_all*='OL,CHRG,...'::
Optional. All listed comma-separated tokens must be present in `ups.status` for
the UPS driver to pass this status filtering criterion. Defaults to unset.

*status_nothave_any*='OB,OFF,...'::
Optional. If any of these comma-separated tokens are present in `ups.status`,
the UPS driver does not pass this status filtering criterion. Defaults to unset.

*status_nothave_all*='OB,LB,...'::
Optional. If all of these comma-separated tokens are present in `ups.status`,
the UPS driver does not pass this status filtering criterion. Defaults to unset.

NOTE: The `status_*` arguments are primarily intended to adjust the weighting of
UPS drivers, allowing some to be prioritized over others based on their status.
For example, a driver reporting `OL` might be preferred over one reporting
`ALARM OL`. While `strictfiltering` can be enabled, status filters are most
effective when used in combination with the default set of connectivity-based
`PRIORITIES`. For more details, see the respective section further below.

IMPLEMENTATION
--------------

The port argument in the linkman:ups.conf[5] should reference the local driver
sockets (or Windows named pipes) that the "real" UPS drivers are using. A basic
default setup with multiple drivers could look like this:

------
[realups]
driver = usbhid-ups
port = auto

[realups2]
driver = usbhid-ups
port = auto

[failover]
driver = failover
port = usbhid-ups-realups,usbhid-ups-realups2
------

Any linkman:upsmon[8] clients would be set to monitor the `failover` UPS.

The driver fully supports setting variables and performing instant commands on
the currently elected primary UPS driver. These are proxied, and end-to-end
tracking is also possible (linkman:upscmd[1] and linkman:upsrw[1] `-w`). You
may notice that some variables and commands are prefixed with `upstream.`; this
clearly separates the upstream commands from those of `failover` itself.

For your convenience, additional administrative commands are exposed to directly
influence and override the primary election process, e.g. for maintenance:

- `<socketname>.force.ignore [seconds]` prevents the specified UPS driver from
being selected as primary for the given duration, or permanently if a negative
value is used. A value of `0` resets this override and re-enables selection.

- `<socketname>.force.primary [seconds]` forces the specified UPS driver to be
treated with the highest priority for the given duration, or permanently if a
negative value is used. A value of `0` resets this override.

Calling either command without an argument has the same effect as passing `0`,
but only for that specific override - it does not affect the other.
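
The duration semantics shared by both override commands can be sketched in
Python as follows. The helper names are hypothetical and only model the rules
stated above: a positive value sets a timed override, a negative value a
permanent one, and `0` (or no argument) clears it.

```python
import math
from typing import Optional


def override_deadline(now: float, seconds: float = 0) -> Optional[float]:
    """Translate a command argument into an expiry time, or None if cleared."""
    if seconds == 0:
        return None          # reset: the override is removed
    if seconds < 0:
        return math.inf      # permanent until explicitly reset
    return now + seconds


def override_active(deadline: Optional[float], now: float) -> bool:
    return deadline is not None and now < deadline
```

Each override (`force.ignore` and `force.primary`) would keep its own deadline,
so resetting one leaves the other untouched.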

PRIORITIES
----------

As outlined above, primaries are dynamically elected based on their current
state and according to a strict set of user-influenceable priorities, which are:

- `0` (highest): UPS driver was forced to the top by administrative command.
- `1`: UPS driver has passed the user-defined status filters.
- `2`: UPS driver has fresh data and is online (in status `OL`).
- `3`: UPS driver has fresh data, but may not be fully online.
- `4` (lowest): UPS driver is alive, but may not have fresh data.

The UPS driver with the highest calculated priority is chosen as primary; ties
are resolved by the order of the socket names given within the `port` argument.

For the user-defined status filters, the following internal order is respected:

1. `status_nothave_any` (first)
2. `status_have_all`
3. `status_nothave_all`
4. `status_have_any` (last)
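
One possible reading of this evaluation order is sketched below in Python. The
function is hypothetical; it assumes that `ups.status` is a space-separated
token list and that a UPS driver passes only if every configured filter
passes, which is an interpretation rather than something stated above.

```python
def passes_status_filters(status: str, have_any=(), have_all=(),
                          nothave_any=(), nothave_all=()) -> bool:
    """Evaluate the four status filters in the documented order."""
    tokens = set(status.split())
    # 1. status_nothave_any: any listed token present means failure.
    if nothave_any and any(t in tokens for t in nothave_any):
        return False
    # 2. status_have_all: every listed token must be present.
    if have_all and not all(t in tokens for t in have_all):
        return False
    # 3. status_nothave_all: failure only if all listed tokens are present.
    if nothave_all and all(t in tokens for t in nothave_all):
        return False
    # 4. status_have_any: at least one listed token must be present.
    if have_any and not any(t in tokens for t in have_any):
        return False
    # Unset filters are skipped; callers should check separately whether
    # any filter is configured at all before granting priority 1.
    return True
```

With `have_all=("OL",)` and `nothave_any=("ALARM",)`, a driver reporting `OL`
passes while one reporting `ALARM OL` does not, matching the weighting example
given for the `status_*` arguments.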

If `strictfiltering` is enabled, priorities 2 to 4 are not applicable.

If no user-defined status filters are set, priority 1 is not applicable.

NOTE: The base requirement for any election is the UPS socket being connectable
and the UPS driver having published at least one full batch of data during its
lifetime. UPS drivers not fulfilling that requirement are always disqualified.
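
Putting the priority levels together, an election could look like the Python
sketch below. All names and parameters are hypothetical, the real driver is
written in C, and the sketch assumes that the base requirement is checked
before any forced override is honored.

```python
from typing import Optional


def priority(*, eligible: bool, forced: bool = False,
             filters_set: bool = False, passed_filters: bool = False,
             fresh: bool = False, online: bool = False,
             strictfiltering: bool = False) -> Optional[int]:
    """Return 0 (highest) to 4 (lowest), or None if disqualified."""
    if not eligible:
        return None   # socket not connectable or no full data batch yet
    if forced:
        return 0
    if filters_set and passed_filters:
        return 1
    if strictfiltering:
        return None   # priorities 2 to 4 are not applicable
    if fresh and online:
        return 2
    if fresh:
        return 3
    return 4


def elect(candidates: list) -> Optional[str]:
    """Pick the best (name, priority) pair; ties go to the earlier entry."""
    ranked = [(prio, index, name)
              for index, (name, prio) in enumerate(candidates)
              if prio is not None]
    return min(ranked)[2] if ranked else None
```

The `candidates` list is expected in the order of the `port` argument, so the
index in the sort key reproduces the documented tie-break by port order.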

RATIONALE
---------

In complex power environments, presenting a single, consistent source of UPS
information to linkman:upsmon[8] is sometimes preferable to monitoring multiple
independent drivers directly. The `failover` driver serves as a bridge, allowing
linkman:upsmon[8] to make decisions based on the most suitable available data,
without having to interpret conflicting inputs or degraded sources.

Originally designed for use cases such as dual-PSU systems or redundant
communication paths to a single UPS, `failover` also supports more advanced
setups - for example, when multiple UPSes feed a shared downstream load (via
STS/ATS switches), or when drivers vary in reliability. In these cases, the
driver can be combined with external logic or scripting to dynamically adjust
primary selection and facilitate graceful degradation. Such setups may also
benefit from further integration with the `clone` family of drivers, such as
linkman:clone[8] or linkman:clone-outlet[8], for greater granularity and
monitoring control down to the outlet level.

Additionally, in more niche scenarios, some third-party NUT integrations or
graphical interfaces may be limited to monitoring a single UPS device. In such
cases, `failover` can help by exposing only the most relevant or
highest-priority data source, allowing those tools to operate within their
constraints without missing critical information.

Ultimately, this driver enables more nuanced power monitoring and control than
binary online/offline logic alone, allowing administrators to respond to
degraded conditions early - before they escalate into critical events or require
linkman:upsmon[8] to take action.

LIMITATIONS
-----------

When using `failover` for redundancy between multiple UPS drivers connected to
the same underlying UPS device, data is not multiplexed between the drivers. As
a result, some data points may be available in some drivers but not in others.

For `checkruntime` considerations, the unit of both `battery.runtime` and
`battery.runtime.low` is assumed to be **seconds**. UPS drivers that report
these values using different units are considered non-compliant with the NUT
variable standards and should be reported to the NUT developers as faulty.

Review comment:

jimklimov (Member), May 29, 2025:
General note for posterity: not all NUT drivers provide a battery.runtime, some might only have a battery.charge (or neither estimate/measurement).

For the purpose here, comparing charges is probably rather useless (unless the UPSes have similar capacity so runtimes would happen to compare similarly); but it may be productive to eventually focus on generalizing the runtimecal fallback, for these time numbers to be available in all/most drivers => #2420

AUTHOR
------

Sebastian Kuttnig <[email protected]>

SEE ALSO
--------

linkman:upscmd[1],
linkman:upsrw[1],
linkman:ups.conf[5],
linkman:upsc[8],
linkman:upsmon[8],
linkman:nutupsdrv[8],
linkman:clone[8],
linkman:clone-outlet[8]

Internet Resources:
~~~~~~~~~~~~~~~~~~~

The NUT (Network UPS Tools) home page: https://www.networkupstools.org/
37 changes: 27 additions & 10 deletions docs/man/nut-driver-enumerator.txt
@@ -19,16 +19,24 @@ SYNOPSIS
DESCRIPTION
-----------

*nut-driver-enumerator.sh* implements the set-up and querying of the
mapping between NUT driver configuration sections for each individual
monitored device, and the operating system service management framework
service instances into which such drivers are wrapped for independent
execution and management (on platforms where NUT currently supports
this integration -- currently this covers Linux distributions with
systemd and systems derived from Solaris 10 codebase, including
proprietary Sun/Oracle Solaris and numerous open-source illumos
distributions with SMF). It may be not installed in packaging for
other operating systems.
The *nut-driver-enumerator.sh* (also known as "NDE") script implements the
set-up and querying of the mapping between NUT driver configuration sections
for each individual monitored device, and the service instances of an
operating system service management framework (on platforms where NUT already
supports this integration -- currently this covers Linux distributions with
systemd and systems derived from Solaris 10 codebase, including proprietary
Sun/Oracle Solaris and numerous open-source illumos distributions with SMF),
into which such drivers are wrapped for independent execution and management.
It may not be installed in packaging for other operating systems.

With each NUT driver represented as a separate service instance, dependencies
can be defined (e.g. networked drivers must start after networking becomes
available in the OS, but USB/Serial drivers should not wait for that), and they
can fail or be brought into maintenance independently (unlike a monolithic
service based on linkman:upsdrvctl[8] requiring everything configured to be
started). For a few special drivers like linkman:dummy-ups[8], linkman:clone[8],
linkman:clone-outlet[8], and linkman:failover[8] this may also involve a
dependency between service instances of different NUT drivers themselves.

This script provides a uniform interface for further NUT tools
such as linkman:upsdrvsvcctl[8] to implement their logic as
@@ -42,6 +50,15 @@ hides is the difference of rules for valid service instance names
in various frameworks, as well as system tools and naming patterns
involved.

Depending on the platform, the script may also be wrapped by different service
unit types to run automatically (e.g. upon system start-up, or regularly to
pick up changes of linkman:ups.conf[5] soon after it is edited, or integrated
with a file system monitor to be triggered when the configuration is changed).
Some of these modes make sense for use-cases with a rarely (if ever) changing
population of power devices, e.g. a home or small-office UPS monitored the same
way for years at a time; others can help automate a data-center monitoring
system where device deployments (or discovery) can be much more dynamic.

COMMANDS
--------
