Skip to content

failover.c - UPS Failover Driver #2962

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 44 commits into from
May 30, 2025

Conversation

sebastiankuttnig
Copy link
Contributor

@sebastiankuttnig sebastiankuttnig commented May 19, 2025

I'd like to present my failover driver, which includes various configurable options that I believe will be useful.

All code paths have been manually tested and verified using Valgrind, AddressSanitizer (ASan), UndefinedBehaviorSanitizer (UBSan), and LeakSanitizer (LSan). The code builds cleanly and without warnings on all systems available to me, passes make distcheck as well as compilation with -std=c89. For more details, please refer to the source code and the man page.

Significant thought and effort went into this, so I’m happy to discuss any necessary changes - ideally before direct patching. Naturally, I’ll also address any warnings or issues flagged by the CI farm to the best of my ability.

MAINTAINER NOTE: Also includes NDE script enhancements to define driver-on-driver dependencies (dummy, clone, failover).

@sebastiankuttnig

This comment was marked as resolved.

@jimklimov jimklimov added this to the 2.8.4 milestone May 20, 2025
@jimklimov
Copy link
Member

jimklimov commented May 22, 2025

... as well as compilation with -std=c89

Wow. IIRC I never managed to get that really passing for NUT CI farm, as many OS headers are not compatible with the stricter ansi, c89 or c90 (alias) standards, or in fact any "c" standard version when I last tried, so updated the developer docs to the tune of "we hope for C89 but in practice require/check GNU89 and newer".

Which system did you build on? Were there additional defines used to enable this or that code path in system headers, or was it really strict C89? Did all of NUT pass with -std=c89 there, or just this driver source was so well-behaved? :)

I suppose there is value to have a test environment in the NUT CI farm to ensure, for benefit of older or embedded systems, that our efforts to keep code features/markup similar to C89 (and e.g. avoid shorter // comments and declarations in the midst of code) are not in vain and are functionally sufficient where that matters.

UPDATE: Haha, I stand corrected. At least with whatever the CI and configure scripts do, on Ubuntu 22.04 the CFLAGS='-std=c89' CXXFLAGS='-std=cxx98' ./ci_build.sh and CFLAGS='-ansi' CXXFLAGS='-ansi' ./ci_build.sh runs just passed (almost, down to a few comment markups). So I have passed some decent milestone that I've even missed :)

FWIW, in those runs, configure report at the end is like this (notably -Wall for compiler's preferred "reasonable" picking of options is in place):

NUT Compiler settings:

NOTE: Settings for C and C++ compilers are adjusted for debugging (and minimal optimizations)

  • CC : /usr/lib/ccache/gcc
  • CFLAGS : -isystem /usr/local/include -ansi -Wno-reserved-identifier -fdiagnostics-color=always -Wno-unknown-warning-option -Wall -Wsign-compare -Werror -g3 -gdwarf-2
  • CXX : /usr/lib/ccache/g++
  • CXXFLAGS : -isystem /usr/local/include -ansi -Wno-reserved-identifier -fdiagnostics-color=always -Wno-unknown-warning-option -Wall -Wextra -Werror -O0 -g3 -gdwarf-2
  • CPP : gcc -E
  • CPPFLAGS : -O0
  • LD : /usr/bin/ld -m elf_x86_64
  • LDFLAGS :
  • CONFIG_FLAGS : --enable-Wcolor --enable-configure-debug --enable-warnings --enable-Werror --enable-keep_nut_report_feature --with-all=auto --with-cgi=auto --with-serial=auto --with-dev=auto --with-nut_monitor=auto --with-pynut=auto --disable-force-nut-version-header --enable-check-NIT --enable-maintainer-mode --with-doc=skip PKG_CONFIG_PATH=/usr/local/lib/x86_64-linux-gnu/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/share/pkgconfig:/usr/lib/x86_64-linux-gnu/pkgconfig:/usr/lib/pkgconfig:/usr/share/pkgconfig --disable-silent-rules --with-unmapped-data-points --with-debuginfo=auto CC=/usr/lib/ccache/gcc CXX=/usr/lib/ccache/g++ CPP='gcc -E'

@sebastiankuttnig
Copy link
Contributor Author

sebastiankuttnig commented May 22, 2025

... as well as compilation with -std=c89

Wow. IIRC I never managed to get that really passing for NUT CI farm, as many OS headers are not compatible with the stricter ansi, c89 or c90 (alias) standards, or in fact any "c" standard version when I last tried, so updated the developer docs to the tune of "we hope for C89 but in practice require/check GNU89 and newer".

Which system did you build on? Were there additional defines used to enable this or that code path in system headers, or was it really strict C89? Did all of NUT pass with -std=c89 there, or just this driver source was so well-behaved? :)

I suppose there is value to have a test environment in the NUT CI farm to ensure, for benefit of older or embedded systems, that our efforts to keep code features/markup similar to C89 (and e.g. avoid shorter // comments and declarations in the midst of code) are not in vain and are functionally sufficient where that matters.

Sorry, I should've clarified some more, failover.c and immediate dependencies (make failover) built under -std=c89 without warnings or catastrophic failure on my system. -pedantic would raise warnings, as to be expected, also the larger NUT build trips up later on // and other missing feature related problems.

NUT Build/Target system info:
-----------------------------

* Compact version of C compiler:        gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
* Compact version of C++ compiler:      g++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
* Compact version of C preprocessor:    gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
* host env spec we run on:      x86_64-pc-linux-gnu
* host env spec we built on:    x86_64-pc-linux-gnu
* host env spec we built for:   x86_64-pc-linux-gnu
* host OS short spec we run on: x86_64-linux-gnu
* host OS short spec we built on:       x86_64-linux-gnu
* host OS short spec we built for:      x86_64-linux-gnu
* host multiarch spec we build for (as suggested by compiler being used):       x86_64-linux-gnu
* ccache namespace tag (if ccache is used and new enough):      nut:x86_64-linux-gnu
NUT Compiler settings:
----------------------

* CC            : gcc
* CFLAGS        : -isystem /usr/local/include  -std=c89 -Wno-reserved-identifier -Wno-unknown-warning-option -Wall -Wsign-compare -Wno-error  -Wextra -Werror
* CXX           : g++
* CXXFLAGS      : -isystem /usr/local/include -g -O2 -Wno-reserved-identifier -Wno-unknown-warning-option -std=gnu++11 -Wall -Wextra -Wno-error
* CPP           : gcc -E
* CPPFLAGS      : 
* LD            : /usr/bin/ld -m elf_x86_64
* LDFLAGS       : 
* CONFIG_FLAGS  : CFLAGS='-std=c89 -Wextra -Werror'

@sebastiankuttnig
Copy link
Contributor Author

UPDATE: Haha, I stand corrected. At least with whatever the CI and configure scripts do, on Ubuntu 22.04 the CFLAGS='-std=c89' CXXFLAGS='-std=cxx98' ./ci_build.sh and CFLAGS='-ansi' CXXFLAGS='-ansi' ./ci_build.sh runs just passed (almost, down to a few comment markups). So I have passed some decent milestone that I've even missed :)

That is actually really impressive for a project of this size and portability and maybe even worthy of a NEWS entry! 😉

@jimklimov
Copy link
Member

jimklimov commented May 22, 2025

UPDATE2: tried git clean -fdX && ./autogen.sh && ./configure CFLAGS='-ansi' CXXFLAGS='-ansi' --with-all=auto --enable-Werror CC=clang CXX=clang++ && make -ksj, and found that clang is much less forgiving than gcc used by default...

It warns about many non-C89/non-ANSI features in our code at least, some of which are relatively easy to fix and others may be detected in configure.ac and ifdef'ed around in code:

  • "variadic macros", flexible array members", "designated initializers" all "are a C99 feature";
  • "commas at the end of enumerator lists are a C99-specific feature";
  • "'_Bool' is a C99 extension";
  • "initializer for aggregate is not a compile-time constant" (in str.c);
  • "'long long' is an extension when C99 mode is not enabled"

... and others' code:

  • "/usr/include/libusb-1.0/libusb.h:811:62: warning: zero size arrays are an extension";
  • "compound literals are a C99-specific feature" (getting into our code from /usr/include/systemd/sd-bus.h definition of SD_BUS_ERROR_NULL)

... and numerous for clause with comma complaints like:

nutdrv_qx.c:1400:15: warning: possible misuse of comma operator here [-Wcomma]
                        for (di = 0, si = 2; si < size; si += 2) {
                                   ^
nutdrv_qx.c:1400:9: note: cast expression to void to silence warning
                        for (di = 0, si = 2; si < size; si += 2) {
                             ^~~~~~
                             (void)( )

....and probably more that I've missed (not all fit in the scrollback buffer); so I guess the documented stance remains valid.

A build with git clean -fdX && ./autogen.sh && ./configure --enable-Werror CFLAGS='-std=gnu89' CXXFLAGS='-std=gnu++98' --with-all=auto CC=clang CXX=clang++ && make -ksj is much less picky, its complaints did fit into the scrollback buffer :)

It may make sense to add regular CI runs for non-regression of C89 markup with toolkits that do work though, to help maintain the portability where we have it already. But anyhow, this theme is an off-topic for this PR discussion ;)

@jimklimov
Copy link
Member

jimklimov commented May 22, 2025

So... according to https://ci.networkupstools.org/job/nut/job/nut/job/master/lastBuild/artifact/.ci.slowBuildStages-list.txt we do run 21 hits for: GNU C89 standard builds with non-fatal warnings and GCC toolkits, without distcheck and docs (must pass) regularly, although only in one-off checks for stricter warnings https://ci.networkupstools.org/job/nut/job/nut/job/fightwarn/lastBuild/artifact/.ci.slowBuildStages-list.txt do we run 122 hits for: Strict C standard builds on non-Windows platforms, without distcheck and docs (allowed to fail) (covering c99, c11, c17 and respective cxx versions, but all with GCC it seems).

  • On the same Ubuntu worker as in earlier posts, git clean -fdX && ./autogen.sh && ./configure --enable-Werror CFLAGS='-std=c99' CXXFLAGS='-std=c++99' --with-all=auto CC=clang CXX=clang++ && make -ksj did not hiccup.

Oddly enough, the runs which do happen did not complain about comments syntax which I did hit locally; maybe those pipeline settings are older and too relaxed :)

@sebastiankuttnig
Copy link
Contributor Author

sebastiankuttnig commented May 22, 2025

So... according to https://ci.networkupstools.org/job/nut/job/nut/job/master/lastBuild/artifact/.ci.slowBuildStages-list.txt we do run 21 hits for: GNU C89 standard builds with non-fatal warnings and GCC toolkits, without distcheck and docs (must pass) regularly, although only in one-off checks for stricter warnings https://ci.networkupstools.org/job/nut/job/nut/job/fightwarn/lastBuild/artifact/.ci.slowBuildStages-list.txt do we run 122 hits for: Strict C standard builds on non-Windows platforms, without distcheck and docs (allowed to fail) (covering c99, c11, c17 and respective cxx versions, but all with GCC it seems).

Oddly enough, the runs which do happen did not complain about comments syntax which I did hit locally; maybe those pipeline settings are older and too relaxed :)

Compilers there seem more forgiving than me as far as // is concerned in C... 😆
It's definitely always interesting to see what pops up under stricter or older standards though...
Thanks for the deep-dive into the CI specifics, although I can't fully wrap my head around it just yet, I must admit.

Copy link
Member

@jimklimov jimklimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable, but there are some nits about documentation (maybe not all are relevant upon reading the code, which I started diligently but had to rush the end of...)

@sebastiankuttnig
Copy link
Contributor Author

Thanks for the detailed feedback, I appreciate it and hope to get into it in detail this weekend.
Until then we should hopefully also see if the CI turns up anything else that also needs amending.

@jimklimov
Copy link
Member

Ok, I'll post a couple more commits until then :)

@sebastiankuttnig
Copy link
Contributor Author

sebastiankuttnig commented May 22, 2025

Ok, I'll post a couple more commits until then :)

Sure thing, and appreciate the feedback, tried to address everything until I can touch the code. Fwiw, if you're unclear about anything while doing that, in doubt follow and trust the code paths... the documentation came later and has some rough edges still.... especially visible now through your fresh eyes 😉

Reading more into your feedback and your initial thought process of this being a non-multiplexed failover driver for same-UPS connections (with different drivers), I wonder if it might actually make sense here to do a mode=multiplex (and default mode=failover) right within this driver (for such many drivers to one single UPS connections) since we already have that extensive base for keeping track of different states, variables, commands and a lot of code would probably overlap? Probably not for the first version of this driver, but as a future perspective...

EDIT: With today's fresher perspective, I guess with the driver already at 2.5k lines as it is, perhaps better split into a multiplexor driver after all - might be easier to maintain also, not having to factor in two operational modes for each change.

@AppVeyorBot
Copy link

Signed-off-by: Sebastian Kuttnig <[email protected]>
…shot connections

This change removes the dependency on splitting socket names into UPS and driver names.
We now rely on the user to provide valid sockets: if it connects, it connects.

This enables the use of full paths in the port argument also, improving flexibility and
enhancing scriptability.

Signed-off-by: Sebastian Kuttnig <[email protected]>
@sebastiankuttnig
Copy link
Contributor Author

sebastiankuttnig commented May 28, 2025

If the CI permits it, this would be complete for 0.01 from my side. We now use the new _sockfn() function, the socket name splitting dependency was fully removed. The code now also uses the safe conversion helpers str_to_* and does argument validation for all numeric ranges, falling back to defaults if conversions or validations fail. This also includes instcmd() where STAT_INSTCMD_CONVERSION_FAILED is returned on conversion failure (and logs are recorded at LOG_INSTCMD_CONVERSION_FAILED).

I decided to push the runtime tie-breaker at OB levels to a future 0.02, as UPS may report runtime in seconds or minutes and I need to think about the logic some more. I'd like to off-load this into a separate PR, as this one is quite extensive already. 😄

Of course, happy to fine-tune further on the existing documentation or code, should something else surface.

@jimklimov : Would you like to add a docs/NEWS entry for your enumerator changes?
I'm not sure how what you added fully works, so maybe that'd be helpful for others also.

@sebastiankuttnig sebastiankuttnig marked this pull request as ready for review May 28, 2025 05:04
@jimklimov
Copy link
Member

UPS may report runtime in seconds or minutes

docs/nut-names.txt says there should be seconds. If some driver does not convert correctly, that's a bug (or misconfiguration, in case there are firmware-specific toggles to factor something x10, x60, etc. but the user did not enable them).

add a docs/NEWS entry for your enumerator changes

Good point, on it.

@AppVeyorBot
Copy link

@sebastiankuttnig
Copy link
Contributor Author

UPS may report runtime in seconds or minutes

docs/nut-names.txt says there should be seconds. If some driver does not convert correctly, that's a bug (or misconfiguration, in case there are firmware-specific toggles to factor something x10, x60, etc. but the user did not enable them).

Thanks, that simplifies the logic, I've added and tested the optional checkruntime argument:

*checkruntime*='0|1|2|3'::
Optional. Controls how `battery.runtime` values are used to break ties between
non-fully-online UPS devices **at priority 3 or lower**. Has no effect on
initial priority selection or when `strictfiltering` is enabled. Defaults to 1.

- `0`: *Disabled.* No runtime comparison is done. The first candidate with the
best priority is selected according to the order of the port argument.

- `1`: *Compare `battery.runtime`.* The UPS with the higher value is preferred.
If the value is missing or invalid, the UPS cannot win the tie-break.

- `2`: *Compare `battery.runtime.low`.* The UPS with the higher value is
preferred. If the value is missing or invalid, the UPS cannot win the tie-break.

- `3`: *Compare both variables strictly.* The UPS is preferred only if it has
both a higher `battery.runtime` and `battery.runtime.low` value. If either is
missing or invalid, the UPS cannot win the tie-break.

As well as the following information to the LIMITATIONS:

For `checkruntime` considerations, the unit of both `battery.runtime` and
`battery.runtime.low` is assumed to be **seconds**. UPS drivers that report
these values using different units are considered non-compliant with the NUT
variable standards and should be reported to the NUT developers as faulty.

I hope that should complement nicely the configurational toggles that are already in place. 🙂

These changes avoid constant lookups in the variable table for better performance.

Signed-off-by: Sebastian Kuttnig <[email protected]>
…ference

Adds checks to ensure ups->status is not NULL before use, preventing a possible NULL pointer dereference.

Signed-off-by: Sebastian Kuttnig <[email protected]>
@AppVeyorBot
Copy link

@sebastiankuttnig
Copy link
Contributor Author

sebastiankuttnig commented May 29, 2025

Good to merge from my side if nothing else surfaces, CI seems content. 🙂
Appreciate the collaborative effort in refining the initial version to where it is now!

`battery.runtime.low` is assumed to be **seconds**. UPS drivers that report
these values using different units are considered non-compliant with the NUT
variable standards and should be reported to the NUT developers as faulty.

Copy link
Member

@jimklimov jimklimov May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General note for posterity: not all NUT drivers provide a battery.runtime, some might only have a battery.charge (or neither estimate/measurement).

For the purpose here, comparing charges is probably rather useless (unless the UPSes have similar capacity so runtimes would happen to compare similarly); but it may be productive to eventually focus on generalizing the runtimecal fallback, for these time numbers to be available in all/most drivers => #2420

Comment on lines +126 to +127
- `2`: *Compare `battery.runtime.low`.* The UPS with the higher value is
preferred. If the value is missing or invalid, the UPS cannot win the tie-break.
Copy link
Member

@jimklimov jimklimov May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how useful this one is for such comparisons, being a fixed value that may be (derived from) a user setting:

Remaining battery runtime when UPS switches to LB (seconds)

It may also be irrelevant if e.g. upssched is used with upsmon for a custom shutdown strategy like "if OB took longer than 5 min to recover from" regardless of battery charge/runtime remaining.

But for some users their (own or device's) setting may be a measure of UPS reliability, so why not.

Copy link
Contributor Author

@sebastiankuttnig sebastiankuttnig May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I understood this as: this is the seconds left (timer) before the UPS switches to LB. Want me to remove it?

Copy link
Member

@jimklimov jimklimov May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it can stay, just needed a second glance at usefulness. This can be a factor for some shutdown scenarios - "this UPS will give me a 5-minute FSD window to shut down gracefully (because that's when it becomes OB+LB)", although for specifically shutdowns with upsmon - more likely the "real" driver would be consulted as one of several supplies, than the failover one. But as you said, for single-UPS clients this may still be relevant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks for the double check. 👍

@jimklimov jimklimov added nut-driver-enumerator (NDE) nut-driver-enumerator (NDE) automates service management integration for NUT driver instances etc. documentation labels May 29, 2025
@jimklimov jimklimov merged commit f260d28 into networkupstools:master May 30, 2025
30 checks passed
@jimklimov
Copy link
Member

Thanks again for the hard work and a lot of patience :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation feature nut-driver-enumerator (NDE) nut-driver-enumerator (NDE) automates service management integration for NUT driver instances etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants