Dev: sbd: Improve the process of leveraging maintenance mode (#1950)

liangxin1300 · web-flow · commit 0c3bd0aa0103 · 2025-11-24T17:15:56.000+08:00
## Problem
#1744 leverages maintenance mode when the cluster needs to be restarted, but some problems remain when resources are running:

#### Configuration is changed before the hint is shown, which might lead to an inconsistent state
```
# crm sbd configure watchdog-timeout=45
INFO: No 'msgwait-timeout=' specified in the command, use 2*watchdog timeout: 90
INFO: Configuring disk-based SBD
INFO: Initializing SBD device /dev/sda5
INFO: Update SBD_WATCHDOG_DEV in /etc/sysconfig/sbd: /dev/watchdog0
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: Update SBD_DELAY_START in /etc/sysconfig/sbd: 131
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
WARNING: "stonith-timeout" in crm_config is set to 119, it was 71
INFO: Sync directory /etc/systemd/system/sbd.service.d to sle16-2
WARNING: Resource is running, need to restart cluster service manually on each node
WARNING: Or, run with `crm -F` or `--force` option, the `sbd` subcommand will leverage maintenance mode for any changes that require restarting sbd.service
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting

# crm sbd purge
INFO: Stop sbd resource 'stonith-sbd'(stonith:fence_sbd)
INFO: Remove sbd resource 'stonith-sbd'
INFO: Disable sbd.service on node sle16-1
INFO: Disable sbd.service on node sle16-2
INFO: Move /etc/sysconfig/sbd to /etc/sysconfig/sbd.bak on all nodes
INFO: Delete cluster property "stonith-timeout" in crm_config
INFO: Delete cluster property "priority-fencing-delay" in crm_config
WARNING: "stonith-enabled" in crm_config is set to false, it was true
WARNING: Resource is running, need to restart cluster service manually on each node
WARNING: Or, run with `crm -F` or `--force` option, the `sbd` subcommand will leverage maintenance mode for any changes that require restarting sbd.service
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting
```

#### Pacemaker exits fatally when diskless sbd is added to a running cluster with resources running
```
# crm cluster init sbd -S -y
INFO: Loading "default" profile from /etc/crm/profiles.yml
INFO: Loading "knet-default" profile from /etc/crm/profiles.yml
INFO: Configuring diskless SBD
WARNING: Diskless SBD requires cluster with three or more nodes. If you want to use diskless SBD for 2-node cluster, should be combined with QDevice.
INFO: Update SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd: 15
INFO: Update SBD_WATCHDOG_DEV in /etc/sysconfig/sbd: /dev/watchdog0
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: Enable sbd.service on node sle16-1
INFO: Enable sbd.service on node sle16-2
WARNING: Resource is running, need to restart cluster service manually on each node
WARNING: Or, run with `crm -F` or `--force` option, the `sbd` subcommand will leverage maintenance mode for any changes that require restarting sbd.service
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting
WARNING: "stonith-watchdog-timeout" in crm_config is set to 30, it was 0

Broadcast message from systemd-journald@sle16-1 (Thu 2025-10-23 10:54:11 CEST):

pacemaker-controld[5674]: emerg: Shutting down: stonith-watchdog-timeout configured (30) but SBD not active

Message from syslogd@sle16-1 at Oct 23 10:54:11 ...
 pacemaker-controld[5674]: emerg: Shutting down: stonith-watchdog-timeout configured (30) but SBD not active
ERROR: cluster.init: Failed to run 'crm configure property stonith-watchdog-timeout=30': ERROR: Failed to run 'crm_mon -1rR': crm_mon: Connection to cluster failed: Connection refused
```

## Solution
- Drop the function `restart_cluster_if_possible`
- Introduce a new function `utils.able_to_restart_cluster` to check whether the cluster can be restarted. Call it before changing any configuration.
- Leverage maintenance mode in the `sbd device remove` and `sbd purge` commands

#### Add sbd via the sbd stage while resources are running
```
# crm cluster init sbd -S -y
INFO: Loading "default" profile from /etc/crm/profiles.yml
INFO: Loading "knet-default" profile from /etc/crm/profiles.yml
WARNING: Please stop all running resources and try again
WARNING: Or use 'crm -F/--force' option to leverage maintenance mode
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting
INFO: Aborting the configuration change attempt
INFO: Done (log saved to /var/log/crmsh/crmsh.log on sle16-1)

# Leverage maintenance mode
# crm -F cluster init sbd -S -y
INFO: Loading "default" profile from /etc/crm/profiles.yml
INFO: Loading "knet-default" profile from /etc/crm/profiles.yml
INFO: Set cluster to maintenance mode
WARNING: "maintenance-mode" in crm_config is set to true, it was false
INFO: Configuring diskless SBD
WARNING: Diskless SBD requires cluster with three or more nodes. If you want to use diskless SBD for 2-node cluster, should be combined with QDevice.
INFO: Update SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd: 15
INFO: Update SBD_WATCHDOG_DEV in /etc/sysconfig/sbd: /dev/watchdog0
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: Enable sbd.service on node sle16-1
INFO: Enable sbd.service on node sle16-2
INFO: Restarting cluster service
INFO: BEGIN Waiting for cluster ...........
INFO: END Waiting for cluster
WARNING: "stonith-watchdog-timeout" in crm_config is set to 30, it was 0
WARNING: "stonith-enabled" in crm_config is set to true, it was false
INFO: Update SBD_DELAY_START in /etc/sysconfig/sbd: 41
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
WARNING: "stonith-timeout" in crm_config is set to 71, it was 60s
INFO: Set cluster from maintenance mode to normal
INFO: Delete cluster property "maintenance-mode" in crm_config
INFO: Done (log saved to /var/log/crmsh/crmsh.log on sle16-1)
```

#### Purge sbd while resources are running
```
# crm sbd purge
WARNING: Please stop all running resources and try again
WARNING: Or use 'crm -F/--force' option to leverage maintenance mode
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting
INFO: Aborting the configuration change attempt
```

#### Add device
```
# crm sbd device add /dev/sda6
INFO: Configured sbd devices: /dev/sda5
INFO: Append devices: /dev/sda6
WARNING: Please stop all running resources and try again
WARNING: Or use 'crm -F/--force' option to leverage maintenance mode
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting
INFO: Aborting the configuration change attempt
```

#### Remove device
```
# crm sbd device remove /dev/sda6
INFO: Configured sbd devices: /dev/sda5;/dev/sda6
INFO: Remove devices: /dev/sda6
WARNING: Please stop all running resources and try again
WARNING: Or use 'crm -F/--force' option to leverage maintenance mode
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting
INFO: Aborting the configuration change attempt
```

#### Configure sbd while DLM is running
```
# crm sbd configure watchdog-timeout=40
INFO: No 'msgwait-timeout=' specified in the command, use 2*watchdog timeout: 80
WARNING: Please stop all running resources and try again
WARNING: Or use 'crm -F/--force' option to leverage maintenance mode
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting
INFO: Aborting the configuration change attempt

# Leverage maintenance mode
# crm -F sbd configure watchdog-timeout=40
INFO: No 'msgwait-timeout=' specified in the command, use 2*watchdog timeout: 80
INFO: Set cluster to maintenance mode
WARNING: "maintenance-mode" in crm_config is set to true, it was false
WARNING: Please stop DLM related resources (gfs2-clone) and try again
INFO: Set cluster from maintenance mode to normal
INFO: Delete cluster property "maintenance-mode" in crm_config
```
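
#### Sketch of the check-before-change flow
The following is a minimal, standalone sketch of the pattern described above, not the crmsh implementation. `maintenance_mode`, `can_restart`, `change_sbd_config` and the `events` list are hypothetical stand-ins: the real code uses `utils.leverage_maintenance_mode()` and `utils.able_to_restart_cluster()` and queries the live cluster, while this sketch takes plain booleans and assumes the context manager only enters maintenance mode when `-F/--force` is given. It is meant to illustrate why an early `return` inside the `with` block still restores normal mode, as seen in the DLM log above.

```python
from contextlib import contextmanager
from typing import Iterator, List


@contextmanager
def maintenance_mode(force: bool, events: List[str]) -> Iterator[bool]:
    """Hypothetical stand-in for utils.leverage_maintenance_mode()."""
    if not force:
        # Assumption: without -F/--force nothing is changed, the body sees False.
        yield False
        return
    events.append("maintenance on")
    try:
        yield True
    finally:
        # Runs on normal exit, on early return, and on exceptions.
        events.append("maintenance off")


def can_restart(pacemaker_active: bool, resources_running: bool,
                in_maintenance_mode: bool, dlm_running: bool) -> bool:
    """Same branch order as able_to_restart_cluster, on plain booleans."""
    if not pacemaker_active:
        return True
    if not resources_running:
        return True
    if in_maintenance_mode:
        # DLM cannot survive the restart, even in maintenance mode.
        return not dlm_running
    return False


def change_sbd_config(force: bool, resources_running: bool,
                      dlm_running: bool, events: List[str]) -> bool:
    """Check first, then change the configuration, then restart."""
    with maintenance_mode(force, events) as enabled:
        # Pacemaker is assumed to be active (True) in this sketch.
        if not can_restart(True, resources_running, enabled, dlm_running):
            events.append("aborted")
            return False
        events.append("config updated")
        events.append("cluster restarted")
        return True
```

Because the check runs before any configuration is touched, an aborted attempt leaves `/etc/sysconfig/sbd` and the CIB properties as they were, which is the inconsistency the first problem case describes.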
diff --git a/crmsh/sbd.py b/crmsh/sbd.py
@@ -593,22 +593,6 @@ def enable_sbd_service(self):
                 logger.info("Enable %s on node %s", constants.SBD_SERVICE, node)
                 service_manager.enable_service(constants.SBD_SERVICE, node)
 
-    @staticmethod
-    def restart_cluster_if_possible(with_maintenance_mode=False):
-        if not ServiceManager().service_is_active(constants.PCMK_SERVICE):
-            return
-        if not xmlutil.CrmMonXmlParser().is_non_stonith_resource_running():
-            bootstrap.restart_cluster()
-        elif with_maintenance_mode:
-            if not utils.is_dlm_running():
-                bootstrap.restart_cluster()
-            else:
-                logger.warning("Resource is running, need to restart cluster service manually on each node")
-        else:
-            logger.warning("Resource is running, need to restart cluster service manually on each node")
-            logger.warning("Or, run with `crm -F` or `--force` option, the `sbd` subcommand will leverage maintenance mode for any changes that require restarting sbd.service")
-            logger.warning("Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting")
-
     def configure_sbd(self):
         '''
         Configure fence_sbd resource and related properties
@@ -746,6 +730,9 @@ def init_and_deploy_sbd(self, restart_first=False):
             self._load_attributes_from_bootstrap()
 
         with utils.leverage_maintenance_mode() as enabled:
+            if not utils.able_to_restart_cluster(enabled):
+                return
+
             self.initialize_sbd()
             self.update_configuration()
             self.enable_sbd_service()
@@ -760,7 +747,7 @@ def init_and_deploy_sbd(self, restart_first=False):
                 restart_cluster_first = restart_first or \
                         (self.diskless_sbd and not ServiceManager().service_is_active(constants.SBD_SERVICE))
                 if restart_cluster_first:
-                    SBDManager.restart_cluster_if_possible(with_maintenance_mode=enabled)
+                    bootstrap.restart_cluster()
 
                 self.configure_sbd()
                 bootstrap.adjust_properties(with_sbd=True)
@@ -770,7 +757,7 @@ def init_and_deploy_sbd(self, restart_first=False):
                 # This helps prevent unexpected issues, such as nodes being fenced
                 # due to large SBD_WATCHDOG_TIMEOUT values combined with smaller timeouts.
                 if not restart_cluster_first:
-                    SBDManager.restart_cluster_if_possible(with_maintenance_mode=enabled)
+                    bootstrap.restart_cluster()
 
     def join_sbd(self, remote_user, peer_host):
         '''
diff --git a/crmsh/ui_sbd.py b/crmsh/ui_sbd.py
@@ -517,8 +517,11 @@ def _device_remove(self, devices_to_remove: typing.List[str]):
 
         logger.info("Remove devices: %s", ';'.join(devices_to_remove))
         update_dict = {"SBD_DEVICE": ";".join(left_device_list)}
-        sbd.SBDManager.update_sbd_configuration(update_dict)
-        sbd.SBDManager.restart_cluster_if_possible()
+        with utils.leverage_maintenance_mode() as enabled:
+            if not utils.able_to_restart_cluster(enabled):
+                return
+            sbd.SBDManager.update_sbd_configuration(update_dict)
+            bootstrap.restart_cluster()
 
     @command.completers_repeating(sbd_device_completer)
     def do_device(self, context, *args) -> bool:
@@ -601,22 +604,34 @@ def do_purge(self, context, *args) -> bool:
         if not self._service_is_active(constants.SBD_SERVICE):
             return False
 
+        purge_crashdump = False
+        if args:
+            if args[0] == "crashdump":
+                if not self._is_crashdump_configured():
+                    logger.error("SBD crashdump is not configured")
+                    return False
+                purge_crashdump = True
+            else:
+                logger.error("Invalid argument: %s", ' '.join(args))
+                logger.info("Usage: crm sbd purge [crashdump]")
+                return False
+
         utils.check_all_nodes_reachable("purging SBD")
 
-        if args and args[0] == "crashdump":
-            if not self._is_crashdump_configured():
-                logger.error("SBD crashdump is not configured")
+        with utils.leverage_maintenance_mode() as enabled:
+            if not utils.able_to_restart_cluster(enabled):
                 return False
-            self._set_crashdump_option(delete=True)
-            update_dict = self._set_crashdump_in_sysconfig(restore=True)
-            if update_dict:
-                sbd.SBDManager.update_sbd_configuration(update_dict)
-                sbd.SBDManager.restart_cluster_if_possible()
-            return True
 
-        sbd.purge_sbd_from_cluster()
-        sbd.SBDManager.restart_cluster_if_possible()
-        return True
+            if purge_crashdump:
+                self._set_crashdump_option(delete=True)
+                update_dict = self._set_crashdump_in_sysconfig(restore=True)
+                if update_dict:
+                    sbd.SBDManager.update_sbd_configuration(update_dict)
+            else:
+                sbd.purge_sbd_from_cluster()
+
+            bootstrap.restart_cluster()
+            return True
 
     def _print_sbd_type(self):
         if not self.service_manager.service_is_active(constants.SBD_SERVICE):
diff --git a/crmsh/utils.py b/crmsh/utils.py
@@ -3306,4 +3306,32 @@ def validate_and_get_reachable_nodes(
             member_list.remove(node)
 
     return member_list + remote_list
+
+
+def able_to_restart_cluster(in_maintenance_mode: bool = False) -> bool:
+    """
+    Check whether it is able to restart cluster now
+    1. If pacemaker is not running, return True
+    2. If no non-stonith resource is running, return True
+    3. If in maintenance mode and DLM is not running, return True
+    4. Otherwise, return False with warning messages to guide user
+    """
+    if not ServiceManager().service_is_active(constants.PCMK_SERVICE):
+        return True
+    crm_mon_parser = xmlutil.CrmMonXmlParser()
+    if not crm_mon_parser.is_non_stonith_resource_running():
+        return True
+    elif in_maintenance_mode:
+        if is_dlm_running():
+            dlm_related_ids = crm_mon_parser.get_resource_top_parent_id_set_via_type(constants.DLM_CONTROLD_RA)
+            logger.warning("Please stop DLM related resources (%s) and try again", ', '.join(dlm_related_ids))
+            return False
+        else:
+            return True
+    else:
+        logger.warning("Please stop all running resources and try again")
+        logger.warning("Or use 'crm -F/--force' option to leverage maintenance mode")
+        logger.warning("Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting")
+        logger.info("Aborting the configuration change attempt")
+        return False
 # vim:ts=4:sw=4:et:
diff --git a/crmsh/xmlutil.py b/crmsh/xmlutil.py
@@ -1627,6 +1627,13 @@ def is_resource_started(self, ra):
         xpath = f'//resource[(@id="{ra}" or @resource_agent="{ra}") and @active="true" and @role="Started"]'
         return bool(self.xml_elem.xpath(xpath))
 
+    def get_resource_top_parent_id_set_via_type(self, ra_type):
+        """
+        Given configured ra type, get the topmost parent ra id set
+        """
+        xpath = f'//resource[@resource_agent="{ra_type}"]'
+        return set([get_topmost_rsc(elem).get('id') for elem in self.xml_elem.xpath(xpath)])
+
     def get_resource_id_list_via_type(self, ra_type):
         """
         Given configured ra type, get the ra id list
diff --git a/test/features/sbd_ui.feature b/test/features/sbd_ui.feature
@@ -132,3 +132,21 @@ Feature: crm sbd ui test cases
     And     Run "crm cluster restart --all" on "hanode1"
     Then    Service "sbd.service" is "stopped" on "hanode1"
     Then    Service "sbd.service" is "stopped" on "hanode2"
+
+  @clean
+  Scenario: Leverage maintenance mode
+    When    Run "crm cluster init -y" on "hanode1"
+    And     Run "crm cluster join -c hanode1 -y" on "hanode2"
+    Then    Cluster service is "started" on "hanode1"
+    Then    Cluster service is "started" on "hanode2"
+    When    Run "crm configure primitive d Dummy" on "hanode1"
+    When    Try "crm cluster init sbd -s /dev/sda5 -y"
+    Then    Expected "Or use 'crm -F/--force' option to leverage maintenance mode" in stderr
+    When    Run "crm -F cluster init sbd -s /dev/sda5 -y" on "hanode1"
+    Then    Service "sbd" is "started" on "hanode1"
+    And     Service "sbd" is "started" on "hanode2"
+    When    Try "crm sbd purge"
+    Then    Expected "Or use 'crm -F/--force' option to leverage maintenance mode" in stderr
+    When    Run "crm -F sbd purge" on "hanode1"
+    Then    Service "sbd.service" is "stopped" on "hanode1"
+    Then    Service "sbd.service" is "stopped" on "hanode2"
diff --git a/test/unittests/test_sbd.py b/test/unittests/test_sbd.py
@@ -406,57 +406,6 @@ def test_enable_sbd_service(self, mock_list_cluster_nodes, mock_ServiceManager,
             call("Enable %s on node %s", constants.SBD_SERVICE, 'node2')
         ])
 
-    @patch('crmsh.xmlutil.CrmMonXmlParser')
-    @patch('crmsh.sbd.ServiceManager')
-    def test_restart_cluster_if_possible_return(self, mock_ServiceManager, mock_CrmMonXmlParser):
-        mock_ServiceManager.return_value.service_is_active.return_value = False
-        SBDManager.restart_cluster_if_possible()
-        mock_ServiceManager.return_value.service_is_active.assert_called_once_with(constants.PCMK_SERVICE)
-        mock_CrmMonXmlParser.assert_not_called()
-
-    @patch('logging.Logger.warning')
-    @patch('crmsh.utils.is_dlm_running')
-    @patch('crmsh.xmlutil.CrmMonXmlParser')
-    @patch('crmsh.sbd.ServiceManager')
-    def test_restart_cluster_if_possible_manually(
-            self, mock_ServiceManager, mock_CrmMonXmlParser, mock_is_dlm_running, mock_logger_warning,
-    ):
-        mock_ServiceManager.return_value.service_is_active.return_value = True
-        mock_CrmMonXmlParser.return_value.is_non_stonith_resource_running.return_value = True
-        mock_is_dlm_running.return_value = False
-        SBDManager.restart_cluster_if_possible()
-        mock_ServiceManager.return_value.service_is_active.assert_called_once_with(constants.PCMK_SERVICE)
-        mock_logger_warning.assert_has_calls([
-            call("Resource is running, need to restart cluster service manually on each node"),
-            call("Or, run with `crm -F` or `--force` option, the `sbd` subcommand will leverage maintenance mode for any changes that require restarting sbd.service"),
-            call("Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting")
-        ])
-
-    @patch('logging.Logger.warning')
-    @patch('crmsh.utils.is_dlm_running')
-    @patch('crmsh.xmlutil.CrmMonXmlParser')
-    @patch('crmsh.sbd.ServiceManager')
-    def test_restart_cluster_if_possible_dlm_running(
-            self, mock_ServiceManager, mock_CrmMonXmlParser, mock_is_dlm_running, mock_logger_warning,
-    ):
-        mock_ServiceManager.return_value.service_is_active.return_value = True
-        mock_CrmMonXmlParser.return_value.is_non_stonith_resource_running.return_value = True
-        mock_is_dlm_running.return_value = True
-        SBDManager.restart_cluster_if_possible(with_maintenance_mode=True)
-        mock_ServiceManager.return_value.service_is_active.assert_called_once_with(constants.PCMK_SERVICE)
-        mock_logger_warning.assert_called_once_with("Resource is running, need to restart cluster service manually on each node")
-
-    @patch('crmsh.bootstrap.restart_cluster')
-    @patch('logging.Logger.warning')
-    @patch('crmsh.xmlutil.CrmMonXmlParser')
-    @patch('crmsh.sbd.ServiceManager')
-    def test_restart_cluster_if_possible(self, mock_ServiceManager, mock_CrmMonXmlParser, mock_logger_warning, mock_restart_cluster):
-        mock_ServiceManager.return_value.service_is_active.return_value = True
-        mock_CrmMonXmlParser.return_value.is_non_stonith_resource_running.return_value = False
-        SBDManager.restart_cluster_if_possible()
-        mock_ServiceManager.return_value.service_is_active.assert_called_once_with(constants.PCMK_SERVICE)
-        mock_restart_cluster.assert_called_once()
-
     @patch('crmsh.bootstrap.prompt_for_string')
     def test_prompt_for_sbd_device_diskless(self, mock_prompt_for_string):
         mock_prompt_for_string.return_value = "none"
@@ -644,10 +593,10 @@ def test_init_and_deploy_sbd_not_config_sbd(self, mock_ServiceManager):
         sbdmanager_instance._load_attributes_from_bootstrap.assert_not_called()
 
     @patch('crmsh.bootstrap.adjust_properties')
-    @patch('crmsh.sbd.SBDManager.restart_cluster_if_possible')
+    @patch('crmsh.bootstrap.restart_cluster')
     @patch('crmsh.sbd.SBDManager.enable_sbd_service')
     @patch('crmsh.sbd.ServiceManager')
-    def test_init_and_deploy_sbd(self, mock_ServiceManager, mock_enable_sbd_service, mock_restart_cluster_if_possible, mock_adjust_properties):
+    def test_init_and_deploy_sbd(self, mock_ServiceManager, mock_enable_sbd_service, mock_restart_cluster, mock_adjust_properties):
         mock_bootstrap_ctx = Mock(cluster_is_running=True)
         sbdmanager_instance = SBDManager(bootstrap_context=mock_bootstrap_ctx)
         sbdmanager_instance.get_sbd_device_from_bootstrap = Mock()
diff --git a/test/unittests/test_ui_sbd.py b/test/unittests/test_ui_sbd.py
@@ -469,14 +469,14 @@ def test_device_remove_last_dev(self):
             self.sbd_instance_diskbased._device_remove(["/dev/sda1"])
         self.assertEqual(str(e.exception), "Not allowed to remove all devices")
 
-    @mock.patch('crmsh.sbd.SBDManager.restart_cluster_if_possible')
+    @mock.patch('crmsh.bootstrap.restart_cluster')
     @mock.patch('crmsh.sbd.SBDManager.update_sbd_configuration')
     @mock.patch('logging.Logger.info')
-    def test_device_remove(self, mock_logger_info, mock_update_sbd_configuration, mock_restart_cluster_if_possible):
+    def test_device_remove(self, mock_logger_info, mock_update_sbd_configuration, mock_restart_cluster):
         self.sbd_instance_diskbased.device_list_from_config = ["/dev/sda1", "/dev/sda2"]
         self.sbd_instance_diskbased._device_remove(["/dev/sda1"])
         mock_update_sbd_configuration.assert_called_once_with({"SBD_DEVICE": "/dev/sda2"})
-        mock_restart_cluster_if_possible.assert_called_once()
+        mock_restart_cluster.assert_called_once()
         mock_logger_info.assert_called_once_with("Remove devices: %s", "/dev/sda1")
 
     def test_do_device_no_service(self):
@@ -571,9 +571,10 @@ def test_do_purge_no_service(self, mock_purge_sbd_from_cluster):
         self.assertFalse(res)
         mock_purge_sbd_from_cluster.assert_not_called()
 
+    @mock.patch('crmsh.bootstrap.restart_cluster')
     @mock.patch('crmsh.utils.check_all_nodes_reachable')
     @mock.patch('crmsh.sbd.purge_sbd_from_cluster')
-    def test_do_purge(self, mock_purge_sbd_from_cluster, mock_check_all_nodes_reachable):
+    def test_do_purge(self, mock_purge_sbd_from_cluster, mock_check_all_nodes_reachable, mock_restart_cluster):
         self.sbd_instance_diskbased._load_attributes = mock.Mock()
         self.sbd_instance_diskbased._service_is_active = mock.Mock(return_value=True)
         res = self.sbd_instance_diskbased.do_purge(mock.Mock())