diff --git a/docs/user_guide/command_line.md b/docs/user_guide/command_line.md index 2bdb1f8b..1b9505a1 100644 --- a/docs/user_guide/command_line.md +++ b/docs/user_guide/command_line.md @@ -935,14 +935,16 @@ merlin database cleanup [OPTIONS] **Options:** -| Name | Type | Description | Default | -| ------------ | ------- | ----------- | ------- | -| `-h`, `--help` | boolean | Show this help message and exit | `False` | -| `--dry-run` | boolean | Show what would be deleted without actually deleting anything | `False` | -| `--skip-runs` | boolean | Skip checking for runs with invalid workspaces | `False` | -| `--skip-workers` | boolean | Skip checking for orphaned workers | `False` | -| `--skip-studies` | boolean | Skip checking for empty studies | `False` | -| `-f`, `--force` | boolean | Skip confirmation prompt (use with caution) | `False` | +| Name | Type | Description | Default | +| ------------------------- | ------- | --------------------------------------------------------------- | ------- | +| `-h`, `--help` | boolean | Show this help message and exit | `False` | +| `--dry-run` | boolean | Show what would be deleted without actually deleting anything | `False` | +| `--skip-runs` | boolean | Skip checking for runs with invalid workspaces | `False` | +| `--skip-workers` | boolean | Skip checking for orphaned workers (both logical and physical) | `False` | +| `--skip-logical-workers` | boolean | Skip checking for orphaned logical workers | `False` | +| `--skip-physical-workers` | boolean | Skip checking for orphaned physical workers | `False` | +| `--skip-studies` | boolean | Skip checking for empty studies | `False` | +| `-f`, `--force` | boolean | Skip confirmation prompt (use with caution) | `False` | **Examples:** diff --git a/docs/user_guide/database/garbage_collection.md b/docs/user_guide/database/garbage_collection.md index 45af2ebb..f78a6c3f 100644 --- a/docs/user_guide/database/garbage_collection.md +++ b/docs/user_guide/database/garbage_collection.md @@ -31,6 +31,49 @@ The garbage collection process follows a cascading cleanup approach: This cascading approach ensures that removing a run with a missing workspace automatically cleans up its dependent entities. +### Distributed Filesystem Awareness + +!!! info "Multi-Machine Considerations" + + Merlin is designed to work across distributed computing environments where different machines may have access to different filesystems. The garbage collector intelligently handles this by: + + - **Detecting accessible mount points**: Automatically identifies which filesystems are mounted on the current machine + - **Skipping inaccessible workspaces**: Won't flag workspaces as invalid if they're on filesystems not mounted on the current machine + - **Logging inaccessible workspaces**: Provides informative warnings about workspaces that couldn't be verified + +When garbage collection runs, it checks whether each workspace is on a filesystem that's accessible from the current machine: + +- **Accessible workspaces** (e.g., on `/p/lustre3` when that filesystem is mounted): Checked for existence and flagged as invalid if missing +- **Inaccessible workspaces** (e.g., on `/p/lustre3` when that filesystem is NOT mounted): Skipped with a warning, not flagged as invalid +- **Local workspaces** (e.g., on root filesystem like `/tmp`): Always checked if they exist locally + +??? example "Example: Running Garbage Collection on Different Machines" + + **Scenario**: You have workspaces on `/p/lustre3` (shared filesystem). + + **Machine A** (has `/p/lustre3` mounted): + + ```bash + $ merlin database gc --dry-run + [INFO] [GARBAGE COLLECTOR] Checking run workspaces for validity... + [INFO] [GARBAGE COLLECTOR] Found 2 runs with invalid workspaces. + ... + ``` + + **Machine B** (does NOT have `/p/lustre3` mounted): + + ```bash + $ merlin database gc --dry-run + [INFO] [GARBAGE COLLECTOR] Checking run workspaces for validity... + [INFO] [GARBAGE COLLECTOR] Found 0 runs with invalid workspaces. + [WARNING] [GARBAGE COLLECTOR] Found 5 runs with workspaces on file systems not + accessible from the current host 'machine-b'. Run garbage collection + from a machine with access to verify these. + ... + ``` + + The 5 workspaces on `/p/lustre3` are safely skipped on Machine B because that filesystem isn't mounted there. Run garbage collection from Machine A to properly verify those workspaces. + ## Basic Usage To preview what would be removed without actually deleting anything: @@ -44,6 +87,7 @@ This displays a detailed report showing: - Which runs have missing workspaces - Which workers would be removed due to having no valid runs - Which studies would be removed due to having no remaining runs +- Which workspaces are on filesystems not accessible from the current machine (for informational purposes) ??? example "Example Output for Dry Run" @@ -57,6 +101,9 @@ This displays a detailed report showing: [2025-10-08 10:27:31: INFO] Fetching all runs from Redis... [2025-10-08 10:27:31: INFO] Successfully retrieved 2 runs from Redis. [2025-10-08 10:27:31: INFO] [GARBAGE COLLECTOR] Found 1 runs with invalid workspaces. + [2025-10-08 10:27:31: WARNING] [GARBAGE COLLECTOR] Found 1 runs with workspaces on file systems not + accessible from the current host 'machine-a'. Run garbage collection + from a machine with access to verify these. [2025-10-08 10:27:31: INFO] [GARBAGE COLLECTOR] Checking for orphaned logical workers... [2025-10-08 10:27:31: INFO] Fetching all runs from Redis... [2025-10-08 10:27:31: INFO] Successfully retrieved 2 runs from Redis. @@ -82,16 +129,30 @@ This displays a detailed report showing: ============================================================ Invalid Runs: 1 - - /path/to/hello_samples_20251008-102348 + - /path/to/hello_samples_20251008-102348 + + Inaccessible Workspaces: 1 + - /p/lustre3/other_workspace_20251008-101234 + (on filesystem not accessible from current host) Orphaned Logical Workers: 1 - - hello_samples_worker (queues: [merlin]_step_1_queue, [merlin]_step_2_queue) + - hello_samples_worker (queues: [merlin]_step_1_queue, [merlin]_step_2_queue) Orphaned Physical Workers: 1 - - celery@hello_samples_worker.%dane13 (host: dane13) + - celery@hello_samples_worker.%dane13 (host: dane13) Empty Studies: 1 - - hello_samples + - hello_samples + + ============================================================ + + Potentially Inaccessible Runs: 1 + - /p/lustre3/other_workspace_20251008-101234 + + You may need to re-run garbage collection on a machine that + can access these runs, or remove them manually if they are + local runs being flagged as inaccessible. + ============================================================ ``` @@ -150,6 +211,11 @@ Orphaned Physical Workers: 1 Empty Studies: 1 - hello_samples + +============================================================ + +Potentially Inaccessible Runs: 0 + ============================================================ [2025-10-08 10:32:47: WARNING] [GARBAGE COLLECTOR] WARNING: This will permanently delete stale database entries. Run with --dry-run first to see what would be deleted. @@ -204,6 +270,22 @@ merlin database gc --skip-runs --dry-run merlin database gc --skip-workers --skip-studies ``` +## Best Practices for Distributed Environments + +When working across multiple machines with different filesystem access: + +1. **Run from a machine with broad filesystem access**: For the most comprehensive cleanup, run garbage collection from a machine that has access to the most access shared file systems. + +2. **Use --dry-run first**: Always preview what will be cleaned up before running the actual cleanup: + + ```bash + merlin database gc --dry-run + ``` + +3. **Check inaccessible workspace warnings**: Pay attention to warnings about inaccessible workspaces. If you see many of these, you may need to run garbage collection from a different machine or manually delete certain runs. + +4. **Understand your filesystem topology**: Know which filesystems are shared (e.g., `/p/lustre3`) vs. local (e.g., `/tmp`, `/home`) in your environment. + ## Limitations The garbage collection process: @@ -211,3 +293,5 @@ The garbage collection process: - Only checks for missing workspace directories (it does not validate workspace contents) - Does not clean up data in the workspace directories themselves (only database entries) - Requires read access to the filesystem paths referenced in run entries +- **Cannot verify workspaces on filesystems that are not mounted on the current machine**: Workspaces on inaccessible filesystems are conservatively skipped to avoid incorrectly flagging valid workspaces as stale +- **Does not distinguish between truly deleted workspaces and temporarily inaccessible ones**: If a shared filesystem is temporarily unmounted or experiencing issues, those workspaces will be skipped but not flagged as invalid diff --git a/merlin/db_scripts/garbage_collector.py b/merlin/db_scripts/garbage_collector.py index 3564d7e5..d5fe1ce5 100644 --- a/merlin/db_scripts/garbage_collector.py +++ b/merlin/db_scripts/garbage_collector.py @@ -13,12 +13,15 @@ import logging import os -from typing import Dict, List +import socket +import textwrap +from pathlib import Path +from typing import Dict, List, Union from merlin.db_scripts.entities.db_entity import DatabaseEntity from merlin.db_scripts.merlin_db import MerlinDatabase from merlin.exceptions import RunNotFoundError, StudyNotFoundError, WorkerNotFoundError -from merlin.utils import get_plural_of_entity +from merlin.utils import get_accessible_mounts, get_plural_of_entity LOG = logging.getLogger(__name__) @@ -85,7 +88,13 @@ def __init__(self, merlin_db: MerlinDatabase = None): merlin_db: Optional MerlinDatabase instance. Creates one if not provided. """ self.merlin_db = merlin_db or MerlinDatabase() - self._issues: Dict[str, List[DatabaseEntity]] = {"run": [], "logical_worker": [], "physical_worker": [], "study": []} + self._issues: Dict[str, List[DatabaseEntity]] = { + "run": [], + "logical_worker": [], + "physical_worker": [], + "study": [], + "inaccessible_runs": [], + } def _prompt_for_confirmation(self) -> bool: """ @@ -105,6 +114,44 @@ def _prompt_for_confirmation(self) -> bool: LOG.debug(f"[GARBAGE COLLECTOR] response: {response}") return response in ["yes", "y"] + def _is_workspace_on_accessible_mount(self, workspace: Union[str, Path]) -> bool: + """ + Check if a workspace path is on an accessible mount point (excluding root). + + This purposefully does NOT include '/' as an accessible mount point. We do this + since every workspace is relative to the root filesystem, and we want to + specifically check for other mounted filesystems that may not be accessible. + + Args: + workspace: The workspace path to check. + + Returns: + True if workspace is on an accessible mount (excluding root), False otherwise. + """ + if not isinstance(workspace, Path): + workspace = Path(workspace) + + accessible_mounts = get_accessible_mounts(exclude_root=True) + workspace_path = workspace.resolve() + + # Check if workspace path starts with any accessible mount + for mount in sorted(accessible_mounts, key=lambda p: len(str(p)), reverse=True): + # Sort by length (longest first) to match most specific mount point + try: + workspace_path.relative_to(mount) + # If we get here, workspace is under this mount + return True + except ValueError: + # Not relative to this mount, continue checking + continue + + # If we didn't find any matching mount, it's not accessible or it's on the root filesystem + LOG.warning( + f"[GARBAGE COLLECTOR] Workspace '{workspace}' is either on the root filesystem or does not exist " + f"in a valid mounted file system on current host '{socket.gethostname()}'." + ) + return False + def check_run_workspaces(self): """ Check all runs for valid workspace directories. @@ -115,13 +162,47 @@ def check_run_workspaces(self): LOG.info("[GARBAGE COLLECTOR] Checking run workspaces for validity...") all_runs = self.merlin_db.runs.get_all() + for run in all_runs: workspace = run.get_workspace() - if not os.path.exists(workspace): - LOG.debug(f"[GARBAGE COLLECTOR] Run {run.get_id()} has invalid workspace: {workspace}") + + # Check if workspace is on an accessible NON-ROOT mount (e.g., a network filesystem). + # This will be False for workspaces on the local root filesystem. + is_accessible_mount = self._is_workspace_on_accessible_mount(workspace) + + # Check if the workspace physically exists on the current host. + workspace_exists = os.path.exists(workspace) + + if is_accessible_mount and not workspace_exists: + # Case 1: Workspace is on an accessible NON-ROOT mount (e.g., /mnt/nfs) but is missing. + # This is an invalid workspace issue. + LOG.debug(f"[GARBAGE COLLECTOR] Run {run.get_id()} has invalid workspace on an accessible mount: {workspace}") self._issues["run"].append(run) + elif not is_accessible_mount and not workspace_exists: + # Case 2: Workspace is NOT on a non-root accessible mount AND does not physically exist on current host. + # This indicates the workspace is likely on an inaccessible mount *or* it was a local + # workspace that was deleted, but we treat this as *potentially* inaccessible + # to avoid premature deletion of runs accessible from another host. + LOG.debug( + f"[GARBAGE COLLECTOR] Run '{run.get_id()}' has workspace on potentially inaccessible mount " + f"and does not exist: {workspace}" + ) + self._issues["inaccessible_runs"].append(run) + + # Case 3: Workspace is NOT on a non-root accessible mount, but DOES exist. + # This is the expected state for a valid workspace on the local root filesystem. + # No action needed, it's considered valid. + + # Case 4: is_accessible_mount and workspace_exists: Valid non-root workspace. No action. + LOG.info(f"[GARBAGE COLLECTOR] Found {len(self._issues['run'])} runs with invalid workspaces.") + if self._issues["inaccessible_runs"]: + LOG.warning( + f"[GARBAGE COLLECTOR] Found {len(self._issues['inaccessible_runs'])} runs with workspaces " + f"on file systems not accessible from the current host '{socket.gethostname()}'. Run garbage " + "collection from a machine with access to verify these." + ) def check_orphaned_logical_workers(self): """ @@ -306,6 +387,27 @@ def generate_report(self) -> str: for study in self._issues["study"]: report_lines.append(f" - {study.get_name()}") + report_lines.append("") + + report_lines.append("=" * 60) + + report_lines.append("") + + # Potentially Inaccessible Runs section + report_lines.append(f"Potentially Inaccessible Runs: {len(self._issues['inaccessible_runs'])}") + if self._issues["inaccessible_runs"]: + for run in self._issues["inaccessible_runs"]: + report_lines.append(f" - {run.get_workspace()}") + report_lines.append("") + inaccessible_message = ( + "You may need to re-run garbage collection on a machine that can access these runs, " + "or remove them manually if they are local runs being flagged as inaccessible." + ) + wrapped_message = textwrap.fill(inaccessible_message, width=60) + report_lines.append(wrapped_message) + + report_lines.append("") + report_lines.append("=" * 60) return "\n".join(report_lines) diff --git a/merlin/utils.py b/merlin/utils.py index 8d51d495..a1bc4a27 100644 --- a/merlin/utils.py +++ b/merlin/utils.py @@ -18,8 +18,9 @@ from contextlib import contextmanager from copy import deepcopy from datetime import datetime, timedelta +from pathlib import Path from types import SimpleNamespace -from typing import Any, Callable, Dict, Generator, List, Tuple, Union +from typing import Any, Callable, Dict, Generator, List, Set, Tuple, Union import numpy as np import pkg_resources @@ -465,6 +466,11 @@ def load_array_file(filename: str, ndmin: int = 2) -> np.ndarray: return array +################################# +# File system utility functions # +################################# + + def determine_protocol(fname: str) -> str: """ Determine the file protocol based on the file name extension. @@ -537,6 +543,33 @@ def verify_dirpath(dirpath: str) -> str: return dirpath +def get_accessible_mounts(exclude_root: bool = False) -> Set[Path]: + """ + Get set of mount points that are actually accessible on this machine. + + By default, this likely includes the root filesystem (e.g., '/' or 'C:\\') as an + accessible mount point. You may want to exclude it if you are specifically + interested in other mounted filesystems. + + Args: + exclude_root: If True, exclude the root filesystem from the list of + accessible mounts. + + Returns: + Set of accessible mount point paths. + """ + accessible = set() + + for part in psutil.disk_partitions(all=True): + mountpoint = Path(part.mountpoint) + if exclude_root and mountpoint == Path(mountpoint.anchor): + continue + accessible.add(mountpoint) + + LOG.debug(f"Accessible mounts on {socket.gethostname()}: {accessible}") + return accessible + + @contextmanager def cd(path: str) -> Generator[None, None, None]: # pylint: disable=C0103 """ diff --git a/tests/unit/db_scripts/test_garbage_collector.py b/tests/unit/db_scripts/test_garbage_collector.py index 778be100..caaf0a77 100644 --- a/tests/unit/db_scripts/test_garbage_collector.py +++ b/tests/unit/db_scripts/test_garbage_collector.py @@ -8,13 +8,33 @@ Unit tests for the `merlin/db_scripts/garbage_collector.py` module. """ +import os +from pathlib import Path from unittest.mock import MagicMock import pytest +from _pytest.monkeypatch import MonkeyPatch from pytest_mock import MockerFixture from merlin.db_scripts.garbage_collector import DatabaseGarbageCollector from merlin.exceptions import RunNotFoundError +from tests.fixture_types import FixtureCallable, FixtureStr + + +@pytest.fixture(scope="session") +def garbage_collection_testing_dir(create_testing_dir: FixtureCallable, temp_output_dir: FixtureStr) -> FixtureStr: + """ + Fixture to create a temporary output directory for tests related to testing the + garbage collection workflow. + + Args: + create_testing_dir: A fixture which returns a function that creates the testing directory. + temp_output_dir: The path to the temporary ouptut directory we'll be using for this test run. + + Returns: + The path to the temporary testing directory for garbage collection tests. + """ + return create_testing_dir(temp_output_dir, "garbage_collection_testing") @pytest.fixture @@ -121,7 +141,13 @@ def test_init_with_provided_db(self, mock_db: MagicMock): """ gc = DatabaseGarbageCollector(merlin_db=mock_db) assert gc.merlin_db is mock_db - assert gc._issues == {"run": [], "logical_worker": [], "physical_worker": [], "study": []} + assert gc._issues == { + "run": [], + "logical_worker": [], + "physical_worker": [], + "study": [], + "inaccessible_runs": [], + } def test_init_without_provided_db(self, mock_db: MagicMock): """ @@ -133,7 +159,13 @@ def test_init_without_provided_db(self, mock_db: MagicMock): gc = DatabaseGarbageCollector() mock_db.assert_called_once() assert gc.merlin_db is not None - assert gc._issues == {"run": [], "logical_worker": [], "physical_worker": [], "study": []} + assert gc._issues == { + "run": [], + "logical_worker": [], + "physical_worker": [], + "study": [], + "inaccessible_runs": [], + } class TestPromptForConfirmation: @@ -169,14 +201,343 @@ def test_prompt_for_confirmation_invalid_then_valid(self, mocker: MockerFixture, assert mock_input.call_count == 2 +class TestIsWorkspaceOnAccessibleMount: + """Tests for the _is_workspace_on_accessible_mount method.""" + + def test_workspace_on_specific_mount(self, mocker: MockerFixture, gc: DatabaseGarbageCollector): + """ + Test workspace on a specific non-root mount point. + + Args: + mocker: Pytest mocker fixture. + gc: DatabaseGarbageCollector instance. + """ + workspace = "/mnt/shared/workspace" + + mocker.patch( + "merlin.db_scripts.garbage_collector.get_accessible_mounts", return_value={Path("/mnt/shared"), Path("/home")} + ) + + result = gc._is_workspace_on_accessible_mount(workspace) + + assert result is True + + # TODO is there a way to check which mount point is matched? + def test_workspace_on_longest_matching_mount(self, mocker: MockerFixture, gc: DatabaseGarbageCollector): + """ + Test that most specific (longest) mount point is matched. + + If workspace is /p/lustre3/data/workspace and both /p and /p/lustre3 + are mounted, should match /p/lustre3. + + Args: + mocker: Pytest mocker fixture. + gc: DatabaseGarbageCollector instance. + """ + workspace = "/p/lustre3/data/workspace" + + mocker.patch( + "merlin.db_scripts.garbage_collector.get_accessible_mounts", + return_value={Path("/p"), Path("/p/lustre3"), Path("/home")}, + ) + + result = gc._is_workspace_on_accessible_mount(workspace) + + # Should match, and internally should prefer /p/lustre3 over /p + assert result is True + + def test_workspace_not_on_any_mount(self, mocker: MockerFixture, gc: DatabaseGarbageCollector): + """ + Test workspace that is not on any accessible non-root mount. + + Args: + mocker: Pytest mocker fixture. + gc: DatabaseGarbageCollector instance. + """ + workspace = "/p/lustre3/workspace" + + # Only /home is accessible, not /p/lustre3 + mocker.patch("merlin.db_scripts.garbage_collector.get_accessible_mounts", return_value={Path("/home")}) + + mock_log = mocker.patch("merlin.db_scripts.garbage_collector.LOG") + + result = gc._is_workspace_on_accessible_mount(workspace) + + assert result is False + mock_log.warning.assert_called_once() + warning_msg = str(mock_log.warning.call_args) + assert "root filesystem" in warning_msg and "mounted file system" in warning_msg + + def test_workspace_on_root_filesystem(self, mocker: MockerFixture, gc: DatabaseGarbageCollector): + """ + Test workspace on root filesystem (e.g., /tmp, /var). + + Since get_accessible_mounts excludes root, this should return False. + + Args: + mocker: Pytest mocker fixture. + gc: DatabaseGarbageCollector instance. + """ + workspace = "/tmp/workspace" + + # No specific mount for /tmp + mocker.patch( + "merlin.db_scripts.garbage_collector.get_accessible_mounts", return_value={Path("/home"), Path("/mnt/data")} + ) + + result = gc._is_workspace_on_accessible_mount(workspace) + + assert result is False + + def test_workspace_with_path_object(self, mocker: MockerFixture, gc: DatabaseGarbageCollector): + """ + Test that Path objects are handled correctly. + + Args: + mocker: Pytest mocker fixture. + gc: DatabaseGarbageCollector instance. + """ + workspace = Path("/mnt/shared/workspace") + + mocker.patch("merlin.db_scripts.garbage_collector.get_accessible_mounts", return_value={Path("/mnt/shared")}) + + result = gc._is_workspace_on_accessible_mount(workspace) + + assert result is True + + def test_workspace_with_symlink( + self, mocker: MockerFixture, gc: DatabaseGarbageCollector, garbage_collection_testing_dir: FixtureStr + ): + """ + Test that symlinks are resolved before checking mount points. + + Args: + mocker: Pytest mocker fixture. + gc: DatabaseGarbageCollector instance. + garbage_collection_testing_dir: The path to the temporary output directory where garbage collection + tests will store their results. + """ + gc_testing_dir = Path(garbage_collection_testing_dir) + + # Create a real directory and symlink for testing + real_dir = gc_testing_dir / "real" / "workspace" + real_dir.mkdir(parents=True) + + symlink = gc_testing_dir / "link" + symlink.symlink_to(real_dir.parent) + + workspace = symlink / "workspace" + + # Mock the mount to match the real path + mocker.patch("merlin.db_scripts.garbage_collector.get_accessible_mounts", return_value={gc_testing_dir / "real"}) + + result = gc._is_workspace_on_accessible_mount(workspace) + + assert result is True + + def test_workspace_with_relative_path( + self, + mocker: MockerFixture, + gc: DatabaseGarbageCollector, + garbage_collection_testing_dir: FixtureStr, + monkeypatch: MonkeyPatch, + ): + """ + Test that relative paths are resolved to absolute paths. + + Args: + mocker: Pytest mocker fixture. + gc: DatabaseGarbageCollector instance. + garbage_collection_testing_dir: The path to the temporary output directory where garbage collection + tests will store their results. + monkeypatch: Pytest monkeypatch fixture. + """ + gc_testing_dir = Path(garbage_collection_testing_dir) + + # Create a test directory structure + test_dir = gc_testing_dir / "mnt" / "shared" + test_dir.mkdir(parents=True, exist_ok=True) + + # Change to the workspace parent directory + workspace_dir = test_dir / "workspace" + workspace_dir.mkdir() + + monkeypatch.chdir(test_dir) + + # Use relative path + workspace = "./workspace" + + mocker.patch( + "merlin.db_scripts.garbage_collector.get_accessible_mounts", return_value={gc_testing_dir / "mnt" / "shared"} + ) + + result = gc._is_workspace_on_accessible_mount(workspace) + + assert result is True + + def test_no_accessible_mounts(self, mocker: MockerFixture, gc: DatabaseGarbageCollector): + """ + Test behavior when there are no accessible non-root mounts. + + Args: + mocker: Pytest mocker fixture. + gc: DatabaseGarbageCollector instance. + """ + workspace = "/any/workspace" + + # No mounts at all (empty set) + mocker.patch("merlin.db_scripts.garbage_collector.get_accessible_mounts", return_value=set()) + + mock_log = mocker.patch("merlin.db_scripts.garbage_collector.LOG") + + result = gc._is_workspace_on_accessible_mount(workspace) + + assert result is False + mock_log.warning.assert_called_once() + + def test_workspace_at_mount_root(self, mocker: MockerFixture, gc: DatabaseGarbageCollector): + """ + Test workspace located directly at mount point root. + + Args: + mocker: Pytest mocker fixture. + gc: DatabaseGarbageCollector instance. + """ + workspace = "/mnt/shared" + + mocker.patch("merlin.db_scripts.garbage_collector.get_accessible_mounts", return_value={Path("/mnt/shared")}) + + result = gc._is_workspace_on_accessible_mount(workspace) + + assert result is True + + def test_workspace_similar_but_not_matching_mount(self, mocker: MockerFixture, gc: DatabaseGarbageCollector): + """ + Test workspace with path similar to mount but not under it. + + E.g., /p/lustre2 should not match mount /p/lustre1 + + Args: + mocker: Pytest mocker fixture. + gc: DatabaseGarbageCollector instance. + """ + workspace = "/p/lustre2/workspace" + + mocker.patch("merlin.db_scripts.garbage_collector.get_accessible_mounts", return_value={Path("/p/lustre1")}) + + result = gc._is_workspace_on_accessible_mount(workspace) + + assert result is False + + def test_get_accessible_mounts_called_with_exclude_root(self, mocker: MockerFixture, gc: DatabaseGarbageCollector): + """ + Test that get_accessible_mounts is called with exclude_root=True. + + Args: + mocker: Pytest mocker fixture. + gc: DatabaseGarbageCollector instance. + """ + workspace = "/mnt/shared/workspace" + + mock_get_mounts = mocker.patch( + "merlin.db_scripts.garbage_collector.get_accessible_mounts", return_value={Path("/mnt/shared")} + ) + + gc._is_workspace_on_accessible_mount(workspace) + + mock_get_mounts.assert_called_once_with(exclude_root=True) + + def test_workspace_with_trailing_slash(self, mocker: MockerFixture, gc: DatabaseGarbageCollector): + """ + Test workspace path with trailing slash is handled correctly. + + Args: + mocker: Pytest mocker fixture. + gc: DatabaseGarbageCollector instance. + """ + workspace = "/mnt/shared/workspace/" # Note trailing slash + + mocker.patch("merlin.db_scripts.garbage_collector.get_accessible_mounts", return_value={Path("/mnt/shared")}) + + result = gc._is_workspace_on_accessible_mount(workspace) + + assert result is True + + def test_workspace_with_dot_segments( + self, mocker: MockerFixture, gc: DatabaseGarbageCollector, garbage_collection_testing_dir: FixtureStr + ): + """ + Test workspace path with .. segment is resolved correctly. + + Args: + mocker: Pytest mocker fixture. + gc: DatabaseGarbageCollector instance. + garbage_collection_testing_dir: The path to the temporary output directory where garbage collection + tests will store their results. + """ + gc_testing_dir = Path(garbage_collection_testing_dir) + + # Create test structure + mount_dir = gc_testing_dir / "mnt" / "shared" + mount_dir.mkdir(parents=True, exist_ok=True) + + # Path with .. segments that resolves to mount_dir + workspace = gc_testing_dir / "mnt" / "other" / ".." / "shared" / "workspace" + + mocker.patch("merlin.db_scripts.garbage_collector.get_accessible_mounts", return_value={mount_dir}) + + result = gc._is_workspace_on_accessible_mount(workspace) + + assert result is True + + class TestCheckRunWorkspaces: """Tests for the check_run_workspaces method.""" - def test_check_with_valid_workspace( + def test_check_with_invalid_workspace_on_accessible_mount( + self, mocker: MockerFixture, gc: DatabaseGarbageCollector, mock_run: MagicMock, mock_db: MagicMock + ): + """ + Test checking runs when workspace is on an accessible mount but doesn't exist. + + Case 1: Workspace is on an accessible NON-ROOT mount (e.g., /mnt/nfs) but is missing. + This is an invalid workspace issue (is_accessible_mount=True but workspace_exists=False). + + Expected: Workspace flagged as invalid. + + Args: + mocker: Pytest mocker fixture. + gc: DatabaseGarbageCollector instance. + mock_run: Mocked run instance. + mock_db: Mocked database instance. + """ + workspace = "/mnt/shared/missing_workspace" + mock_run.get_workspace.return_value = workspace + mock_run.get_id.return_value = "run-123" + mock_db.runs.get_all.return_value = [mock_run] + + mocker.patch.object(gc, "_is_workspace_on_accessible_mount", return_value=True) + mocker.patch("os.path.exists", return_value=False) + + gc.check_run_workspaces() + + assert len(gc._issues["run"]) == 1 + assert gc._issues["run"][0] == mock_run + + assert len(gc._issues["inaccessible_runs"]) == 0 + + def test_check_with_workspace_on_inaccessible_mount( self, mocker: MockerFixture, gc: DatabaseGarbageCollector, mock_run: MagicMock, mock_db: MagicMock ): """ - Test checking runs when workspace exists. + Test checking runs when workspace is on an inaccessible mount. + + Case 2: Workspace is NOT on a non-root accessible mount AND does not physically exist on current host. + This indicates the workspace is likely on an inaccessible mount *or* it was a local workspace that was + deleted, but we treat this as *potentially* inaccessible to avoid premature deletion of runs accessible + from another host (is_accessible_mount=False and workspace_exists=False). + + Expected: Counted as inaccessible, not flagged as invalid. Args: mocker: Pytest mocker fixture. @@ -184,19 +545,76 @@ def test_check_with_valid_workspace( mock_run: Mocked run instance. mock_db: Mocked database instance. """ - mock_run.get_workspace.return_value = "/tmp" + workspace = "/p/lustre3/workspace" + mock_run.get_workspace.return_value = workspace + mock_run.get_id.return_value = "run-123" mock_db.runs.get_all.return_value = [mock_run] - with mocker.patch("os.path.exists", return_value=True): - gc.check_run_workspaces() + mocker.patch.object(gc, "_is_workspace_on_accessible_mount", return_value=False) + mocker.patch("os.path.exists", return_value=False) + + mock_log = mocker.patch("merlin.db_scripts.garbage_collector.LOG") + + gc.check_run_workspaces() + # Should NOT be in issues list for run entries assert len(gc._issues["run"]) == 0 - def test_check_with_invalid_workspace( + # Should be in issues list for inaccessible runs + assert len(gc._issues["inaccessible_runs"]) == 1 + assert gc._issues["inaccessible_runs"][0] == mock_run + + # Should log warning about inaccessible workspaces + mock_log.warning.assert_called() + warning_calls = [str(call) for call in list(mock_log.warning.call_args_list + mock_log.debug.call_args_list)] + assert any("inaccessible" in str(call).lower() for call in warning_calls) + + def test_check_with_valid_workspace_on_root_filesystem( + self, + mocker: MockerFixture, + gc: DatabaseGarbageCollector, + mock_run: MagicMock, + mock_db: MagicMock, + garbage_collection_testing_dir: FixtureStr, + ): + """ + Test checking runs when workspace exists on root filesystem (e.g., /tmp). + + Case 3: Workspace is NOT on a non-root accessible mount, but DOES exist. This is the expected state for + a valid workspace on the local root filesystem. No action needed, it's considered valid + (is_accessible_mount=False but workspace_exists=True). + + Expected: No issues flagged (valid local workspace). + + Args: + mocker: Pytest mocker fixture. + gc: DatabaseGarbageCollector instance. + mock_run: Mocked run instance. + mock_db: Mocked database instance. + garbage_collection_testing_dir: The path to the temporary output directory where garbage collection + tests will store files associated with this test. + """ + workspace = os.path.join(garbage_collection_testing_dir, "local_workspace") + os.makedirs(workspace, exist_ok=True) + mock_run.get_workspace.return_value = workspace + mock_run.get_id.return_value = "run-123" + mock_db.runs.get_all.return_value = [mock_run] + + gc.check_run_workspaces() + + assert len(gc._issues["run"]) == 0 + assert len(gc._issues["inaccessible_runs"]) == 0 + + def test_check_with_valid_workspace_on_accessible_mount( self, mocker: MockerFixture, gc: DatabaseGarbageCollector, mock_run: MagicMock, mock_db: MagicMock ): """ - Test checking runs when workspace doesn't exist. + Test checking runs when workspace exists on an accessible non-root mount. + + Case 4: Mount is accessible and the workspace exists. No action needed (is_accessible_mount=True and + workspace_exists=True). + + Expected: No issues flagged. Args: mocker: Pytest mocker fixture. @@ -204,40 +622,146 @@ def test_check_with_invalid_workspace( mock_run: Mocked run instance. mock_db: Mocked database instance. """ - mock_run.get_workspace.return_value = "/nonexistent/path" + workspace = "/mnt/shared/workspace" + mock_run.get_workspace.return_value = workspace + mock_run.get_id.return_value = "run-123" mock_db.runs.get_all.return_value = [mock_run] - with mocker.patch("os.path.exists", return_value=False): - gc.check_run_workspaces() + mocker.patch.object(gc, "_is_workspace_on_accessible_mount", return_value=True) + mocker.patch("os.path.exists", return_value=True) + + gc.check_run_workspaces() + + assert len(gc._issues["run"]) == 0 + assert len(gc._issues["inaccessible_runs"]) == 0 + + def test_check_with_multiple_runs_mixed_cases( + self, + mocker: MockerFixture, + gc: DatabaseGarbageCollector, + mock_db: MagicMock, + garbage_collection_testing_dir: FixtureStr, + ): + """ + Test checking multiple runs covering all four cases. + + Args: + mocker: Pytest mocker fixture. + gc: DatabaseGarbageCollector instance. + mock_db: Mocked database instance. + garbage_collection_testing_dir: The path to the temporary output directory where garbage collection + tests will store files associated with this test. + """ + # Create real directories for valid workspaces + valid_local_path = os.path.join(garbage_collection_testing_dir, "valid_local") + valid_mount_path = os.path.join(garbage_collection_testing_dir, "mnt", "shared", "valid") + os.makedirs(valid_local_path, exist_ok=True) + os.makedirs(valid_mount_path, exist_ok=True) + + # Case 1: Invalid workspace on accessible mount (should be flagged) + invalid_accessible = MagicMock() + invalid_accessible.get_workspace.return_value = os.path.join( + garbage_collection_testing_dir, "mnt", "shared", "invalid" + ) + invalid_accessible.get_id.return_value = "run-invalid-accessible" + + # Case 2: Workspace on inaccessible mount (should not be flagged) + inaccessible = MagicMock() + inaccessible.get_workspace.return_value = "/p/lustre3/workspace" + inaccessible.get_id.return_value = "run-inaccessible" + + # Case 3: Valid local workspace on root filesystem + valid_local = MagicMock() + valid_local.get_workspace.return_value = valid_local_path + valid_local.get_id.return_value = "run-valid-local" + + # Case 4: Valid workspace on accessible mount + valid_accessible = MagicMock() + valid_accessible.get_workspace.return_value = valid_mount_path + valid_accessible.get_id.return_value = "run-valid-accessible" + + mock_db.runs.get_all.return_value = [invalid_accessible, inaccessible, valid_local, valid_accessible] + + # Mock _is_workspace_on_accessible_mount based on workspace path + def is_accessible_side_effect(workspace): + # Workspaces under garbage_collection_testing_dir/mnt/shared are accessible + # /p/lustre3 is inaccessible (real shared mount) + workspace_str = str(workspace) + return workspace_str.startswith( + os.path.join(garbage_collection_testing_dir, "mnt", "shared") + ) and not workspace_str.startswith("/p/lustre3") + mocker.patch.object(gc, "_is_workspace_on_accessible_mount", side_effect=is_accessible_side_effect) + + gc.check_run_workspaces() + + # Only invalid_accessible should be in 'run' issues (Case 1) assert len(gc._issues["run"]) == 1 - assert gc._issues["run"][0] == mock_run + assert gc._issues["run"][0] == invalid_accessible + + # Only inaccessible should be in 'inaccessible_runs' issues (Case 2) + assert len(gc._issues["inaccessible_runs"]) == 1 + assert gc._issues["inaccessible_runs"][0] == inaccessible - def test_check_with_multiple_runs(self, mocker: MockerFixture, gc: DatabaseGarbageCollector, mock_db: MagicMock): + def test_check_logs_inaccessible_count(self, mocker: MockerFixture, gc: DatabaseGarbageCollector, mock_db: MagicMock): """ - Test checking multiple runs with mixed validity. + Test that inaccessible workspace count is logged correctly. Args: mocker: Pytest mocker fixture. gc: DatabaseGarbageCollector instance. mock_db: Mocked database instance. """ - valid_run = MagicMock() - valid_run.get_workspace.return_value = "/valid/path" + run1 = MagicMock() + run1.get_workspace.return_value = "/p/lustre3/ws1" + run1.get_id.return_value = "run-1" - invalid_run = MagicMock() - invalid_run.get_workspace.return_value = "/invalid/path" + run2 = MagicMock() + run2.get_workspace.return_value = "/p/lustre3/ws2" + run2.get_id.return_value = "run-2" - mock_db.runs.get_all.return_value = [valid_run, invalid_run] + run3 = MagicMock() + run3.get_workspace.return_value = "/mnt/shared/ws3" + run3.get_id.return_value = "run-3" - def path_exists_side_effect(path: str): - return path == "/valid/path" + mock_db.runs.get_all.return_value = [run1, run2, run3] - with mocker.patch("os.path.exists", side_effect=path_exists_side_effect): - gc.check_run_workspaces() + mocker.patch.object(gc, "_is_workspace_on_accessible_mount", return_value=False) + mocker.patch("os.path.exists", return_value=False) - assert len(gc._issues["run"]) == 1 - assert gc._issues["run"][0] == invalid_run + mock_log = mocker.patch("merlin.db_scripts.garbage_collector.LOG") + + gc.check_run_workspaces() + + # All 3 should be counted as inaccessible but not invalid (Case 2) + assert len(gc._issues["run"]) == 0 + assert len(gc._issues["inaccessible_runs"]) == 3 + assert gc._issues["inaccessible_runs"][0] == run1 + assert gc._issues["inaccessible_runs"][1] == run2 + assert gc._issues["inaccessible_runs"][2] == run3 + + # Check that warning about 3 inaccessible workspaces was logged + log_calls = mock_log.warning.call_args_list + assert len(log_calls) > 0 + + # Find the summary warning (should be the last one) + summary_warning = str(log_calls[-1]) + assert "3" in summary_warning + + def test_check_with_no_runs(self, gc: DatabaseGarbageCollector, mock_db: MagicMock): + """ + Test checking when there are no runs in the database. + + Args: + gc: DatabaseGarbageCollector instance. + mock_db: Mocked database instance. + """ + mock_db.runs.get_all.return_value = [] + + gc.check_run_workspaces() + + assert len(gc._issues["run"]) == 0 + assert len(gc._issues["inaccessible_runs"]) == 0 class TestCheckOrphanedLogicalWorkers: @@ -640,6 +1164,7 @@ def test_generate_report_with_no_issues(self, gc: DatabaseGarbageCollector): assert "Orphaned Logical Workers: 0" in report assert "Orphaned Physical Workers: 0" in report assert "Empty Studies: 0" in report + assert "Inaccessible Runs: 0" in report def test_generate_report_with_issues( self, @@ -663,6 +1188,9 @@ def test_generate_report_with_issues( gc._issues["logical_worker"] = [mock_logical_worker] gc._issues["physical_worker"] = [mock_physical_worker] gc._issues["study"] = [mock_study] + mock_inaccessible_run = mock_run.copy() + mock_inaccessible_run.get_workspace.return_value = "/path/to/inaccessible/workspace" + gc._issues["inaccessible_runs"] = [mock_inaccessible_run] report = gc.generate_report() @@ -676,6 +1204,8 @@ def test_generate_report_with_issues( assert "hostname" in report assert "Empty Studies: 1" in report assert "test-study" in report + assert "Inaccessible Runs: 1" in report + assert "/path/to/inaccessible/workspace" in report class TestScan: @@ -921,6 +1451,7 @@ def test_full_garbage_collection_workflow(self, mocker: MockerFixture, gc: Datab # Run garbage collection mocker.patch("os.path.exists", return_value=False) + mocker.patch.object(gc, "_is_workspace_on_accessible_mount", return_value=True) gc.scan_and_clean(force=True) # Verify deletions occurred in correct order diff --git a/tests/unit/utils/test_file_sys_utils.py b/tests/unit/utils/test_file_sys_utils.py new file mode 100644 index 00000000..fe1f1d5b --- /dev/null +++ b/tests/unit/utils/test_file_sys_utils.py @@ -0,0 +1,216 @@ +############################################################################## +# Copyright (c) Lawrence Livermore National Security, LLC and other Merlin +# Project developers. See top-level LICENSE and COPYRIGHT files for dates and +# other details. No copyright assignment is required to contribute to Merlin. +############################################################################## + +""" +Tests for file system utility functions of the `merlin/utils.py` module. +""" + +from pathlib import Path + +from pytest_mock import MockerFixture + +from merlin.utils import get_accessible_mounts + + +class TestGetAccessibleMounts: + """Tests for the get_accessible_mounts function.""" + + def test_includes_root_by_default(self, mocker: MockerFixture): + """ + Test that root filesystem ('/') is included when exclude_root=False (default behavior). + + Args: + mocker: Pytest mocker fixture. + """ + mock_partitions = [ + mocker.Mock(mountpoint="/"), + mocker.Mock(mountpoint="/home"), + ] + + mocker.patch("psutil.disk_partitions", return_value=mock_partitions) + + result = get_accessible_mounts(exclude_root=False) + + assert Path("/") in result + assert Path("/home") in result + assert len(result) == 2 + + def test_excludes_root_when_requested(self, mocker: MockerFixture): + """ + Test that root filesystem is excluded when exclude_root=True. + + Args: + mocker: Pytest mocker fixture. + """ + mock_partitions = [ + mocker.Mock(mountpoint="/"), + mocker.Mock(mountpoint="/home"), + mocker.Mock(mountpoint="/mnt/data"), + ] + + mocker.patch("psutil.disk_partitions", return_value=mock_partitions) + + result = get_accessible_mounts(exclude_root=True) + + assert Path("/") not in result + assert Path("/home") in result + assert Path("/mnt/data") in result + assert len(result) == 2 + + def test_handles_empty_partitions(self, mocker: MockerFixture): + """ + Test behavior when no partitions are found. + + Args: + mocker: Pytest mocker fixture. + """ + mocker.patch("psutil.disk_partitions", return_value=[]) + + result = get_accessible_mounts() + + assert result == set() + assert len(result) == 0 + + def test_handles_only_root_partition(self, mocker: MockerFixture): + """ + Test behavior when only root partition exists. + + Args: + mocker: Pytest mocker fixture. + """ + mock_partitions = [mocker.Mock(mountpoint="/")] + + mocker.patch("psutil.disk_partitions", return_value=mock_partitions) + + result = get_accessible_mounts() + + assert result == {Path("/")} + assert len(result) == 1 + + def test_handles_only_root_partition_with_exclude(self, mocker: MockerFixture): + """ + Test that excluding root when only root exists returns empty set. + + Args: + mocker: Pytest mocker fixture. + """ + mock_partitions = [mocker.Mock(mountpoint="/")] + + mocker.patch("psutil.disk_partitions", return_value=mock_partitions) + + result = get_accessible_mounts(exclude_root=True) + + assert result == set() + assert len(result) == 0 + + def test_handles_duplicate_mount_points(self, mocker: MockerFixture): + """ + Test that duplicate mount points are deduplicated (set behavior). + + Args: + mocker: Pytest mocker fixture. + """ + # Some systems might have multiple partitions with same mountpoint + mock_partitions = [ + mocker.Mock(mountpoint="/home"), + mocker.Mock(mountpoint="/home"), # Duplicate + mocker.Mock(mountpoint="/mnt/data"), + ] + + mocker.patch("psutil.disk_partitions", return_value=mock_partitions) + + result = get_accessible_mounts() + + assert len(result) == 2 # /home should only appear once + assert Path("/home") in result + assert Path("/mnt/data") in result + + def test_handles_mount_points_with_spaces(self, mocker: MockerFixture): + """ + Test handling of mount points with spaces in the path. + + Args: + mocker: Pytest mocker fixture. + """ + mock_partitions = [ + mocker.Mock(mountpoint="/mnt/my mount"), + mocker.Mock(mountpoint="/mnt/another mount point"), + ] + + mocker.patch("psutil.disk_partitions", return_value=mock_partitions) + + result = get_accessible_mounts() + + assert Path("/mnt/my mount") in result + assert Path("/mnt/another mount point") in result + + def test_handles_nested_mount_points(self, mocker: MockerFixture): + """ + Test that nested mount points are all included. + + Args: + mocker: Pytest mocker fixture. + """ + mock_partitions = [ + mocker.Mock(mountpoint="/"), + mocker.Mock(mountpoint="/p"), + mocker.Mock(mountpoint="/p/lustre1"), + mocker.Mock(mountpoint="/p/lustre2"), + mocker.Mock(mountpoint="/p/lustre3"), + ] + + mocker.patch("psutil.disk_partitions", return_value=mock_partitions) + + result = get_accessible_mounts() + + assert Path("/") in result + assert Path("/p") in result + assert Path("/p/lustre1") in result + assert Path("/p/lustre2") in result + assert Path("/p/lustre3") in result + assert len(result) == 5 + + def test_exclude_root_with_nested_mounts(self, mocker: MockerFixture): + """ + Test that exclude_root only excludes '/', not nested mounts. + + Args: + mocker: Pytest mocker fixture. + """ + mock_partitions = [ + mocker.Mock(mountpoint="/"), + mocker.Mock(mountpoint="/p"), + mocker.Mock(mountpoint="/p/lustre3"), + ] + + mocker.patch("psutil.disk_partitions", return_value=mock_partitions) + + result = get_accessible_mounts(exclude_root=True) + + assert Path("/") not in result + assert Path("/p") in result + assert Path("/p/lustre3") in result + assert len(result) == 2 + + def test_handles_windows_style_mount_points(self, mocker: MockerFixture): + """ + Test handling of Windows-style drive letters (for cross-platform compatibility). + + Args: + mocker: Pytest mocker fixture. + """ + mock_partitions = [ + mocker.Mock(mountpoint="C:\\"), + mocker.Mock(mountpoint="D:\\"), + ] + + mocker.patch("psutil.disk_partitions", return_value=mock_partitions) + + result = get_accessible_mounts() + + # Path will handle platform-specific paths + assert Path("C:\\") in result or Path("C:/") in result + assert len(result) == 2