Vibhav dev by VibhavSetlur · Pull Request #11 · kbase/tablescanner

VibhavSetlur · 2026-02-11T13:29:18Z

This pull request refactors the API endpoints for multi-database support, improves efficiency by switching from full-object downloads to targeted single-database downloads, and disables KBUtilLib in the Workspace client in favor of direct API calls with explicit timeouts. It also updates naming conventions for clarity and adds logging to help with debugging and performance tracking.

API endpoint refactoring and parameter changes:

Changed all affected endpoints to use the upa query parameter instead of a path parameter for workspace object references, making the API more consistent and easier to use.
Updated endpoint documentation to clarify the use of upa and provide examples for multi-database objects.
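
To illustrate why a query parameter is easier to handle than a path segment: a UPA contains slashes, so it has to be percent-encoded wherever it appears in a URL, and a query string round-trips that encoding without interfering with route matching. The following is a minimal sketch using only the standard library. The host, the UPA value `76990/7/2`, the database name, and the `build_tables_url` helper are made up for illustration and are not part of TableScanner.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def build_tables_url(base_url: str, db_name: str, upa: str) -> str:
    """Build a /db/{db_name}/tables URL with the UPA as a query parameter."""
    path = f"/db/{quote(db_name, safe='')}/tables"
    # urlencode percent-encodes the slashes in the UPA (76990/7/2 -> 76990%2F7%2F2).
    return f"{base_url}{path}?{urlencode({'upa': upa})}"


url = build_tables_url("http://localhost:8000", "pangenome", "76990/7/2")
print(url)

# Round trip: parsing the query string recovers the UPA unchanged.
recovered = parse_qs(urlsplit(url).query)["upa"][0]
print(recovered)
```

With the UPA in the query string, the path contains only the fixed route segments and the database name, so route matching never sees the slashes inside the UPA.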

Performance and efficiency improvements:

Replaced bulk database downloads (download_all_pangenome_dbs) with targeted single-database downloads (download_db_multi), reducing unnecessary data transfer and improving performance.
Updated helper functions and imports throughout the codebase to use the new download_db and download_db_multi functions.

Workspace client reliability and logging:

Disabled the use of KBUtilLib in the KBaseClient class due to unreliable timeout handling, switching to direct API calls with explicit timeouts and improved error handling.
Added detailed logging and timing information to workspace client methods and database download functions for easier debugging and performance analysis.

Naming and documentation updates:

Renamed download_pangenome_db to download_db and updated documentation/comments to reflect the broader multi-database support.
Clarified terminology in docstrings and comments, replacing "pangenome" with "database" where appropriate.

Bug fixes and minor improvements:

Fixed a bug in the response source reporting for table queries, ensuring the correct cache/download status is shown.

These changes collectively make the API more robust, efficient, and easier to use for multi-database scenarios.

Copilot

Pull request overview

Refactors TableScanner’s multi-database (v2.1) API to use upa as a query parameter (avoiding UPA path parsing issues), improves performance by downloading only the required database for /db/... endpoints, and updates the Workspace client to rely on direct API calls with explicit timeouts and additional logging.

Changes:

Switched multi-db endpoints from path-based UPA (/object/{ws_ref}/...) to query-param UPA (?upa=...) and updated docs accordingly.
Introduced targeted single-database download (download_db_multi) to avoid downloading all DBs for per-DB endpoints.
Disabled KBUtilLib usage in the Workspace client and added request timing logs.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.

File	Description
tests/integration/test_multi_database_routes.py	Adds integration tests for route registration and UPA query parsing; currently needs adjustments to avoid external KBase calls and to cover `/db/...` endpoints with seeded cache.
docs/CONTRIBUTING.md	Updates example route path but leaves outdated parameters/helpers in the snippet.
docs/ARCHITECTURE.md	Updates documented multi-db routes to the new query-parameter UPA format.
docs/API.md	Updates public API documentation and examples for multi-db endpoints using `upa` query parameter.
app/utils/workspace.py	Disables KBUtilLib path, adds timing logs, renames/introduces new download helpers including `download_db_multi`.
app/utils/__init__.py	Updates exported workspace download helper name to `download_db`.
app/services/db_helper.py	Updates helper to use `download_db` for single-db object resolution.
app/routes.py	Refactors multi-db routes to accept `upa` via query param and uses `download_db_multi` for per-db endpoints.

Comments suppressed due to low confidence (1)

docs/CONTRIBUTING.md:124

This example was partially updated to the new /db/{db_name}/tables route, but the snippet still shows ws_ref as a function parameter and calls download_all_pangenome_dbs. Update the example to use upa as a Query(...) parameter and reflect the new helpers (download_db_multi / download_multi_dbs) so contributors don’t copy a broken pattern.

@router.get("/db/{db_name}/tables")
# Use query parameter for UPA: ?upa={ws_ref}
async def list_tables_in_database(
    ws_ref: str,
    db_name: str,
    authorization: str | None = Header(None),
):
    # Download all databases
    db_infos = await run_sync_in_thread(
        download_all_pangenome_dbs, ws_ref, token, cache_dir, kb_env
    )

Copilot · 2026-02-11T13:38:43Z

tests/integration/test_multi_database_routes.py

+    def setUp(self):
+        self.client = TestClient(app)
+        self.client.headers["Authorization"] = "Bearer dummy_token"
+


TestUPAPathEncoding.setUp sets a dummy Authorization header, which causes requests to proceed into the handler and attempt real Workspace/Shock calls (making the tests flaky and contradicting the module docstring about no KBase connectivity). To keep this purely local, omit auth so the endpoint returns 401 (still proving routing/query parsing works), or mock/patch the download/workspace client calls.

Copilot · 2026-02-11T13:38:43Z

tests/integration/test_multi_database_routes.py

+def _seed_multi_db_cache(cache_dir: Path, upa: str, db_name: str) -> Path:
+    """
+    Place a test database where `download_all_pangenome_dbs` would cache it.
+    Layout: {cache_dir}/{safe_upa}/{db_name}/tables.db
+    """
+    safe = upa.replace("/", "_").replace(":", "_").replace(" ", "_")
+    target_dir = cache_dir / safe / db_name
+    db_path = target_dir / "tables.db"
+    _create_test_db(db_path)
+    return db_path


The file defines _seed_multi_db_cache, but there are no assertions that exercise the new multi-db endpoints (/databases, /db/{db_name}/tables, /db/{db_name}/tables/{table}/data) against a locally seeded multi-db cache layout. Adding tests that seed {cache_dir}/{safe_upa}/{db_name}/tables.db and then call the /db/... endpoints would cover the new download_db_multi fast-path and avoid relying on external KBase services.
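
As a rough sketch of the seeding half of such a test, the snippet below creates the `{cache_dir}/{safe_upa}/{db_name}/tables.db` layout in a temporary directory and reads the table list back with `sqlite3`. The table schema, the UPA, and the `list_tables` helper are assumptions for illustration; a real test would call the `/db/...` endpoints through the test client instead of querying the file directly.

```python
import sqlite3
import tempfile
from pathlib import Path


def seed_multi_db_cache(cache_dir: Path, upa: str, db_name: str) -> Path:
    """Create {cache_dir}/{safe_upa}/{db_name}/tables.db with one small table."""
    safe = upa.replace("/", "_").replace(":", "_").replace(" ", "_")
    db_path = cache_dir / safe / db_name / "tables.db"
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE genes (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("INSERT INTO genes (name) VALUES ('dnaA')")
        conn.commit()
    finally:
        conn.close()
    return db_path


def list_tables(db_path: Path) -> list[str]:
    """Return table names, roughly what a tables endpoint would report."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


with tempfile.TemporaryDirectory() as tmp:
    path = seed_multi_db_cache(Path(tmp), "76990/7/2", "pangenome")
    relative = path.relative_to(tmp).as_posix()
    tables = list_tables(path)

print(relative, tables)
```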

Copilot · 2026-02-11T13:38:43Z

app/routes.py

 from app.utils.workspace import (
-    download_pangenome_db,
-    download_all_pangenome_dbs,
+    download_db,
+    download_multi_dbs,
+    download_db_multi,
    get_object_type,
 )


download_db is imported here but doesn’t appear to be used anywhere in this module (only download_multi_dbs / download_db_multi are referenced). Consider removing the unused import to avoid lint/CI failures and keep the dependency surface minimal.

Copilot · 2026-02-11T13:38:44Z

app/utils/workspace.py

+    base_dir = get_upa_cache_path(cache_dir, berdl_table_id)
+
+    # Fast path: if already cached, return without hitting Workspace/Shock
+    db_dir = base_dir / db_name
+    db_path = db_dir / "tables.db"


download_db_multi uses the user-supplied db_name directly in filesystem paths (base_dir / db_name). Since db_name comes from a path parameter, this allows path traversal (e.g. ../..) and writing outside the cache directory. Sanitize/validate db_name (e.g. with sanitize_id) and use the sanitized value for the directory name; ideally reject if sanitization changes the input to avoid ambiguous lookups.
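
A minimal sketch of the "reject rather than rewrite" approach described above. The allowlist pattern and both helper names are assumptions; the project's own `sanitize_id` is not reproduced here and may behave differently.

```python
import re
from pathlib import Path

# Assumed allowlist: letters, digits, underscore, dot, hyphen.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def validate_db_name(db_name: str) -> str:
    """Return db_name unchanged if it is a safe single path component, else raise."""
    if db_name in {".", ".."} or not _SAFE_NAME.fullmatch(db_name):
        raise ValueError(f"Invalid database name: {db_name!r}")
    return db_name


def db_cache_path(base_dir: Path, db_name: str) -> Path:
    """Build the cache path only after the name has been validated."""
    return base_dir / validate_db_name(db_name) / "tables.db"


ok = db_cache_path(Path("/cache/76990_7_2"), "pangenome")
print(ok)

try:
    db_cache_path(Path("/cache/76990_7_2"), "../../etc")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Rejecting instead of silently rewriting means two different inputs can never map to the same cache directory, which is the ambiguity the comment warns about.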

Copilot · 2026-02-11T13:38:44Z

app/utils/workspace.py

+        # KBUtilLib can hang indefinitely and doesn't respect timeouts
+        logger.info(f"Using direct API calls (KBUtilLib disabled) for {self.kb_env}")
+        self._use_kbutillib = False
+        return


_init_client returns early, leaving the remainder of the method permanently unreachable. If KBUtilLib is intentionally disabled, consider deleting the dead code entirely or gating it behind a configuration/feature flag so the control flow is clear and future re-enablement is straightforward.

Suggested change

-        return
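
The feature-flag alternative mentioned in this comment might look roughly like the sketch below. The class, the environment variable name `TABLESCANNER_USE_KBUTILLIB`, and the `has_kbutillib` argument are all hypothetical; the real `KBaseClient` and its import probe are not shown in this PR excerpt.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger(__name__)

# Assumed flag name for illustration; not an existing TableScanner setting.
FLAG = "TABLESCANNER_USE_KBUTILLIB"


class KBaseClientSketch:
    """Hypothetical stand-in for the real client; only the init flow is shown."""

    def __init__(
        self,
        kb_env: str = "appdev",
        has_kbutillib: bool = False,
        environ: Optional[Mapping[str, str]] = None,
    ) -> None:
        self.kb_env = kb_env
        self._has_kbutillib = has_kbutillib
        self._environ = os.environ if environ is None else environ
        self._use_kbutillib = False
        self._init_client()

    def _init_client(self) -> None:
        # Off by default: the library is only used when explicitly enabled.
        if self._environ.get(FLAG, "0") != "1":
            logger.info("Using direct API calls (KBUtilLib disabled) for %s", self.kb_env)
            return
        if not self._has_kbutillib:
            logger.warning("KBUtilLib not available. Using fallback.")
            return
        self._use_kbutillib = True


default_client = KBaseClientSketch(environ={})
opted_in = KBaseClientSketch(has_kbutillib=True, environ={FLAG: "1"})
print(default_client._use_kbutillib, opted_in._use_kbutillib)
```

With this shape every branch is reachable, and re-enabling the library later is a configuration change rather than a code edit.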

Copilot · 2026-02-11T13:38:44Z

app/routes.py

            schemas=schemas,
            total_rows=total_rows,
-            source="Cache" if target_db["db_path"].exists() else "Downloaded",
+            source="Cache" if db_path.exists() else "Downloaded",
            api_version="2.1"


source is computed using db_path.exists() after download_db_multi returns. At that point the file will exist both for cache hits and fresh downloads, so this will always report "Cache" and won’t reflect whether the request actually triggered a download. Track existence before calling download_db_multi (or have it return a was_cached flag) and set source based on that.
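
One way to implement the `was_cached` idea is to have the download helper report whether the file existed before it did any work. The sketch below uses placeholder names (`fetch_db`, `DownloadResult`, `fake_download`) and a fake download step; it does not reflect the actual signature of `download_db_multi`.

```python
import tempfile
from pathlib import Path
from typing import Callable, NamedTuple


class DownloadResult(NamedTuple):
    db_path: Path
    was_cached: bool


def fetch_db(db_path: Path, download: Callable[[Path], None]) -> DownloadResult:
    """Return the DB path plus whether it was already on disk before this call."""
    if db_path.exists():
        return DownloadResult(db_path, True)
    db_path.parent.mkdir(parents=True, exist_ok=True)
    download(db_path)
    return DownloadResult(db_path, False)


def fake_download(target: Path) -> None:
    # Placeholder for the real Workspace/Shock download.
    target.write_bytes(b"placeholder")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "76990_7_2" / "pangenome" / "tables.db"
    first = fetch_db(target, fake_download)
    second = fetch_db(target, fake_download)
    sources = ["Cache" if r.was_cached else "Downloaded" for r in (first, second)]

print(sources)  # ['Downloaded', 'Cache']
```

Because the existence check happens inside the helper, before the download, the reported source cannot be skewed by the file having just been written.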

Copilot · 2026-02-11T13:38:44Z

app/routes.py

 # =============================================================================
 # MULTI-DATABASE ENDPOINTS (Path-based routing)
 # /object/{ws_ref}/databases - List all databases in an object
 # /object/{ws_ref}/db/{db_name}/tables - List tables in a specific database
 # /object/{ws_ref}/db/{db_name}/tables/{table}/data - Query data from specific DB
 # =============================================================================


The section header comment still says "Path-based routing" and lists the old /object/{ws_ref}/... routes, but the endpoints below now use upa as a query parameter. Update/remove the comment block to avoid misleading documentation for future maintainers.

Copilot · 2026-02-11T13:38:45Z

app/utils/workspace.py

+        # Original KBUtilLib code (disabled):
+        # try:
+        #     if not HAS_KBUTILLIB:
+        #         raise ImportError("KBUtilLib not found")
+        #     ...
+        # except Exception as e:
+        #     logger.warning(f"KBUtilLib not available: {e}. Using fallback.")
+        #     self._use_kbutillib = False


This comment appears to contain commented-out code.

Suggested change

-        # Original KBUtilLib code (disabled):
-        # try:
-        #     if not HAS_KBUTILLIB:
-        #         raise ImportError("KBUtilLib not found")
-        #     ...
-        # except Exception as e:
-        #     logger.warning(f"KBUtilLib not available: {e}. Using fallback.")
-        #     self._use_kbutillib = False
+        # Historical note: the original implementation attempted to initialize
+        # KBUtilLib via KBWSUtils/NotebookUtils and, if that failed (for example,
+        # when KBUtilLib was not available), it logged a warning and fell back to
+        # direct API calls. That logic has been intentionally disabled because the
+        # library can hang indefinitely and does not respect timeouts.

Copilot · 2026-02-11T13:38:45Z

app/utils/workspace.py

+
+        # Original KBUtilLib code (disabled):
+        # try:
+        #     if not HAS_KBUTILLIB:
+        #         raise ImportError("KBUtilLib not found")
+        #     ...
+        # except Exception as e:
+        #     logger.warning(f"KBUtilLib not available: {e}. Using fallback.")
+        #     self._use_kbutillib = False



This comment appears to contain commented-out code.

Suggested change

-        # Original KBUtilLib code (disabled):
-        # try:
-        #     if not HAS_KBUTILLIB:
-        #         raise ImportError("KBUtilLib not found")
-        #     ...
-        # except Exception as e:
-        #     logger.warning(f"KBUtilLib not available: {e}. Using fallback.")
-        #     self._use_kbutillib = False

Copilot · 2026-02-11T13:38:45Z

app/routes.py

+    import time
+    logger.info(f"[list_databases_in_object] Starting for UPA={upa}, kb_env={kb_env}")
+    start_time = time.time()


Variable start_time is not used.

Suggested change

-    import time
-    logger.info(f"[list_databases_in_object] Starting for UPA={upa}, kb_env={kb_env}")
-    start_time = time.time()
+    logger.info(f"[list_databases_in_object] Starting for UPA={upa}, kb_env={kb_env}")
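
If the timing was meant to be kept rather than dropped, an alternative to the suggestion is to actually log the elapsed time. A small context manager makes that hard to forget. This is a sketch under assumptions: the logger name, `log_duration`, and the placeholder `list_databases` function are illustrative, not code from the PR.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("tablescanner.sketch")


@contextmanager
def log_duration(label: str):
    """Log how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("[%s] finished in %.3fs", label, elapsed)


def list_databases(upa: str) -> list[str]:
    """Placeholder for a handler body; returns a fixed list instead of real data."""
    with log_duration("list_databases_in_object"):
        return ["pangenome"]
```

`time.perf_counter()` is used here instead of `time.time()` because it is monotonic and intended for measuring intervals.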

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Copilot · 2026-02-11T13:54:23Z

tests/integration/test_multi_database_routes.py

+        # Seed a single-db cache (the /db/ routes still call download_all_pangenome_dbs
+        # which needs KBase, so we test the single-object endpoints with cached data)


The comment mentions download_all_pangenome_dbs but this function has been renamed to download_multi_dbs in this PR. However, the comment appears to be incorrect about the behavior: with this PR, the /db/ routes now use download_db_multi which downloads only a single database, not all databases. Update the comment to reflect the actual behavior.

Suggested change

-        # Seed a single-db cache (the /db/ routes still call download_all_pangenome_dbs
-        # which needs KBase, so we test the single-object endpoints with cached data)
+        # Seed a single-db cache. The /db/ routes use download_db_multi, which
+        # downloads a database from KBase, so here we test single-object endpoints
+        # that operate on locally cached data instead.

Copilot · 2026-02-11T13:54:24Z

docs/CONTRIBUTING.md

+@router.get("/db/{db_name}/tables")
+# Use query parameter for UPA: ?upa={ws_ref}


The code example is inconsistent with the actual implementation after this PR's changes. The example should be updated to:

Use upa as a Query parameter instead of ws_ref as a Path parameter

Call download_db_multi instead of download_all_pangenome_dbs (which has been renamed to download_multi_dbs and is no longer the recommended approach for single-database access)

Remove the manual search for the target database, since download_db_multi handles that directly

The updated example should match the pattern used in the actual route handler in app/routes.py.

Copilot · 2026-02-11T13:54:24Z

app/utils/workspace.py

+    # Sanitize db_name to prevent path traversal (e.g., "../../etc")
+    safe_db_name = sanitize_id(db_name)
+
+    # Fast path: if already cached, return without hitting Workspace/Shock
+    db_dir = base_dir / safe_db_name


Inconsistent path sanitization between download_db_multi and download_multi_dbs. The download_db_multi function sanitizes the database name for the cache path (line 645-648), but download_multi_dbs uses the unsanitized database name directly (workspace.py line 567). This means if a database name contains characters that are sanitized (like colons or slashes), the two functions will create different cache directory structures, leading to cache misses and duplicate downloads. Either both functions should sanitize, or neither should. For consistency and security, both should sanitize using sanitize_id(db_name).

Suggested change

-    # Sanitize db_name to prevent path traversal (e.g., "../../etc")
-    safe_db_name = sanitize_id(db_name)
-
-    # Fast path: if already cached, return without hitting Workspace/Shock
-    db_dir = base_dir / safe_db_name
+    # Validate db_name to prevent path traversal (e.g., "../../etc")
+    if Path(db_name).name != db_name or "/" in db_name or "\\" in db_name:
+        raise ValueError(f"Invalid database name: {db_name!r}")
+
+    # Fast path: if already cached, return without hitting Workspace/Shock
+    db_dir = base_dir / db_name
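
To keep the two download functions from drifting apart again, both could derive the cache directory from one shared helper. The sketch below is hypothetical: `db_cache_dir` and the two wrapper functions only stand in for `download_db_multi` and `download_multi_dbs`, and the containment check via `Path.resolve()` is one possible validation, not the project's.

```python
from pathlib import Path


def db_cache_dir(base_dir: Path, db_name: str) -> Path:
    """Single source of truth for where a named database is cached."""
    root = base_dir.resolve()
    candidate = (base_dir / db_name).resolve()
    # The result must be a direct child of the cache root, nothing else.
    if candidate.parent != root:
        raise ValueError(f"Invalid database name: {db_name!r}")
    return candidate


def path_for_single_download(base_dir: Path, db_name: str) -> Path:
    # Stands in for the path logic inside download_db_multi.
    return db_cache_dir(base_dir, db_name) / "tables.db"


def paths_for_bulk_download(base_dir: Path, db_names: list[str]) -> dict[str, Path]:
    # Stands in for the path logic inside download_multi_dbs.
    return {name: db_cache_dir(base_dir, name) / "tables.db" for name in db_names}


base = Path("/var/cache/tablescanner/76990_7_2")
single = path_for_single_download(base, "pangenome")
bulk = paths_for_bulk_download(base, ["pangenome", "genomes"])
print(single == bulk["pangenome"])

try:
    db_cache_dir(base, "../../etc")
    traversal_blocked = False
except ValueError:
    traversal_blocked = True
print(traversal_blocked)
```

Since both code paths call the same function, a database cached by the bulk download is always found by the single-database fast path, and vice versa.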

Copilot · 2026-02-11T13:54:24Z

tests/integration/test_multi_database_routes.py

+
+def _seed_multi_db_cache(cache_dir: Path, upa: str, db_name: str) -> Path:
+    """
+    Place a test database where `download_all_pangenome_dbs` would cache it.


The comment mentions download_all_pangenome_dbs but this function has been renamed to download_multi_dbs in this PR. Update the comment to reflect the new function name.

Suggested change

-    Place a test database where `download_all_pangenome_dbs` would cache it.
+    Place a test database where `download_multi_dbs` would cache it.

VibhavSetlur added 2 commits February 9, 2026 12:58

fix: use query params for UPA in multi-database endpoints to prevent path parsing issues

e667d58

fix: use query params for UPA in multi-db routes and add tests

78e0c85

VibhavSetlur requested a review from Copilot February 11, 2026 13:33

Copilot started reviewing on behalf of VibhavSetlur February 11, 2026 13:33

VibhavSetlur self-assigned this Feb 11, 2026

Copilot AI reviewed Feb 11, 2026

Address Copilot code review findings: Fix path traversal, remove dead code, and improve source reporting

40b60d5

VibhavSetlur requested a review from Copilot February 11, 2026 13:47

Copilot started reviewing on behalf of VibhavSetlur February 11, 2026 13:47

Copilot AI reviewed Feb 11, 2026

Update test comments to reflect download_multi_dbs rename

26e2a34

VibhavSetlur merged commit ccf7352 into develop Feb 11, 2026
6 checks passed
