Conversation
There was a problem hiding this comment.
Pull request overview
Refactors TableScanner’s multi-database (v2.1) API to use upa as a query parameter (avoiding UPA path parsing issues), improves performance by downloading only the required database for /db/... endpoints, and updates the Workspace client to rely on direct API calls with explicit timeouts and additional logging.
Changes:
- Switched multi-db endpoints from path-based UPA (
/object/{ws_ref}/...) to query-param UPA (?upa=...) and updated docs accordingly. - Introduced targeted single-database download (
download_db_multi) to avoid downloading all DBs for per-DB endpoints. - Disabled KBUtilLib usage in the Workspace client and added request timing logs.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/integration/test_multi_database_routes.py | Adds integration tests for route registration and UPA query parsing; currently needs adjustments to avoid external KBase calls and to cover /db/... endpoints with seeded cache. |
| docs/CONTRIBUTING.md | Updates example route path but leaves outdated parameters/helpers in the snippet. |
| docs/ARCHITECTURE.md | Updates documented multi-db routes to the new query-parameter UPA format. |
| docs/API.md | Updates public API documentation and examples for multi-db endpoints using upa query parameter. |
| app/utils/workspace.py | Disables KBUtilLib path, adds timing logs, renames/introduces new download helpers including download_db_multi. |
| app/utils/init.py | Updates exported workspace download helper name to download_db. |
| app/services/db_helper.py | Updates helper to use download_db for single-db object resolution. |
| app/routes.py | Refactors multi-db routes to accept upa via query param and uses download_db_multi for per-db endpoints. |
Comments suppressed due to low confidence (1)
docs/CONTRIBUTING.md:124
- This example was partially updated to the new
/db/{db_name}/tablesroute, but the snippet still showsws_refas a function parameter and callsdownload_all_pangenome_dbs. Update the example to useupaas aQuery(...)parameter and reflect the new helpers (download_db_multi/download_multi_dbs) so contributors don’t copy a broken pattern.
@router.get("/db/{db_name}/tables")
# Use query parameter for UPA: ?upa={ws_ref}
async def list_tables_in_database(
ws_ref: str,
db_name: str,
authorization: str | None = Header(None),
):
# Download all databases
db_infos = await run_sync_in_thread(
download_all_pangenome_dbs, ws_ref, token, cache_dir, kb_env
)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def setUp(self): | ||
| self.client = TestClient(app) | ||
| self.client.headers["Authorization"] = "Bearer dummy_token" | ||
|
|
There was a problem hiding this comment.
TestUPAPathEncoding.setUp sets a dummy Authorization header, which causes requests to proceed into the handler and attempt real Workspace/Shock calls (making the tests flaky and contradicting the module docstring about no KBase connectivity). To keep this purely local, omit auth so the endpoint returns 401 (still proving routing/query parsing works), or mock/patch the download/workspace client calls.
| def _seed_multi_db_cache(cache_dir: Path, upa: str, db_name: str) -> Path: | ||
| """ | ||
| Place a test database where `download_all_pangenome_dbs` would cache it. | ||
| Layout: {cache_dir}/{safe_upa}/{db_name}/tables.db | ||
| """ | ||
| safe = upa.replace("/", "_").replace(":", "_").replace(" ", "_") | ||
| target_dir = cache_dir / safe / db_name | ||
| db_path = target_dir / "tables.db" | ||
| _create_test_db(db_path) | ||
| return db_path |
There was a problem hiding this comment.
The file defines _seed_multi_db_cache, but there are no assertions that exercise the new multi-db endpoints (/databases, /db/{db_name}/tables, /db/{db_name}/tables/{table}/data) against a locally seeded multi-db cache layout. Adding tests that seed {cache_dir}/{safe_upa}/{db_name}/tables.db and then call the /db/... endpoints would cover the new download_db_multi fast-path and avoid relying on external KBase services.
| from app.utils.workspace import ( | ||
| download_pangenome_db, | ||
| download_all_pangenome_dbs, | ||
| download_db, | ||
| download_multi_dbs, | ||
| download_db_multi, | ||
| get_object_type, | ||
| ) |
There was a problem hiding this comment.
download_db is imported here but doesn’t appear to be used anywhere in this module (only download_multi_dbs / download_db_multi are referenced). Consider removing the unused import to avoid lint/CI failures and keep the dependency surface minimal.
| base_dir = get_upa_cache_path(cache_dir, berdl_table_id) | ||
|
|
||
| # Fast path: if already cached, return without hitting Workspace/Shock | ||
| db_dir = base_dir / db_name | ||
| db_path = db_dir / "tables.db" |
There was a problem hiding this comment.
download_db_multi uses the user-supplied db_name directly in filesystem paths (base_dir / db_name). Since db_name comes from a path parameter, this allows path traversal (e.g. ../..) and writing outside the cache directory. Sanitize/validate db_name (e.g. with sanitize_id) and use the sanitized value for the directory name; ideally reject if sanitization changes the input to avoid ambiguous lookups.
app/utils/workspace.py
Outdated
| # KBUtilLib can hang indefinitely and doesn't respect timeouts | ||
| logger.info(f"Using direct API calls (KBUtilLib disabled) for {self.kb_env}") | ||
| self._use_kbutillib = False | ||
| return |
There was a problem hiding this comment.
_init_client returns early, leaving the remainder of the method permanently unreachable. If KBUtilLib is intentionally disabled, consider deleting the dead code entirely or gating it behind a configuration/feature flag so the control flow is clear and future re-enablement is straightforward.
| return |
| schemas=schemas, | ||
| total_rows=total_rows, | ||
| source="Cache" if target_db["db_path"].exists() else "Downloaded", | ||
| source="Cache" if db_path.exists() else "Downloaded", | ||
| api_version="2.1" |
There was a problem hiding this comment.
source is computed using db_path.exists() after download_db_multi returns. At that point the file will exist both for cache hits and fresh downloads, so this will always report "Cache" and won’t reflect whether the request actually triggered a download. Track existence before calling download_db_multi (or have it return a was_cached flag) and set source based on that.
| # ============================================================================= | ||
| # MULTI-DATABASE ENDPOINTS (Path-based routing) | ||
| # /object/{ws_ref}/databases - List all databases in an object | ||
| # /object/{ws_ref}/db/{db_name}/tables - List tables in a specific database | ||
| # /object/{ws_ref}/db/{db_name}/tables/{table}/data - Query data from specific DB | ||
| # ============================================================================= |
There was a problem hiding this comment.
The section header comment still says "Path-based routing" and lists the old /object/{ws_ref}/... routes, but the endpoints below now use upa as a query parameter. Update/remove the comment block to avoid misleading documentation for future maintainers.
app/utils/workspace.py
Outdated
| # Original KBUtilLib code (disabled): | ||
| # try: | ||
| # if not HAS_KBUTILLIB: | ||
| # raise ImportError("KBUtilLib not found") | ||
| # ... | ||
| # except Exception as e: | ||
| # logger.warning(f"KBUtilLib not available: {e}. Using fallback.") | ||
| # self._use_kbutillib = False |
There was a problem hiding this comment.
This comment appears to contain commented-out code.
| # Original KBUtilLib code (disabled): | |
| # try: | |
| # if not HAS_KBUTILLIB: | |
| # raise ImportError("KBUtilLib not found") | |
| # ... | |
| # except Exception as e: | |
| # logger.warning(f"KBUtilLib not available: {e}. Using fallback.") | |
| # self._use_kbutillib = False | |
| # Historical note: the original implementation attempted to initialize | |
| # KBUtilLib via KBWSUtils/NotebookUtils and, if that failed (for example, | |
| # when KBUtilLib was not available), it logged a warning and fell back to | |
| # direct API calls. That logic has been intentionally disabled because the | |
| # library can hang indefinitely and does not respect timeouts. |
app/utils/workspace.py
Outdated
|
|
||
| # Original KBUtilLib code (disabled): | ||
| # try: | ||
| # if not HAS_KBUTILLIB: | ||
| # raise ImportError("KBUtilLib not found") | ||
| # ... | ||
| # except Exception as e: | ||
| # logger.warning(f"KBUtilLib not available: {e}. Using fallback.") | ||
| # self._use_kbutillib = False | ||
|
|
There was a problem hiding this comment.
This comment appears to contain commented-out code.
| # Original KBUtilLib code (disabled): | |
| # try: | |
| # if not HAS_KBUTILLIB: | |
| # raise ImportError("KBUtilLib not found") | |
| # ... | |
| # except Exception as e: | |
| # logger.warning(f"KBUtilLib not available: {e}. Using fallback.") | |
| # self._use_kbutillib = False | |
app/routes.py
Outdated
| import time | ||
| logger.info(f"[list_databases_in_object] Starting for UPA={upa}, kb_env={kb_env}") | ||
| start_time = time.time() |
There was a problem hiding this comment.
Variable start_time is not used.
| import time | |
| logger.info(f"[list_databases_in_object] Starting for UPA={upa}, kb_env={kb_env}") | |
| start_time = time.time() | |
| logger.info(f"[list_databases_in_object] Starting for UPA={upa}, kb_env={kb_env}") |
… code, and improve source reporting
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Seed a single-db cache (the /db/ routes still call download_all_pangenome_dbs | ||
| # which needs KBase, so we test the single-object endpoints with cached data) |
There was a problem hiding this comment.
The comment mentions download_all_pangenome_dbs but this function has been renamed to download_multi_dbs in this PR. However, the comment appears to be incorrect about the behavior: with this PR, the /db/ routes now use download_db_multi which downloads only a single database, not all databases. Update the comment to reflect the actual behavior.
| # Seed a single-db cache (the /db/ routes still call download_all_pangenome_dbs | |
| # which needs KBase, so we test the single-object endpoints with cached data) | |
| # Seed a single-db cache. The /db/ routes use download_db_multi, which | |
| # downloads a database from KBase, so here we test single-object endpoints | |
| # that operate on locally cached data instead. |
| @router.get("/db/{db_name}/tables") | ||
| # Use query parameter for UPA: ?upa={ws_ref} |
There was a problem hiding this comment.
The code example is inconsistent with the actual implementation after this PR's changes. The example should be updated to:
- Use
upaas a Query parameter instead ofws_refas a Path parameter - Call
download_db_multiinstead ofdownload_all_pangenome_dbs(which has been renamed todownload_multi_dbsand is no longer the recommended approach for single-database access) - Remove the manual search for the target database, since
download_db_multihandles that directly
The updated example should match the pattern used in the actual route handler in app/routes.py.
| # Sanitize db_name to prevent path traversal (e.g., "../../etc") | ||
| safe_db_name = sanitize_id(db_name) | ||
|
|
||
| # Fast path: if already cached, return without hitting Workspace/Shock | ||
| db_dir = base_dir / safe_db_name |
There was a problem hiding this comment.
Inconsistent path sanitization between download_db_multi and download_multi_dbs. The download_db_multi function sanitizes the database name for the cache path (line 645-648), but download_multi_dbs uses the unsanitized database name directly (workspace.py line 567). This means if a database name contains characters that are sanitized (like colons or slashes), the two functions will create different cache directory structures, leading to cache misses and duplicate downloads. Either both functions should sanitize, or neither should. For consistency and security, both should sanitize using sanitize_id(db_name).
| # Sanitize db_name to prevent path traversal (e.g., "../../etc") | |
| safe_db_name = sanitize_id(db_name) | |
| # Fast path: if already cached, return without hitting Workspace/Shock | |
| db_dir = base_dir / safe_db_name | |
| # Validate db_name to prevent path traversal (e.g., "../../etc") | |
| if Path(db_name).name != db_name or "/" in db_name or "\\" in db_name: | |
| raise ValueError(f"Invalid database name: {db_name!r}") | |
| # Fast path: if already cached, return without hitting Workspace/Shock | |
| db_dir = base_dir / db_name |
|
|
||
| def _seed_multi_db_cache(cache_dir: Path, upa: str, db_name: str) -> Path: | ||
| """ | ||
| Place a test database where `download_all_pangenome_dbs` would cache it. |
There was a problem hiding this comment.
The comment mentions download_all_pangenome_dbs but this function has been renamed to download_multi_dbs in this PR. Update the comment to reflect the new function name.
| Place a test database where `download_all_pangenome_dbs` would cache it. | |
| Place a test database where `download_multi_dbs` would cache it. |
This pull request refactors the API endpoints for multi-database support, improves efficiency by switching from full-object downloads to targeted single-database downloads, and disables a problematic library in favor of direct API calls with proper timeout handling. The changes also update naming conventions for clarity and add logging to help with debugging and performance tracking.
API endpoint refactoring and parameter changes:
upaquery parameter instead of a path parameter for workspace object references, making the API more consistent and easier to use. [1] [2] [3] [4] [5]upaand provide examples for multi-database objects. [1] [2] [3]Performance and efficiency improvements:
download_all_pangenome_dbs) with targeted single-database downloads (download_db_multi), reducing unnecessary data transfer and improving performance. [1] [2] [3]download_dbanddownload_db_multifunctions. [1] [2] [3] [4] [5]Workspace client reliability and logging:
KBaseClientclass due to unreliable timeout handling, switching to direct API calls with explicit timeouts and improved error handling. [1] [2]Naming and documentation updates:
download_pangenome_dbtodownload_dband updated documentation/comments to reflect the broader multi-database support. [1] [2] [3]Bug fixes and minor improvements:
These changes collectively make the API more robust, efficient, and easier to use for multi-database scenarios.