Skip to content

Vibhav dev#11

Merged
VibhavSetlur merged 4 commits intodevelopfrom
vibhav-dev
Feb 11, 2026
Merged

Vibhav dev#11
VibhavSetlur merged 4 commits intodevelopfrom
vibhav-dev

Conversation

@VibhavSetlur
Copy link
Collaborator

This pull request refactors the API endpoints for multi-database support, improves efficiency by switching from full-object downloads to targeted single-database downloads, and disables a problematic library in favor of direct API calls with proper timeout handling. The changes also update naming conventions for clarity and add logging to help with debugging and performance tracking.

API endpoint refactoring and parameter changes:

  • Changed all affected endpoints to use the upa query parameter instead of a path parameter for workspace object references, making the API more consistent and easier to use. [1] [2] [3] [4] [5]
  • Updated endpoint documentation to clarify the use of upa and provide examples for multi-database objects. [1] [2] [3]

Performance and efficiency improvements:

  • Replaced bulk database downloads (download_all_pangenome_dbs) with targeted single-database downloads (download_db_multi), reducing unnecessary data transfer and improving performance. [1] [2] [3]
  • Updated helper functions and imports throughout the codebase to use the new download_db and download_db_multi functions. [1] [2] [3] [4] [5]

Workspace client reliability and logging:

  • Disabled the use of KBUtilLib in the KBaseClient class due to unreliable timeout handling, switching to direct API calls with explicit timeouts and improved error handling. [1] [2]
  • Added detailed logging and timing information to workspace client methods and database download functions for easier debugging and performance analysis. [1] [2] [3] [4]

Naming and documentation updates:

  • Renamed download_pangenome_db to download_db and updated documentation/comments to reflect the broader multi-database support. [1] [2] [3]
  • Clarified terminology in docstrings and comments, replacing "pangenome" with "database" where appropriate. [1] [2]

Bug fixes and minor improvements:

  • Fixed a bug in the response source reporting for table queries, ensuring the correct cache/download status is shown.

These changes collectively make the API more robust, efficient, and easier to use for multi-database scenarios.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors TableScanner’s multi-database (v2.1) API to use upa as a query parameter (avoiding UPA path parsing issues), improves performance by downloading only the required database for /db/... endpoints, and updates the Workspace client to rely on direct API calls with explicit timeouts and additional logging.

Changes:

  • Switched multi-db endpoints from path-based UPA (/object/{ws_ref}/...) to query-param UPA (?upa=...) and updated docs accordingly.
  • Introduced targeted single-database download (download_db_multi) to avoid downloading all DBs for per-DB endpoints.
  • Disabled KBUtilLib usage in the Workspace client and added request timing logs.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
tests/integration/test_multi_database_routes.py Adds integration tests for route registration and UPA query parsing; currently needs adjustments to avoid external KBase calls and to cover /db/... endpoints with seeded cache.
docs/CONTRIBUTING.md Updates example route path but leaves outdated parameters/helpers in the snippet.
docs/ARCHITECTURE.md Updates documented multi-db routes to the new query-parameter UPA format.
docs/API.md Updates public API documentation and examples for multi-db endpoints using upa query parameter.
app/utils/workspace.py Disables KBUtilLib path, adds timing logs, renames/introduces new download helpers including download_db_multi.
app/utils/init.py Updates exported workspace download helper name to download_db.
app/services/db_helper.py Updates helper to use download_db for single-db object resolution.
app/routes.py Refactors multi-db routes to accept upa via query param and uses download_db_multi for per-db endpoints.
Comments suppressed due to low confidence (1)

docs/CONTRIBUTING.md:124

  • This example was partially updated to the new /db/{db_name}/tables route, but the snippet still shows ws_ref as a function parameter and calls download_all_pangenome_dbs. Update the example to use upa as a Query(...) parameter and reflect the new helpers (download_db_multi / download_multi_dbs) so contributors don’t copy a broken pattern.
@router.get("/db/{db_name}/tables")
# Use query parameter for UPA: ?upa={ws_ref}
async def list_tables_in_database(
    ws_ref: str,
    db_name: str,
    authorization: str | None = Header(None),
):
    # Download all databases
    db_infos = await run_sync_in_thread(
        download_all_pangenome_dbs, ws_ref, token, cache_dir, kb_env
    )

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +238 to +241
def setUp(self):
self.client = TestClient(app)
self.client.headers["Authorization"] = "Bearer dummy_token"

Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestUPAPathEncoding.setUp sets a dummy Authorization header, which causes requests to proceed into the handler and attempt real Workspace/Shock calls (making the tests flaky and contradicting the module docstring about no KBase connectivity). To keep this purely local, omit auth so the endpoint returns 401 (still proving routing/query parsing works), or mock/patch the download/workspace client calls.

Copilot uses AI. Check for mistakes.
Comment on lines +80 to +89
def _seed_multi_db_cache(cache_dir: Path, upa: str, db_name: str) -> Path:
"""
Place a test database where `download_all_pangenome_dbs` would cache it.
Layout: {cache_dir}/{safe_upa}/{db_name}/tables.db
"""
safe = upa.replace("/", "_").replace(":", "_").replace(" ", "_")
target_dir = cache_dir / safe / db_name
db_path = target_dir / "tables.db"
_create_test_db(db_path)
return db_path
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file defines _seed_multi_db_cache, but there are no assertions that exercise the new multi-db endpoints (/databases, /db/{db_name}/tables, /db/{db_name}/tables/{table}/data) against a locally seeded multi-db cache layout. Adding tests that seed {cache_dir}/{safe_upa}/{db_name}/tables.db and then call the /db/... endpoints would cover the new download_db_multi fast-path and avoid relying on external KBase services.

Copilot uses AI. Check for mistakes.
Comment on lines 46 to 51
from app.utils.workspace import (
download_pangenome_db,
download_all_pangenome_dbs,
download_db,
download_multi_dbs,
download_db_multi,
get_object_type,
)
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

download_db is imported here but doesn’t appear to be used anywhere in this module (only download_multi_dbs / download_db_multi are referenced). Consider removing the unused import to avoid lint/CI failures and keep the dependency surface minimal.

Copilot uses AI. Check for mistakes.
Comment on lines +652 to +656
base_dir = get_upa_cache_path(cache_dir, berdl_table_id)

# Fast path: if already cached, return without hitting Workspace/Shock
db_dir = base_dir / db_name
db_path = db_dir / "tables.db"
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

download_db_multi uses the user-supplied db_name directly in filesystem paths (base_dir / db_name). Since db_name comes from a path parameter, this allows path traversal (e.g. ../..) and writing outside the cache directory. Sanitize/validate db_name (e.g. with sanitize_id) and use the sanitized value for the directory name; ideally reject if sanitization changes the input to avoid ambiguous lookups.

Copilot uses AI. Check for mistakes.
# KBUtilLib can hang indefinitely and doesn't respect timeouts
logger.info(f"Using direct API calls (KBUtilLib disabled) for {self.kb_env}")
self._use_kbutillib = False
return
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_init_client returns early, leaving the remainder of the method permanently unreachable. If KBUtilLib is intentionally disabled, consider deleting the dead code entirely or gating it behind a configuration/feature flag so the control flow is clear and future re-enablement is straightforward.

Suggested change
return

Copilot uses AI. Check for mistakes.
Comment on lines 884 to 887
schemas=schemas,
total_rows=total_rows,
source="Cache" if target_db["db_path"].exists() else "Downloaded",
source="Cache" if db_path.exists() else "Downloaded",
api_version="2.1"
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source is computed using db_path.exists() after download_db_multi returns. At that point the file will exist both for cache hits and fresh downloads, so this will always report "Cache" and won’t reflect whether the request actually triggered a download. Track existence before calling download_db_multi (or have it return a was_cached flag) and set source based on that.

Copilot uses AI. Check for mistakes.
Comment on lines 678 to 683
# =============================================================================
# MULTI-DATABASE ENDPOINTS (Path-based routing)
# /object/{ws_ref}/databases - List all databases in an object
# /object/{ws_ref}/db/{db_name}/tables - List tables in a specific database
# /object/{ws_ref}/db/{db_name}/tables/{table}/data - Query data from specific DB
# =============================================================================
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The section header comment still says "Path-based routing" and lists the old /object/{ws_ref}/... routes, but the endpoints below now use upa as a query parameter. Update/remove the comment block to avoid misleading documentation for future maintainers.

Copilot uses AI. Check for mistakes.
Comment on lines +75 to +82
# Original KBUtilLib code (disabled):
# try:
# if not HAS_KBUTILLIB:
# raise ImportError("KBUtilLib not found")
# ...
# except Exception as e:
# logger.warning(f"KBUtilLib not available: {e}. Using fallback.")
# self._use_kbutillib = False
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment appears to contain commented-out code.

Suggested change
# Original KBUtilLib code (disabled):
# try:
# if not HAS_KBUTILLIB:
# raise ImportError("KBUtilLib not found")
# ...
# except Exception as e:
# logger.warning(f"KBUtilLib not available: {e}. Using fallback.")
# self._use_kbutillib = False
# Historical note: the original implementation attempted to initialize
# KBUtilLib via KBWSUtils/NotebookUtils and, if that failed (for example,
# when KBUtilLib was not available), it logged a warning and fell back to
# direct API calls. That logic has been intentionally disabled because the
# library can hang indefinitely and does not respect timeouts.

Copilot uses AI. Check for mistakes.
Comment on lines 74 to 83

# Original KBUtilLib code (disabled):
# try:
# if not HAS_KBUTILLIB:
# raise ImportError("KBUtilLib not found")
# ...
# except Exception as e:
# logger.warning(f"KBUtilLib not available: {e}. Using fallback.")
# self._use_kbutillib = False

Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment appears to contain commented-out code.

Suggested change
# Original KBUtilLib code (disabled):
# try:
# if not HAS_KBUTILLIB:
# raise ImportError("KBUtilLib not found")
# ...
# except Exception as e:
# logger.warning(f"KBUtilLib not available: {e}. Using fallback.")
# self._use_kbutillib = False

Copilot uses AI. Check for mistakes.
app/routes.py Outdated
Comment on lines +714 to +716
import time
logger.info(f"[list_databases_in_object] Starting for UPA={upa}, kb_env={kb_env}")
start_time = time.time()
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable start_time is not used.

Suggested change
import time
logger.info(f"[list_databases_in_object] Starting for UPA={upa}, kb_env={kb_env}")
start_time = time.time()
logger.info(f"[list_databases_in_object] Starting for UPA={upa}, kb_env={kb_env}")

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +295 to +296
# Seed a single-db cache (the /db/ routes still call download_all_pangenome_dbs
# which needs KBase, so we test the single-object endpoints with cached data)
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment mentions download_all_pangenome_dbs but this function has been renamed to download_multi_dbs in this PR. However, the comment appears to be incorrect about the behavior: with this PR, the /db/ routes now use download_db_multi which downloads only a single database, not all databases. Update the comment to reflect the actual behavior.

Suggested change
# Seed a single-db cache (the /db/ routes still call download_all_pangenome_dbs
# which needs KBase, so we test the single-object endpoints with cached data)
# Seed a single-db cache. The /db/ routes use download_db_multi, which
# downloads a database from KBase, so here we test single-object endpoints
# that operate on locally cached data instead.

Copilot uses AI. Check for mistakes.
Comment on lines +114 to +115
@router.get("/db/{db_name}/tables")
# Use query parameter for UPA: ?upa={ws_ref}
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code example is inconsistent with the actual implementation after this PR's changes. The example should be updated to:

  1. Use upa as a Query parameter instead of ws_ref as a Path parameter
  2. Call download_db_multi instead of download_all_pangenome_dbs (which has been renamed to download_multi_dbs and is no longer the recommended approach for single-database access)
  3. Remove the manual search for the target database, since download_db_multi handles that directly

The updated example should match the pattern used in the actual route handler in app/routes.py.

Copilot uses AI. Check for mistakes.
Comment on lines +644 to +648
# Sanitize db_name to prevent path traversal (e.g., "../../etc")
safe_db_name = sanitize_id(db_name)

# Fast path: if already cached, return without hitting Workspace/Shock
db_dir = base_dir / safe_db_name
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent path sanitization between download_db_multi and download_multi_dbs. The download_db_multi function sanitizes the database name for the cache path (line 645-648), but download_multi_dbs uses the unsanitized database name directly (workspace.py line 567). This means if a database name contains characters that are sanitized (like colons or slashes), the two functions will create different cache directory structures, leading to cache misses and duplicate downloads. Either both functions should sanitize, or neither should. For consistency and security, both should sanitize using sanitize_id(db_name).

Suggested change
# Sanitize db_name to prevent path traversal (e.g., "../../etc")
safe_db_name = sanitize_id(db_name)
# Fast path: if already cached, return without hitting Workspace/Shock
db_dir = base_dir / safe_db_name
# Validate db_name to prevent path traversal (e.g., "../../etc")
if Path(db_name).name != db_name or "/" in db_name or "\\" in db_name:
raise ValueError(f"Invalid database name: {db_name!r}")
# Fast path: if already cached, return without hitting Workspace/Shock
db_dir = base_dir / db_name

Copilot uses AI. Check for mistakes.

def _seed_multi_db_cache(cache_dir: Path, upa: str, db_name: str) -> Path:
"""
Place a test database where `download_all_pangenome_dbs` would cache it.
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment mentions download_all_pangenome_dbs but this function has been renamed to download_multi_dbs in this PR. Update the comment to reflect the new function name.

Suggested change
Place a test database where `download_all_pangenome_dbs` would cache it.
Place a test database where `download_multi_dbs` would cache it.

Copilot uses AI. Check for mistakes.
@VibhavSetlur VibhavSetlur merged commit ccf7352 into develop Feb 11, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants