Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement async export for Databricks trace export and make it default #15163

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

B-Step62
Copy link
Collaborator

@B-Step62 B-Step62 commented Mar 28, 2025

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/15163/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/15163/merge#subdirectory=skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s 15163

What changes are proposed in this pull request?

Enable async trace logging when exporting to Databricks trace server. This will be a default behavior and can opt-out by setting MLFLOW_ENABLE_ASYNC_LOGGING to False explicitly.

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

I will add more documentation about Databricks external monitoring to https://www.mlflow.org/docs/latest/tracing/production in follow-up.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Suport async trace export to Databricks and make it default behavior.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@Copilot Copilot bot review requested due to automatic review settings March 28, 2025 11:13
@github-actions github-actions bot added area/tracking Tracking service, tracking client APIs, autologging rn/feature Mention under Features in Changelogs. labels Mar 28, 2025
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements asynchronous Databricks trace export as the default behavior, with supporting changes in environment configuration, export queue implementation, and unit tests.

  • Updated tests to validate both async and sync export behavior.
  • Created a new AsyncTraceExportQueue with configurable parameters for queue size and worker count.
  • Updated REST utilities and Databricks exporter to support retries with a configurable timeout.

Reviewed Changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tests/tracing/export/test_mlflow_exporter.py Removed async queue tests to align with the new async implementation.
tests/tracing/export/test_databricks_exporter.py Added async/sync parametrization and environment variable setups.
tests/tracing/export/test_async_export_queue.py Introduced tests for the new AsyncTraceExportQueue implementation.
mlflow/utils/rest_utils.py Added support for a retry_timeout_seconds parameter in HTTP requests.
mlflow/tracing/export/mlflow.py Updated to initialize the async export queue without passing a client.
mlflow/tracing/export/databricks.py Configured the exporter to default to async mode with a retry timeout.
mlflow/tracing/export/async_export_queue.py New async queue implementation using a bounded queue and worker pool.
mlflow/environment_variables.py Added new environment variables for async logging configurations.
Files not reviewed (1)
  • docs/docs/tracing/api/how-to.mdx: Language not supported
Comments suppressed due to low confidence (2)

tests/tracing/export/test_databricks_exporter.py:170

  • The environment variable 'MLFLOW_ASYNC_TRACE_LOGGING_RETRY_TIMEOUT' is set twice consecutively. Removing the duplicate line will improve clarity and avoid potential confusion.
monkeypatch.setenv("MLFLOW_ASYNC_TRACE_LOGGING_RETRY_TIMEOUT", "3")

mlflow/tracing/export/mlflow.py:45

  • Previously, AsyncTraceExportQueue was initialized with a client argument. If the client functionality is still required for proper trace export handling, consider either reintroducing the parameter or updating the queue implementation to remove the dependency on the client.
self._async_queue = AsyncTraceExportQueue()

Copy link

Documentation preview for 7ae9ba4 will be available when this CircleCI job
completes successfully. You may encounter a {"message":"not found"} error when reloading
a page. If so, add /index.html to the URL.

More info

Comment on lines +73 to +74
while not self._stop_event.is_set():
self._dispatch_task()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this clause necessary? it seems like even when stopped, we drain the queue

Comment on lines +39 to +41
self._is_async = True
if MLFLOW_ENABLE_ASYNC_LOGGING.is_set():
self._is_async = MLFLOW_ENABLE_ASYNC_LOGGING.get()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we document this env var above as well?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if so then maybe we can just add to environment_variables.py with default value of True

Copy link
Collaborator

@daniellok-db daniellok-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm with a couple nits!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/tracking Tracking service, tracking client APIs, autologging rn/feature Mention under Features in Changelogs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants