Support model deployment #113

arcticfly · 2025-05-07T23:53:49Z

For models that Together supports as serverless endpoints, we can now deploy LoRAs to Together and query them.

Changes

Document HUGGINGFACE_TOKEN and HUGGING_FACE_HUB_TOKEN, used for training gated models
Document TOGETHER_API_KEY, used for uploading models to Together
Update all tic tac toe rollouts to use scenario classes
Only report chat tic tac toe completions after trajectory completes
Create utility function get_step_checkpoint_dir
Add archive_and_presign_step_url for preparing checkpoint dir to upload to Together
Add _experimental_deploy function to LocalBackend and Backend
Add utils to manage Together deployments
Update tic-tac-toe-local.py to deploy a model to Together and use it in a rollout
Add UnsupportedBaseModelDeploymentError and LoRADeploymentTimedOutError errors
Rewrite Together job statuses into simpler LoRADeploymentJobStatusBody type
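For reviewers who want a picture of the archiving half of archive_and_presign_step_url, here is a minimal sketch. It only packs a checkpoint directory into a tarball; presigning an upload URL is left out. The helper names archive_checkpoint_dir and list_archive_files, the tar.gz format, and the adapter_config.json file name are assumptions for illustration, not the code in this PR.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_checkpoint_dir(checkpoint_dir: Path, archive_path: Path) -> Path:
    """Pack a checkpoint directory into a gzip-compressed tarball.

    Hypothetical helper: the real archive_and_presign_step_url may use a
    different format or layout.
    """
    if not checkpoint_dir.is_dir():
        raise FileNotFoundError(f"No checkpoint directory at {checkpoint_dir}")
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname="." stores paths relative to the checkpoint root.
        tar.add(checkpoint_dir, arcname=".")
    return archive_path


def list_archive_files(archive_path: Path) -> list[str]:
    """Return the base names of regular files stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(Path(m.name).name for m in tar.getmembers() if m.isfile())


# Small demo against a throwaway directory standing in for a step checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint"
    ckpt.mkdir()
    (ckpt / "adapter_config.json").write_text("{}")  # placeholder file name
    archive = archive_checkpoint_dir(ckpt, Path(tmp) / "checkpoint.tar.gz")
    archived_names = list_archive_files(archive)
```

Writing the archive outside the checkpoint directory avoids the tarball trying to include itself.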

Some relevant types:

from enum import Enum

from pydantic import BaseModel


class LoRADeploymentProvider(str, Enum):
    TOGETHER = "together"


class LoRADeploymentJobStatus(str, Enum):
    QUEUED = "Queued"
    RUNNING = "Running"
    COMPLETE = "Complete"
    FAILED = "Failed"


class LoRADeploymentJobStatusBody(BaseModel):
    status: LoRADeploymentJobStatus
    job_id: str
    model_name: str
    failure_reason: str | None
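To show how these types could fit together with LoRADeploymentTimedOutError, here is a rough sketch of a polling loop that waits for a deployment to reach a terminal status. The function wait_for_deployment, its parameters, the dataclass stand-in for LoRADeploymentJobStatusBody, and the fake status fetcher are all assumptions for illustration; the PR's actual waiting logic may look different.

```python
import asyncio
import time
from dataclasses import dataclass
from enum import Enum
from typing import Awaitable, Callable, Iterable, Optional


class LoRADeploymentJobStatus(str, Enum):
    QUEUED = "Queued"
    RUNNING = "Running"
    COMPLETE = "Complete"
    FAILED = "Failed"


@dataclass
class JobStatusBody:
    """Stdlib stand-in for LoRADeploymentJobStatusBody (assumption)."""
    status: LoRADeploymentJobStatus
    job_id: str
    model_name: str
    failure_reason: Optional[str] = None


class LoRADeploymentTimedOutError(Exception):
    """Raised when the job does not finish before the deadline."""


TERMINAL = (LoRADeploymentJobStatus.COMPLETE, LoRADeploymentJobStatus.FAILED)


async def wait_for_deployment(
    fetch_status: Callable[[], Awaitable[JobStatusBody]],
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
) -> JobStatusBody:
    """Poll fetch_status until the job completes, fails, or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        body = await fetch_status()
        if body.status in TERMINAL:
            return body
        if time.monotonic() >= deadline:
            raise LoRADeploymentTimedOutError(f"Job {body.job_id} still {body.status.value}")
        await asyncio.sleep(poll_interval_s)


def make_fake_fetcher(statuses: Iterable[LoRADeploymentJobStatus]):
    """Build a fetcher that replays canned statuses (test double, not a real API)."""
    it = iter(statuses)

    async def fetch() -> JobStatusBody:
        return JobStatusBody(status=next(it), job_id="job-123", model_name="demo-lora")

    return fetch


S = LoRADeploymentJobStatus
result = asyncio.run(
    wait_for_deployment(make_fake_fetcher([S.QUEUED, S.RUNNING, S.COMPLETE]), poll_interval_s=0.0)
)
```

Returning the body for both Complete and Failed leaves it to the caller to inspect failure_reason and decide how to report the failure.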

TODO:
Test _experimental_deploy through SkyPilot

corbt · 2025-05-08T00:26:44Z

I think those token names have been deprecated and replaced with HF_TOKEN. https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables
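If the older names need to keep working for a while, one option is a small lookup that prefers HF_TOKEN and falls back to the older variables. This is only a sketch: resolve_hf_token is a hypothetical helper, not part of huggingface_hub, and the fallback order is an assumption.

```python
import os
from typing import Mapping, Optional

# HF_TOKEN first; the older names from the PR description are fallbacks (assumed order).
_TOKEN_VARS = ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN", "HUGGINGFACE_TOKEN")


def resolve_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty token found, or None if no variable is set."""
    for name in _TOKEN_VARS:
        value = env.get(name)
        if value:
            return value
    return None
```

Passing the environment as a mapping keeps the helper easy to test without touching the real process environment.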

corbt · 2025-05-08T00:27:41Z

Why the big uv.lock diff? Doesn't look like pyproject.toml changed?

arcticfly · 2025-05-08T00:29:07Z

> Why the big uv.lock diff? Doesn't look like pyproject.toml changed?

I assume someone ran uv sync on mac

corbt · 2025-05-08T00:30:34Z

src/art/local/backend.py

@@ -379,3 +384,43 @@ async def _experimental_push_to_s3(
            delete=delete,
            art_path=self._path,
        )
+
+    async def _experimental_deploy(


Let's add a required parameter deploy_to, which must be set to "together". Will help people understand what's going on.

arcticfly · 2025-05-08

Added this class:

class LoRADeploymentProvider(str, Enum):
    TOGETHER = "together"
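As a rough illustration of the required deploy_to parameter, a keyword-only argument with no default forces callers to name the provider explicitly. The function deploy_sketch and its return value are hypothetical; the real _experimental_deploy signature is not shown in this thread.

```python
from enum import Enum
from typing import Union


class LoRADeploymentProvider(str, Enum):
    TOGETHER = "together"


def deploy_sketch(model_name: str, *, deploy_to: Union[LoRADeploymentProvider, str]) -> str:
    """Hypothetical stand-in for _experimental_deploy; only validates the provider."""
    # Enum lookup by value raises ValueError for anything other than "together".
    provider = LoRADeploymentProvider(deploy_to)
    return f"{model_name} -> {provider.value}"


summary = deploy_sketch("my-lora", deploy_to="together")
```

Omitting deploy_to raises a TypeError at the call site, so nobody deploys to a provider by accident.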

corbt · 2025-05-08T00:31:57Z

src/art/utils/deploy_model.py

+            return None
+
+
+async def deploy_together(


Can we add an explicit check in this function that the base model is one that Together supports? I guess the user will figure it out eventually if they try one that isn't supported, but it would be nice to have the check up front so you can take that into account when deciding which base model to train.

arcticfly · 2025-05-08

Now returning an UnsupportedBaseModelDeploymentError.
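A minimal sketch of what an up-front check could look like, assuming the supported models are kept in a simple set. The model names below are placeholders, not Together's actual serverless list, and check_base_model_supported is a hypothetical helper rather than the code in deploy_together.

```python
from typing import Collection


class UnsupportedBaseModelDeploymentError(Exception):
    """Raised when the base model cannot be deployed to the chosen provider."""


# Placeholder names only; the real list would come from the provider.
EXAMPLE_SUPPORTED_BASE_MODELS = frozenset(
    {"example-org/base-model-a", "example-org/base-model-b"}
)


def check_base_model_supported(
    base_model: str,
    supported: Collection[str] = EXAMPLE_SUPPORTED_BASE_MODELS,
) -> None:
    """Fail fast, before any upload, if the base model is not deployable."""
    if base_model not in supported:
        raise UnsupportedBaseModelDeploymentError(
            f"{base_model!r} is not in the supported list: {sorted(supported)}"
        )
```

Running a check like this before archiving and uploading means an unsupported base model is reported before any time is spent on the upload.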

corbt · 2025-05-08T00:42:49Z

> I assume someone ran uv sync on mac

I'm not an expert, but according to this, uv.lock is supposed to be cross-platform: https://docs.astral.sh/uv/guides/projects/#pyprojecttoml

arcticfly · 2025-05-08T01:23:34Z

> > I assume someone ran uv sync on mac
>
> I'm not an expert, but according to this, uv.lock is supposed to be cross-platform: https://docs.astral.sh/uv/guides/projects/#pyprojecttoml

It might not be a platform issue, but I have noticed that running uv on my local Mac messes up uv.lock and interferes with installation on a Linux machine. Worth looking into, but probably not in scope for this PR.

corbt

nice work!

arcticfly added 9 commits May 6, 2025 17:18

Add skeleton of _experimental_deploy

1ddced6

Get presigned url

98934bd

Upload LoRAs to Together

0623ecc

Add deploy_model utils

b48d5a9

Document HF tokens

6d89d7f

Better OP reporting

2e6b708

Deploy models to Together

1e406d5

Document TOGETHER_API_KEY

f9f898e

Remove _experimental_deploy from Model

bf416f4

arcticfly requested a review from corbt May 8, 2025 00:06

corbt reviewed May 8, 2025

Document HF_TOKEN

cad10ec

corbt reviewed May 8, 2025

Shorten ttt training loop

274fd4f

arcticfly added 2 commits May 8, 2025 02:32

Introduce LoRADeploymentJobStatus, wait for deployment

6c4f85b

Properly handle deployment failures

82cb98b

corbt approved these changes May 8, 2025
