Support model deployment #113

Open
wants to merge 13 commits into
base: main

arcticfly commented May 7, 2025

For models that Together supports as serverless endpoints, we can now deploy LoRAs to Together and query them.

Changes

  • Document HUGGINGFACE_TOKEN and HUGGING_FACE_HUB_TOKEN, used for training gated models
  • Document YOUR_TOGETHER_API_KEY, used for uploading models to Together
  • Update all tic tac toe rollouts to use scenario classes
  • Only report chat tic tac toe completions after trajectory completes
  • Create utility function get_step_checkpoint_dir
  • Add archive_and_presign_step_url for preparing checkpoint dir to upload to Together
  • Add _experimental_deploy function to LocalBackend and Backend
  • Add utils to manage Together deployments
  • Update tic-tac-toe-local.py to deploy a model to Together and use it in a rollout
  • Add UnsupportedBaseModelDeploymentError and LoRADeploymentTimedOutError errors
  • Rewrite Together job statuses into simpler LoRADeploymentJobStatusBody type
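
As a rough illustration of the archiving half of a helper like archive_and_presign_step_url, the sketch below packs a checkpoint directory into a gzip-compressed tarball using only the standard library. The function name archive_checkpoint_dir and its parameters are hypothetical and are not taken from this PR. Presigning the upload URL is left out because it depends on the storage client in use.

```python
import tarfile
from pathlib import Path


def archive_checkpoint_dir(checkpoint_dir: str, archive_path: str) -> str:
    """Pack every file under checkpoint_dir into a .tar.gz (hypothetical helper)."""
    src = Path(checkpoint_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"checkpoint dir not found: {src}")
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store paths relative to the checkpoint dir so the archive
        # unpacks without the local directory prefix.
        for path in sorted(src.rglob("*")):
            if path.is_file():
                tar.add(path, arcname=str(path.relative_to(src)))
    return archive_path
```

The archive path should live outside the checkpoint directory so the tarball does not try to include itself.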

Some relevant types:

from enum import Enum

from pydantic import BaseModel  # assumed: BaseModel here is pydantic's


class LoRADeploymentProvider(str, Enum):
    TOGETHER = "together"


class LoRADeploymentJobStatus(str, Enum):
    QUEUED = "Queued"
    RUNNING = "Running"
    COMPLETE = "Complete"
    FAILED = "Failed"


class LoRADeploymentJobStatusBody(BaseModel):
    status: LoRADeploymentJobStatus
    job_id: str
    model_name: str
    failure_reason: str | None
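
To show how a simplified status type like this might be consumed, here is a minimal polling sketch. It is not the code from this PR: JobStatusBody is a stdlib dataclass stand-in for LoRADeploymentJobStatusBody, the LoRADeploymentTimedOutError defined here is a local placeholder for the error of the same name, and wait_for_deployment with its parameters is hypothetical.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class LoRADeploymentJobStatus(str, Enum):
    QUEUED = "Queued"
    RUNNING = "Running"
    COMPLETE = "Complete"
    FAILED = "Failed"


@dataclass
class JobStatusBody:
    """Stdlib stand-in for LoRADeploymentJobStatusBody."""
    status: LoRADeploymentJobStatus
    job_id: str
    model_name: str
    failure_reason: Optional[str] = None


class LoRADeploymentTimedOutError(Exception):
    """Placeholder for the PR's error type; its real signature is not shown here."""


def wait_for_deployment(
    check_status: Callable[[], JobStatusBody],
    timeout_s: float = 600.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> JobStatusBody:
    """Poll until the job is Complete or Failed, or raise after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        body = check_status()
        if body.status in (LoRADeploymentJobStatus.COMPLETE, LoRADeploymentJobStatus.FAILED):
            return body
        if time.monotonic() >= deadline:
            raise LoRADeploymentTimedOutError(f"job {body.job_id} still {body.status.value}")
        sleep(poll_interval_s)
```

Callers would still need to inspect failure_reason when the returned status is Failed, since the sketch treats both terminal states as a normal return.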

TODO:
Test _experimental_deploy through SkyPilot

@arcticfly arcticfly requested a review from corbt May 8, 2025 00:06

corbt commented May 8, 2025

I think those token names have been deprecated and replaced with HF_TOKEN. https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables
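
If HF_TOKEN is adopted while the older names are still honored for a while, the lookup could be sketched as below. The helper resolve_hf_token and the precedence order are assumptions for illustration, not code from this PR or from huggingface_hub.

```python
import os
from typing import Mapping, Optional

# Preferred name first, then the two names documented in this PR.
TOKEN_ENV_VARS = ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN", "HUGGINGFACE_TOKEN")


def resolve_hf_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first non-empty token found, or None (hypothetical helper)."""
    env = os.environ if env is None else env
    for name in TOKEN_ENV_VARS:
        value = env.get(name)
        if value:
            return value
    return None
```

Passing the environment as a mapping keeps the helper easy to test without touching os.environ.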

corbt commented May 8, 2025

Why the big uv.lock diff? Doesn't look like pyproject.toml changed?

@arcticfly

> Why the big uv.lock diff? Doesn't look like pyproject.toml changed?

I assume someone ran uv sync on a Mac.

@@ -379,3 +384,43 @@ async def _experimental_push_to_s3(
delete=delete,
art_path=self._path,
)

async def _experimental_deploy(
Contributor

Let's add a required parameter deploy_to, which must be set to "together". Will help people understand what's going on.

Contributor Author

Added this class:

class LoRADeploymentProvider(str, Enum):
    TOGETHER = "together"
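
One way a required deploy_to parameter could be wired to this enum is sketched below. The function name deploy, its keyword-only signature, and its return value are hypothetical; the real _experimental_deploy signature is not shown in this thread.

```python
import asyncio
from enum import Enum
from typing import Union


class LoRADeploymentProvider(str, Enum):
    TOGETHER = "together"


async def deploy(*, deploy_to: Union[LoRADeploymentProvider, str], model_name: str) -> str:
    """Hypothetical entry point: deploy_to is keyword-only and has no default."""
    # Enum lookup by value raises ValueError for anything other than "together".
    provider = LoRADeploymentProvider(deploy_to)
    if provider is LoRADeploymentProvider.TOGETHER:
        # A real implementation would upload the checkpoint here.
        return f"{provider.value}:{model_name}"
    raise NotImplementedError(f"no deployment path for {provider.value}")


if __name__ == "__main__":
    print(asyncio.run(deploy(deploy_to="together", model_name="my-lora")))
```

Because the enum subclasses str, callers can pass either the plain string or the enum member.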

return None


async def deploy_together(
Contributor

Can we add an explicit check in this function that the base model is one that Together supports? I guess the user will figure it out eventually if they try one that isn't supported but would be nice to have it explicitly up front so you can take that into account when deciding which base model to train.

Contributor Author

Now returning an UnsupportedBaseModelDeploymentError.
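
An up-front check of this kind might look like the sketch below. The error class here is a local placeholder for the PR's UnsupportedBaseModelDeploymentError, check_base_model_supported is a hypothetical helper, and the model names are made-up placeholders rather than Together's actual supported list.

```python
from typing import Iterable


class UnsupportedBaseModelDeploymentError(Exception):
    """Local placeholder for the error type added in this PR."""


def check_base_model_supported(base_model: str, supported: Iterable[str]) -> None:
    """Fail fast, before any upload, if the provider cannot serve base_model."""
    supported_set = frozenset(supported)
    if base_model not in supported_set:
        raise UnsupportedBaseModelDeploymentError(
            f"{base_model!r} is not deployable; supported: {sorted(supported_set)}"
        )


# Placeholder names only; the real list would come from the provider.
EXAMPLE_SUPPORTED = ["example-org/model-a", "example-org/model-b"]
check_base_model_supported("example-org/model-a", EXAMPLE_SUPPORTED)  # passes silently
```

Running the same check before training starts would let users pick a deployable base model early, which is the concern raised in the review comment.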

corbt commented May 8, 2025

> I assume someone ran uv sync on mac

I'm not an expert, but according to this, uv.lock is supposed to be cross-platform: https://docs.astral.sh/uv/guides/projects/#pyprojecttoml

@arcticfly

> > I assume someone ran uv sync on mac
>
> I'm not an expert but according to this uv.lock is supposed to be cross-platform: https://docs.astral.sh/uv/guides/projects/#pyprojecttoml

Might not be a platform issue, but I have noticed that running uv sync on my local Mac messes up uv.lock and interferes with installation on a Linux machine. Worth looking into, but probably not in scope for this PR.

@corbt corbt left a comment

nice work!
