Conversation

Member

@vpandiarajan20 commented Nov 21, 2025

Adds a training-script test-local CLI command for testing ML training scripts locally in Docker before submitting them to the cloud.

  • Validates training script structure (setup.py, model/training.py) and dataset paths
  • Mounts local directories into Docker containers with proper working directory setup
  • Supports custom training arguments and both standard/custom container versions
  • Handles Docker availability checks, signal interrupts, and platform compatibility (linux/x86_64)
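The structure check in the first bullet could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `validateTrainingScriptDir` is a hypothetical name, and the only facts taken from the description are the two required paths, setup.py and model/training.py.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// validateTrainingScriptDir is a hypothetical helper. It checks that dir
// contains setup.py and model/training.py as regular files.
func validateTrainingScriptDir(dir string) error {
	required := []string{"setup.py", filepath.Join("model", "training.py")}
	for _, rel := range required {
		info, err := os.Stat(filepath.Join(dir, rel))
		if errors.Is(err, os.ErrNotExist) {
			return fmt.Errorf("training script directory %q is missing %s", dir, rel)
		}
		if err != nil {
			return fmt.Errorf("checking %s: %w", rel, err)
		}
		if info.IsDir() {
			return fmt.Errorf("%s in %q is a directory, expected a file", rel, dir)
		}
	}
	return nil
}

func main() {
	// Build a throwaway directory with the expected layout and validate it.
	dir, err := os.MkdirTemp("", "training-script-")
	if err != nil {
		fmt.Println("could not create temp dir:", err)
		return
	}
	defer os.RemoveAll(dir)

	_ = os.MkdirAll(filepath.Join(dir, "model"), 0o750)
	_ = os.WriteFile(filepath.Join(dir, "setup.py"), nil, 0o600)
	_ = os.WriteFile(filepath.Join(dir, "model", "training.py"), nil, 0o600)

	fmt.Println("validation result:", validateTrainingScriptDir(dir))
}
```

Failing early on a missing file keeps the error message specific, instead of surfacing later as a Python import error inside the container.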

Manual testing is in progress on:

  • Apple Silicon Mac
  • Ubuntu
  • Intel Mac
  • Windows

@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Nov 21, 2025

container, ok := res.ContainerMap[version]
if !ok {
	if dockerVertexImageRegex.MatchString(version) {
Member Author

There are no feature flags to use here, so I threw this back in, but lmk if I should remove it.

Member

Maybe we expose an endpoint for GetContainer() that returns a containerURI given a key (and would just return the URI if they're a power user)? @tahiyasalam what do you think?
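To make the lookup-or-passthrough idea concrete, here is a rough client-side sketch of the behavior being discussed. `resolveContainerURI` and `imageURIPattern` are hypothetical names; the pattern is an assumed stand-in for dockerVertexImageRegex, whose real definition is not shown in this thread.

```go
package main

import (
	"fmt"
	"regexp"
)

// imageURIPattern is an ASSUMED stand-in for dockerVertexImageRegex.
// It accepts strings shaped like registry/path/name:tag.
var imageURIPattern = regexp.MustCompile(`^[a-z0-9.-]+(/[a-z0-9._-]+)+:[A-Za-z0-9._-]+$`)

// resolveContainerURI returns the URI registered for a known version key.
// If the key is unknown but already looks like an image URI, it is passed
// through unchanged (the "power user" case).
func resolveContainerURI(containers map[string]string, version string) (string, error) {
	if uri, ok := containers[version]; ok {
		return uri, nil
	}
	if imageURIPattern.MatchString(version) {
		return version, nil
	}
	return "", fmt.Errorf("unknown container version %q", version)
}

func main() {
	// Placeholder key and URI, for illustration only.
	containers := map[string]string{
		"example-key": "registry.example/training/image:1.0",
	}
	for _, v := range []string{"example-key", "registry.example/custom/image:dev", "not a uri"} {
		uri, err := resolveContainerURI(containers, v)
		fmt.Printf("%q -> %q, err=%v\n", v, uri, err)
	}
}
```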

// Validate that the key portion only contains safe characters
parts := strings.SplitN(arg, "=", 2)
key := parts[0]
if !isValidArgumentKey(key) {
Member Author

I tried to clean the input, let me know if there's a better way to do this.
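For reference, one common way to do this kind of key check is an allowlist pattern. The sketch below is an assumption about what isValidArgumentKey might look like, not the code in this PR; the allowed character set and the `validateCustomArg` wrapper are illustrative.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// argKeyPattern is an ASSUMED allowlist: a letter followed by letters,
// digits, underscores, or hyphens.
var argKeyPattern = regexp.MustCompile(`^[A-Za-z][A-Za-z0-9_-]*$`)

// isValidArgumentKey reports whether key contains only allowlisted characters.
func isValidArgumentKey(key string) bool {
	return argKeyPattern.MatchString(key)
}

// validateCustomArg is a hypothetical wrapper that expects key=value and
// validates only the key portion, as in the snippet under review.
func validateCustomArg(arg string) error {
	parts := strings.SplitN(arg, "=", 2)
	if len(parts) != 2 {
		return fmt.Errorf("argument %q must be in key=value form", arg)
	}
	if !isValidArgumentKey(parts[0]) {
		return fmt.Errorf("argument key %q contains unsupported characters", parts[0])
	}
	return nil
}

func main() {
	for _, arg := range []string{"epochs=10", "name=a=b", "epochs;rm=1", "noequals"} {
		fmt.Printf("%q -> %v\n", arg, validateCustomArg(arg))
	}
}
```

An allowlist tends to be easier to reason about than stripping known-bad characters, since anything not explicitly permitted is rejected.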


// Provide additional context for platform-related errors
errMsg := err.Error()
if strings.Contains(errMsg, "platform") || strings.Contains(errMsg, "architecture") {
Member Author

This is kind of ugly, but I thought it was worth including. It's also in the command description, so it's potentially removable.
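If the substring check stays, one way to make it a little less ugly is to isolate it in a small helper. The sketch below is hypothetical (`platformHint` and `withPlatformHint` are illustrative names, and the hint wording is made up); it keeps the original error intact via %w wrapping.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// platformHint returns extra context when msg looks platform-related,
// or an empty string otherwise. The hint text is illustrative.
func platformHint(msg string) string {
	lower := strings.ToLower(msg)
	if strings.Contains(lower, "platform") || strings.Contains(lower, "architecture") {
		return "training containers target linux/x86_64; check that your Docker setup can run that platform"
	}
	return ""
}

// withPlatformHint wraps err with the hint when one applies, so callers
// can still match the original error with errors.Is or errors.As.
func withPlatformHint(err error) error {
	if err == nil {
		return nil
	}
	if hint := platformHint(err.Error()); hint != "" {
		return fmt.Errorf("%w (%s)", err, hint)
	}
	return err
}

func main() {
	base := errors.New("image does not match the detected host platform")
	wrapped := withPlatformHint(base)
	fmt.Println(wrapped)
	fmt.Println("still matches original:", errors.Is(wrapped, base))
}
```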

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Nov 24, 2025
cli/app.go Outdated
Flags: []cli.Flag{
	&cli.StringFlag{
		Name: trainFlagDatasetRoot,
		Usage: "path to the dataset root directory (where dataset.jsonl and image files are located). This is where you ran the 'viam dataset export' command from.",
Member Author

Let me know if this is too confusing.

└── images/
    └── cat.jpg
NOTES:
Member Author

Maybe this goes in documentation? I also want to add that if the container's really slow, it could be because the dataset root or training script directory has a bunch of extra files. Apparently, mounting volumes can be expensive.
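For context on the mounts mentioned here, the sketch below shows how the bind mounts and working directory might be assembled into docker run arguments. The container-side paths (/training_script, /dataset, /output) and the `buildDockerRunArgs` name are assumptions for illustration; only the linux/x86_64 platform comes from the PR description.

```go
package main

import (
	"fmt"
	"strings"
)

// buildDockerRunArgs is a hypothetical helper. It bind-mounts the training
// script, dataset root, and output directory into the container and sets
// the working directory. Container-side paths are ASSUMED.
func buildDockerRunArgs(image, scriptDir, datasetRoot, outputDir string) []string {
	return []string{
		"run", "--rm",
		"--platform", "linux/x86_64",
		"-v", scriptDir + ":/training_script",
		"-v", datasetRoot + ":/dataset:ro", // dataset is mounted read-only
		"-v", outputDir + ":/output",
		"-w", "/training_script",
		image,
	}
}

func main() {
	args := buildDockerRunArgs(
		"registry.example/training/image:1.0", // placeholder image
		"/home/me/my-training-script",
		"/home/me/my-dataset",
		"/home/me/output",
	)
	fmt.Println("docker " + strings.Join(args, " "))
}
```

Because each -v mount exposes the whole host directory, keeping the dataset root and script directory free of unrelated files is the simplest mitigation for the slowness described above.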

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Nov 25, 2025
@viambot viambot removed the safe to test This pull request is marked safe to test from a trusted zone label Nov 25, 2025
@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Nov 25, 2025
@vpandiarajan20 vpandiarajan20 marked this pull request as ready for review November 26, 2025 16:51
@vpandiarajan20 vpandiarajan20 requested a review from a team as a code owner November 26, 2025 16:51
@vpandiarajan20
Member Author

I am still working on testing on Windows and Linux, but would love to get eyes on this!

}

// MLTrainingScriptTestLocalAction runs training locally in a Docker container.
func MLTrainingScriptTestLocalAction(c *cli.Context, args mlTrainingScriptTestLocalArgs) error {
Member

This function is probably going to take a while. I think we should add info logging throughout so people know that it's working.

Member Author

The only part of this function that takes a significant amount of time is running the docker command that runs training at the bottom, and the logs from the docker container show up in the terminal. I also added timeouts for checkDockerAvailable, so I think that should be fine? I'd suggest running this command; sorry, I know it's annoying.

Member

The issue here is that the container is really, really, really big. In the office, downloading it can take 30 minutes. I know that it logs the docker stuff, but giving people a heads-up somewhere would be helpful.

Member Author

oop that is a really good point, thanks!

// Check if Docker is available
if err := checkDockerAvailable(); err != nil {
	return err
}
Member

For example, just adding a "Docker is available. Using image BLAH" message.
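Putting the timeout and the heads-up together, a rough sketch might look like the following. `dockerAvailable` is a hypothetical variant of checkDockerAvailable (the real one takes no arguments in the snippet above); it takes the binary name and timeout as parameters purely so it is easy to exercise. It relies on `docker info` failing when the daemon is unreachable.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// dockerAvailable is a hypothetical check: the binary must be on PATH and
// `<bin> info` must succeed before the timeout elapses.
func dockerAvailable(dockerBin string, timeout time.Duration) error {
	if _, err := exec.LookPath(dockerBin); err != nil {
		return fmt.Errorf("%s not found on PATH: %w", dockerBin, err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := exec.CommandContext(ctx, dockerBin, "info").Run(); err != nil {
		if errors.Is(ctx.Err(), context.DeadlineExceeded) {
			return fmt.Errorf("%s info timed out after %s", dockerBin, timeout)
		}
		return fmt.Errorf("%s daemon does not appear to be running: %w", dockerBin, err)
	}
	return nil
}

func main() {
	image := "registry.example/training/image:1.0" // placeholder image name
	if err := dockerAvailable("docker", 10*time.Second); err != nil {
		fmt.Println("Docker check failed:", err)
		return
	}
	fmt.Printf("Docker is available. Using image %s\n", image)
	fmt.Println("Note: the first run may spend a long time pulling the image before training output appears.")
}
```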

}

// Create temporary training script
tmpScript, err := createTrainingScript(args.CustomArgs, datasetFileRelative)
Member

What are we doing here? We're creating our own training script?

Member Author

@vpandiarajan20 Nov 26, 2025

In cloud training, the vertex container has an entrypoint runcloudml.py that installs dependencies and runs the training script that we provide. I'm mimicking its functionality here.
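As a rough illustration of what mimicking that entrypoint could involve, the sketch below generates a small shell script that installs the package and then runs the training module. Everything specific here is an assumption rather than the PR's actual createTrainingScript: the `buildEntrypointScript` name, the `pip3 install .` step, and the `--dataset_file` and `--model_output_directory` flag names.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote wraps s in single quotes for POSIX sh, escaping any embedded
// single quotes.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// buildEntrypointScript is a hypothetical generator. It emits a script that
// installs the training package from the working directory and then runs
// model/training.py as a module. Flag names are ASSUMED.
func buildEntrypointScript(datasetFile, outputDir string, customArgs []string) string {
	var b strings.Builder
	b.WriteString("#!/bin/sh\nset -e\n")
	b.WriteString("pip3 install .\n")
	b.WriteString("python3 -m model.training")
	b.WriteString(" --dataset_file=" + shellQuote(datasetFile))
	b.WriteString(" --model_output_directory=" + shellQuote(outputDir))
	for _, arg := range customArgs {
		// Each custom key=value pair becomes one quoted --key=value word.
		b.WriteString(" " + shellQuote("--"+arg))
	}
	b.WriteString("\n")
	return b.String()
}

func main() {
	fmt.Print(buildEntrypointScript("dataset.jsonl", "/output", []string{"epochs=10"}))
}
```

Quoting every interpolated value matters here because the generated script is interpreted by a shell inside the container.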

Member

Ah, cool! Didn't realize that.

}

// Ensure output directory exists
if err := os.MkdirAll(outputDir, 0o750); err != nil {
Member

What happens if I run this twice? Does MkdirAll return an error if the directory already exists?

Member Author

If the directory already exists, it returns nil and nothing happens.
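A quick way to confirm this behavior is a small standalone check that calls os.MkdirAll twice on the same path. The `ensureDirTwice` helper is hypothetical and exists only for this demonstration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ensureDirTwice creates a nested directory under a fresh temp dir and then
// calls os.MkdirAll on the same path again, returning both results.
func ensureDirTwice() (first, second error) {
	base, err := os.MkdirTemp("", "mkdirall-demo-")
	if err != nil {
		return err, err
	}
	defer os.RemoveAll(base)

	outputDir := filepath.Join(base, "nested", "output")
	first = os.MkdirAll(outputDir, 0o750)
	second = os.MkdirAll(outputDir, 0o750) // directory already exists here
	return first, second
}

func main() {
	first, second := ensureDirTwice()
	fmt.Println("first call error: ", first)
	fmt.Println("second call error:", second)
}
```

The one case that does fail is when the path already exists as a regular file rather than a directory; os.MkdirAll returns an error then.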

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 1, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 2, 2025
Labels

safe to test This pull request is marked safe to test from a trusted zone

3 participants