-
Notifications
You must be signed in to change notification settings - Fork 126
APP-10775: viam training-script test-local #5524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
||
| container, ok := res.ContainerMap[version] | ||
| if !ok { | ||
| if dockerVertexImageRegex.MatchString(version) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no feature flags to use here, so I threw this back in, but lmk if I should remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we expose an endpoint for GetContainer() that returns a containerURI given a key (and would just return the URI if they're a power user? @tahiyasalam what do you think?
| // Validate that the key portion only contains safe characters | ||
| parts := strings.SplitN(arg, "=", 2) | ||
| key := parts[0] | ||
| if !isValidArgumentKey(key) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to clean the input, let me know if there's a better way to do this.
|
|
||
| // Provide additional context for platform-related errors | ||
| errMsg := err.Error() | ||
| if strings.Contains(errMsg, "platform") || strings.Contains(errMsg, "architecture") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is kind of ugly, but I thought it was worth including. it's also in the command description so potentially removable.
cli/app.go
Outdated
| Flags: []cli.Flag{ | ||
| &cli.StringFlag{ | ||
| Name: trainFlagDatasetRoot, | ||
| Usage: "path to the dataset root directory (where dataset.jsonl and image files are located). This is where you ran the 'viam dataset export' command from.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if this is too confusing.
| └── images/ | ||
| └── cat.jpg | ||
| NOTES: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this goes in documentation? I also want to add that if the containers really slow, it could be because the dataset root or training script directory has a bunch of extra files. Apparently mounting volumes can be expensive.
|
I am still working on testing on Windows and Linux, but would love to get eyes on this! |
| } | ||
|
|
||
| // MLTrainingScriptTestLocalAction runs training locally in a Docker container. | ||
| func MLTrainingScriptTestLocalAction(c *cli.Context, args mlTrainingScriptTestLocalArgs) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is probably gonna take awhile. I think maybe we should add info logging all over the place so people know that it's working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only part of this function that takes a significant amount of time is actually running the docker command to run training at the bottom and the logs from the docker container show up in the terminal. I also added timeouts for checkDockerAvailable so I think that should be fine? I'd suggest running this command, sorry I know it's annoying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue here is that the container is really really really big. In the office downloading this can take 30 minutes. I know that it logs the docker stuff but giving people a heads up somewhere would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oop that is a really good point, thanks!
| // Check if Docker is available | ||
| if err := checkDockerAvailable(); err != nil { | ||
| return err | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, just adding a "Docker is available. Using image BLAH"
| } | ||
|
|
||
| // Create temporary training script | ||
| tmpScript, err := createTrainingScript(args.CustomArgs, datasetFileRelative) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are we doing here? We're creating our own training script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In cloud training, the vertex container has an entrypoint runcloudml.py that installs dependencies and runs the training script that we provide. I'm mimicking its functionality here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, cool! Didn't realize that.
| } | ||
|
|
||
| // Ensure output directory exists | ||
| if err := os.MkdirAll(outputDir, 0o750); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if I run this twice? Does MkdirAll return an error if the directory already exists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the directory already exists, it returns nil, nothing happens.
Adds
training-script test-localCLI command for testing ML training scripts locally in Docker before submitting to cloud.setup.py,model/training.py) and dataset pathsWorking on manual testing.