Skip to content

Keep going display on HUD: log classifer to handle temp logs #6723

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jun 20, 2025

Conversation

clee2000
Copy link
Contributor

@clee2000 clee2000 commented Jun 6, 2025

Context is pytorch/pytorch#155371

Add a temp_log input to read the log from the temp_log bucket and put the attribute into a temp attribute

This should be a noop in the normal flow

I'm not actually sure if this will work, since the webhook to dynamo code seems to overwrite the entire dynamo entry, so if another webhook gets sent, this attribute would be lost. But webhooks only get sent when something happens, so maybe only when a step finishes or the job finishes? so it might be ok. But also the normal flow doesn't use this so it should be safe in that sense to merge

I'm also not sure if a temp attribute is necessary, since if the above happens the temp attribute will be gone

Possible solutions if the above happens and the temp attribute disappears:

  • Insert into different table/s3 bucket -> different clickhouse table -> do a join
    • I think this will have bad perf b/c join
  • Change webhook to dynamo to be update instead of overwrite
    • not sure if the fields of the webhook change

Testing: in fixtures/request.json, changed the raw query string to job_id=44069312114&repo=pytorch%2Fpytorch&temp_log=true and did the things in the readme and saw that the temp value got written to dynamo under a new temp attribute, also edited the clickhouse tables to add columns and saw that it got inserted. Also ran it with without temp_log and saw that the normal log got used.

Copy link

vercel bot commented Jun 6, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Updated (UTC)
torchci ⬜️ Ignored (Inspect) Visit Preview Jun 17, 2025 6:17pm

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 6, 2025
@clee2000 clee2000 force-pushed the csl/keep_going_log_classifier branch from 4b282f6 to 181cc7f Compare June 17, 2025 18:17
@clee2000 clee2000 changed the title [wip] keep going log classifier changes Keep going display on HUD: log classifer to handle temp logs Jun 17, 2025
@clee2000 clee2000 marked this pull request as ready for review June 17, 2025 18:44
@clee2000 clee2000 requested a review from a team June 17, 2025 18:50
Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably looks fine, but would be good to add a test plan (and may be a fallback: if temp log is not found use regular one)

@clee2000
Copy link
Contributor Author

Probably looks fine, but would be good to add a test plan (and may be a fallback: if temp log is not found use regular one)

I'm ok with it erroring/not uploading anything, it is temp
I will figure out how to add tests

@huydhn
Copy link
Contributor

huydhn commented Jun 19, 2025

+1 on having a test plan because breaking log classifier has much wider impact that expect. When it isn't working correctly, Dr.CI wouldn't work, and consequently mergebot. The change looks ok though

Do you have an example temp log on S3 already from a CI job? Then testing can be done locally by changing https://github.com/pytorch/test-infra/blob/main/aws/lambda/log-classifier/fixtures/request.json#L5

@clee2000 clee2000 merged commit 2324e15 into main Jun 20, 2025
6 checks passed
@clee2000 clee2000 deleted the csl/keep_going_log_classifier branch June 20, 2025 18:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants