-
Notifications
You must be signed in to change notification settings - Fork 835
π feat(inspect): Add progressbar #3045
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: feature/geti-inspect
Are you sure you want to change the base?
π feat(inspect): Add progressbar #3045
Conversation
Signed-off-by: Ashwin Vaidya <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @ashwinvaidya17. Looks good overall, left some comments
Signed-off-by: Ashwin Vaidya <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds a progress bar feature to the inspect interface that displays real-time training progress with the ability to cancel jobs. The implementation includes both frontend UI components and backend progress tracking infrastructure.
Key changes:
- Adds a new status bar component with integrated progress display for training jobs
- Implements real-time progress tracking using Server-Sent Events (SSE) and Lightning callbacks
- Introduces job cancellation functionality with corresponding API endpoints
Reviewed Changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| application/ui/src/routes/inspect/inspect.tsx | Adds StatusBar component to the inspect layout |
| application/ui/src/features/inspect/statusbar/statusbar.component.tsx | Creates status bar container component |
| application/ui/src/features/inspect/statusbar/items/progressbar.component.tsx | Implements progress bar with job polling and cancellation |
| application/backend/src/utils/callbacks.py | Adds Lightning callback for progress synchronization |
| application/backend/src/services/training_service.py | Integrates progress tracking into training workflow |
| application/backend/src/services/job_service.py | Adds progress streaming and job cancellation methods |
| application/backend/src/pydantic_models/job.py | Extends job model with stage field and cancellation response |
| application/backend/src/db/schema.py | Adds stage column to job database schema |
| application/backend/src/api/endpoints/job_endpoints.py | Adds progress streaming and cancellation endpoints |
Files not reviewed (1)
- application/ui/package-lock.json: Language not supported
π‘ Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def setup(self, trainer: Trainer, pl_module: LightningModule, stage: str) -> None: | ||
| self._send_progress(0, stage) |
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stage parameter should be of type RunningStage to match the _send_progress method signature, but it's declared as str. This will cause a type error.
| def teardown(self, trainer: Trainer, pl_module: LightningModule, stage: RunningStage) -> None: | ||
| self._send_progress(1.0, stage) |
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The teardown method signature uses RunningStage type but the setup method uses str type. These should be consistent, and both should use RunningStage type.
| cached_still_running = await cls.is_job_still_running(job_id=job_id) | ||
| last_status_check = now | ||
| still_running = cached_still_running | ||
| yield json.dumps({"progress": job.progress, "stage": job.stage}) |
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The job object is fetched once outside the loop but never refreshed. Progress and stage values will remain static throughout the streaming. The job should be refetched from the database in each iteration.
Signed-off-by: Ashwin Vaidya <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.
Files not reviewed (1)
- application/ui/package-lock.json: Language not supported
π‘ Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
application/ui/src/features/inspect/statusbar/items/progressbar.component.tsx
Outdated
Show resolved
Hide resolved
Co-authored-by: Copilot <[email protected]> Signed-off-by: Ashwin Vaidya <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.
Files not reviewed (1)
- application/ui/package-lock.json: Language not supported
π‘ Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
application/backend/src/db/schema.py
Outdated
| id: Mapped[str] = mapped_column(primary_key=True, default=lambda: str(uuid4())) | ||
| project_id: Mapped[str] = mapped_column(ForeignKey("projects.id")) | ||
| type: Mapped[str] = mapped_column(String(64), nullable=False) | ||
| stage: Mapped[JobStage] = mapped_column(Enum(JobStage), nullable=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the advantage of declaring an Enum column in the DB schema?
What is the values that is actually stored in DB, e.g. int or str?
I wonder what happens if you add/remove job stages, will queries still work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm not sure how it will respond when we modify it. I was following Mark's suggestion #3045 (comment). Maybe we can achieve this in a different manner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've run some testing with Enum columns. It seems to be safe, at least with StrEnum, it saves the enum as a string. Adding/removing values in the code does not seem to require the DB to be updated.
| class JobStage(StrEnum): | ||
| """Job stages follow PyTorch Lightning stages with the addition of idle stage. | ||
| See ``lightning.pytorch.trainer.states.RunningStage`` for more details. | ||
| """ | ||
|
|
||
| IDLE = "idle" | ||
| TRAINING = "train" | ||
| SANITY_CHECKING = "sanity_check" | ||
| VALIDATING = "validate" | ||
| TESTING = "test" | ||
| PREDICTING = "predict" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we are going to have other types of jobs in the future, but these are only related to training and there is overlap with job.status. Perhaps, it's more appropriate to store the training stage as part of job.message string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will have to extract it from the string if we want to show it in the progress bar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The job message can be reset based on the current stage. So you could just use the whole message.
ATM, it seems you're not setting job.message at all.
In Geti classic, job.step.message was displayed:

Looks like JobStage corresponds to job.step in the image above. However, our job progress is global, so we don't have this mechanism of tracking individual stage progress. Not sure how well this works in terms of UX: users would see progress resetting from 100 to 0 (e.g. from stage "training" to "validate"), but they don't know the total number of stages elapsed or remaining.
| logger.debug("Syncing progress with db stopped") | ||
| synchronization_task.cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could put this 2 lines in a finally block
| # Test callbacks | ||
| def on_test_start(self, trainer: Trainer, pl_module: LightningModule) -> None: | ||
| """Called when testing starts.""" | ||
| self._send_progress(0, JobStage.TESTING) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come the progress is 0 here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to treat each stage as having different progress instead of clubbing train->test as a single progress.
Currently the progressbar will first show the training progress, and then the testing progress. This just makes the progress a bit verbose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment above regarding this "staged" progress tracking approach.
Co-authored-by: Max Xiang <[email protected]> Signed-off-by: Ashwin Vaidya <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 9 out of 10 changed files in this pull request and generated 1 comment.
Files not reviewed (1)
- application/ui/package-lock.json: Language not supported
π‘ Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Capture pytorch stdout logs into logger | ||
| with redirect_stdout(LoggerStdoutWriter()): # type: ignore[type-var] | ||
| engine.train(model=anomalib_model, datamodule=datamodule) | ||
| engine.fit(model=anomalib_model, datamodule=datamodule) |
Copilot
AI
Nov 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed from engine.train() to engine.fit() - this appears to be an API update but should be verified that this method exists and provides the same functionality.
Signed-off-by: Ashwin Vaidya <[email protected]>
| </InferenceProvider> | ||
| </SelectedMediaItemProvider> | ||
| </Grid> | ||
| <div style={{ display: 'flex', flexDirection: 'column', height: '100%' }}> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's revert this change, the extra div is not needed.
| </Grid> | ||
| <div style={{ display: 'flex', flexDirection: 'column', height: '100%' }}> | ||
| <Grid | ||
| areas={['toolbar sidebar', 'canvas sidebar']} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| areas={['toolbar sidebar', 'canvas sidebar']} | |
| areas={['toolbar sidebar', 'canvas sidebar', 'footer sidebar']} |
The Grid component is essentially a wrapper around css's grid properties we can add a new footer grid cell that is used for displaying the sidebar.
| <div style={{ display: 'flex', flexDirection: 'column', height: '100%' }}> | ||
| <Grid | ||
| areas={['toolbar sidebar', 'canvas sidebar']} | ||
| rows={['size-800', 'minmax(0, 1fr)']} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the above change this should make it so that the footer takes up a minimal height, (do check this though, we might need to tweak this)
| rows={['size-800', 'minmax(0, 1fr)']} | |
| rows={['size-800', 'minmax(0, 1fr)', 'auto']} |
| <Sidebar /> | ||
| </InferenceProvider> | ||
| </SelectedMediaItemProvider> | ||
| </Grid> | ||
| <StatusBar /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| <Sidebar /> | |
| </InferenceProvider> | |
| </SelectedMediaItemProvider> | |
| </Grid> | |
| <StatusBar /> | |
| <Sidebar /> | |
| <Footer /> | |
| </InferenceProvider> | |
| </SelectedMediaItemProvider> | |
| </Grid> | |
Let's move the statusbar into a Footer component and put it inside of the grid.
| export const StatusBar = () => { | ||
| return ( | ||
| <View gridArea={'statusbar'} backgroundColor={'gray-100'} width={'100%'} height={'30px'} overflow={'hidden'}> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| export const StatusBar = () => { | |
| return ( | |
| <View gridArea={'statusbar'} backgroundColor={'gray-100'} width={'100%'} height={'30px'} overflow={'hidden'}> | |
| export const Footer = () => { | |
| return ( | |
| <View gridArea={'footer'} backgroundColor={'gray-100'} width={'100%'} height={'size-400'} overflow={'hidden'}> |
For spacing we try to adhere to spectrum's styling guidelines. This makes it so that the applications' UI is more consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently this component feels very non-idomatic react code. I will send you a component that should follow more of our guidelines.
Signed-off-by: Ashwin Vaidya <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 10 out of 11 changed files in this pull request and generated 5 comments.
Files not reviewed (1)
- application/ui/package-lock.json: Language not supported
π‘ Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return ( | ||
| <Flex direction='column' gap='size-25'> | ||
| {query.data?.map((line, idx) => <Text key={idx}> {line}</Text>)} | ||
| {query.data?.map((line, idx) => <Text key={idx}> {line.text}</Text>)} |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code attempts to access line.text but the fetchSSE function now yields the entire parsed JSON object. This will cause a runtime error if the object doesn't have a text property.
| {query.data?.map((line, idx) => <Text key={idx}> {line.text}</Text>)} | |
| {query.data?.map((line, idx) => ( | |
| <Text key={idx}> | |
| {typeof line === 'object' && line !== null && 'text' in line | |
| ? line.text | |
| : JSON.stringify(line)} | |
| </Text> | |
| ))} |
| engine = Engine( | ||
| default_root_dir=model.export_path, | ||
| logger=[trackio, tensorboard], | ||
| devices=[0], # Only single GPU training is supported for now |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hardcoded device index [0] should be derived from the device parameter or made configurable. This creates inconsistency with the device parameter handling.
| finally: | ||
| logger.debug("Syncing progress with db stopped") | ||
| synchronization_task.cancel() |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The synchronization_task variable is referenced in the finally block but may not be defined if an exception occurs before line 79-83. This will cause an UnboundLocalError.
|
|
||
| def on_test_epoch_end(self, trainer: Trainer, pl_module: LightningModule) -> None: | ||
| """Called when a test epoch ends.""" | ||
| progress = (trainer.current_epoch + 1) / trainer.max_epochs if trainer.max_epochs else 0.5 |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Magic number 0.5 (50%) is used as fallback progress when max_epochs is not available. This should be documented or made configurable to clarify the intended behavior.
|
|
||
| def on_predict_epoch_end(self, trainer: Trainer, pl_module: LightningModule) -> None: | ||
| """Called when a prediction epoch ends.""" | ||
| progress = (trainer.current_epoch + 1) / trainer.max_epochs if trainer.max_epochs else 0.5 |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicated progress calculation logic with magic number 0.5. This should be extracted to a helper method to avoid code duplication and ensure consistency.
Signed-off-by: Ashwin Vaidya <[email protected]>
π Description
β¨ Changes
Select what type of change your PR is:
β Checklist
Before you submit your pull request, please make sure you have completed the following steps:
For more information about code review checklists, see the Code Review Checklist.