Skip to content

Conversation

@ashwinvaidya17
Copy link
Contributor

πŸ“ Description

  • Provide a clear summary of the changes and the issue that has been addressed.
  • πŸ› οΈ Fixes # (issue number)

✨ Changes

Select what type of change your PR is:

  • πŸš€ New feature (non-breaking change which adds functionality)
  • 🐞 Bug fix (non-breaking change which fixes an issue)
  • πŸ”„ Refactor (non-breaking change which refactors the code base)
  • ⚑ Performance improvements
  • 🎨 Style changes (code style/formatting)
  • πŸ§ͺ Tests (adding/modifying tests)
  • πŸ“š Documentation update
  • πŸ“¦ Build system changes
  • 🚧 CI/CD configuration
  • πŸ”§ Chore (general maintenance)
  • πŸ”’ Security update
  • πŸ’₯ Breaking change (fix or feature that would cause existing functionality to not work as expected)

βœ… Checklist

Before you submit your pull request, please make sure you have completed the following steps:

  • πŸ“š I have made the necessary updates to the documentation (if applicable).
  • πŸ§ͺ I have written tests that support my changes and prove that my fix is effective or my feature works (if applicable).
  • 🏷️ My PR title follows conventional commit format.

For more information about code review checklists, see the Code Review Checklist.

Signed-off-by: Ashwin Vaidya <[email protected]>
Copy link
Contributor

@maxxgx maxxgx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @ashwinvaidya17. Looks good overall, left some comments

@ashwinvaidya17 ashwinvaidya17 marked this pull request as draft October 31, 2025 08:44
Copilot AI review requested due to automatic review settings October 31, 2025 08:44
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a progress bar feature to the inspect interface that displays real-time training progress with the ability to cancel jobs. The implementation includes both frontend UI components and backend progress tracking infrastructure.

Key changes:

  • Adds a new status bar component with integrated progress display for training jobs
  • Implements real-time progress tracking using Server-Sent Events (SSE) and Lightning callbacks
  • Introduces job cancellation functionality with corresponding API endpoints

Reviewed Changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
application/ui/src/routes/inspect/inspect.tsx Adds StatusBar component to the inspect layout
application/ui/src/features/inspect/statusbar/statusbar.component.tsx Creates status bar container component
application/ui/src/features/inspect/statusbar/items/progressbar.component.tsx Implements progress bar with job polling and cancellation
application/backend/src/utils/callbacks.py Adds Lightning callback for progress synchronization
application/backend/src/services/training_service.py Integrates progress tracking into training workflow
application/backend/src/services/job_service.py Adds progress streaming and job cancellation methods
application/backend/src/pydantic_models/job.py Extends job model with stage field and cancellation response
application/backend/src/db/schema.py Adds stage column to job database schema
application/backend/src/api/endpoints/job_endpoints.py Adds progress streaming and cancellation endpoints
Files not reviewed (1)
  • application/ui/package-lock.json: Language not supported

πŸ’‘ Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 92 to 93
def setup(self, trainer: Trainer, pl_module: LightningModule, stage: str) -> None:
self._send_progress(0, stage)
Copy link

Copilot AI Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stage parameter should be of type RunningStage to match the _send_progress method signature, but it's declared as str. This will cause a type error.

Copilot uses AI. Check for mistakes.
Comment on lines 95 to 96
def teardown(self, trainer: Trainer, pl_module: LightningModule, stage: RunningStage) -> None:
self._send_progress(1.0, stage)
Copy link

Copilot AI Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The teardown method signature uses RunningStage type but the setup method uses str type. These should be consistent, and both should use RunningStage type.

Copilot uses AI. Check for mistakes.
cached_still_running = await cls.is_job_still_running(job_id=job_id)
last_status_check = now
still_running = cached_still_running
yield json.dumps({"progress": job.progress, "stage": job.stage})
Copy link

Copilot AI Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The job object is fetched once outside the loop but never refreshed. Progress and stage values will remain static throughout the streaming. The job should be refetched from the database in each iteration.

Copilot uses AI. Check for mistakes.
Signed-off-by: Ashwin Vaidya <[email protected]>
@ashwinvaidya17 ashwinvaidya17 marked this pull request as ready for review October 31, 2025 13:01
Copilot AI review requested due to automatic review settings October 31, 2025 13:01
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.

Files not reviewed (1)
  • application/ui/package-lock.json: Language not supported

πŸ’‘ Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings October 31, 2025 13:13
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.

Files not reviewed (1)
  • application/ui/package-lock.json: Language not supported

πŸ’‘ Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

id: Mapped[str] = mapped_column(primary_key=True, default=lambda: str(uuid4()))
project_id: Mapped[str] = mapped_column(ForeignKey("projects.id"))
type: Mapped[str] = mapped_column(String(64), nullable=False)
stage: Mapped[JobStage] = mapped_column(Enum(JobStage), nullable=False)
Copy link
Contributor

@maxxgx maxxgx Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the advantage of declaring an Enum column in the DB schema?

What is the values that is actually stored in DB, e.g. int or str?

I wonder what happens if you add/remove job stages, will queries still work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not sure how it will respond when we modify it. I was following Mark's suggestion #3045 (comment). Maybe we can achieve this in a different manner?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've run some testing with Enum columns. It seems to be safe, at least with StrEnum, it saves the enum as a string. Adding/removing values in the code does not seem to require the DB to be updated.

Comment on lines 26 to 37
class JobStage(StrEnum):
"""Job stages follow PyTorch Lightning stages with the addition of idle stage.
See ``lightning.pytorch.trainer.states.RunningStage`` for more details.
"""

IDLE = "idle"
TRAINING = "train"
SANITY_CHECKING = "sanity_check"
VALIDATING = "validate"
TESTING = "test"
PREDICTING = "predict"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we are going to have other types of jobs in the future, but these are only related to training and there is overlap with job.status. Perhaps, it's more appropriate to store the training stage as part of job.message string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will have to extract it from the string if we want to show it in the progress bar.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The job message can be reset based on the current stage. So you could just use the whole message.

ATM, it seems you're not setting job.message at all.

In Geti classic, job.step.message was displayed:
image

Looks like JobStage corresponds to job.step in the image above. However, our job progress is global, so we don't have this mechanism of tracking individual stage progress. Not sure how well this works in terms of UX: users would see progress resetting from 100 to 0 (e.g. from stage "training" to "validate"), but they don't know the total number of stages elapsed or remaining.

Comment on lines +106 to +107
logger.debug("Syncing progress with db stopped")
synchronization_task.cancel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could put this 2 lines in a finally block

# Test callbacks
def on_test_start(self, trainer: Trainer, pl_module: LightningModule) -> None:
"""Called when testing starts."""
self._send_progress(0, JobStage.TESTING)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come the progress is 0 here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to treat each stage as having different progress instead of clubbing train->test as a single progress.
Currently the progressbar will first show the training progress, and then the testing progress. This just makes the progress a bit verbose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment above regarding this "staged" progress tracking approach.

Co-authored-by: Max Xiang <[email protected]>
Signed-off-by: Ashwin Vaidya <[email protected]>
Copilot AI review requested due to automatic review settings November 3, 2025 10:14
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • application/ui/package-lock.json: Language not supported

πŸ’‘ Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# Capture pytorch stdout logs into logger
with redirect_stdout(LoggerStdoutWriter()): # type: ignore[type-var]
engine.train(model=anomalib_model, datamodule=datamodule)
engine.fit(model=anomalib_model, datamodule=datamodule)
Copy link

Copilot AI Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed from engine.train() to engine.fit() - this appears to be an API update but should be verified that this method exists and provides the same functionality.

Copilot uses AI. Check for mistakes.
Signed-off-by: Ashwin Vaidya <[email protected]>
@maxxgx maxxgx linked an issue Nov 3, 2025 that may be closed by this pull request
</InferenceProvider>
</SelectedMediaItemProvider>
</Grid>
<div style={{ display: 'flex', flexDirection: 'column', height: '100%' }}>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's revert this change, the extra div is not needed.

</Grid>
<div style={{ display: 'flex', flexDirection: 'column', height: '100%' }}>
<Grid
areas={['toolbar sidebar', 'canvas sidebar']}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
areas={['toolbar sidebar', 'canvas sidebar']}
areas={['toolbar sidebar', 'canvas sidebar', 'footer sidebar']}

The Grid component is essentially a wrapper around css's grid properties we can add a new footer grid cell that is used for displaying the sidebar.

<div style={{ display: 'flex', flexDirection: 'column', height: '100%' }}>
<Grid
areas={['toolbar sidebar', 'canvas sidebar']}
rows={['size-800', 'minmax(0, 1fr)']}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the above change this should make it so that the footer takes up a minimal height, (do check this though, we might need to tweak this)

Suggested change
rows={['size-800', 'minmax(0, 1fr)']}
rows={['size-800', 'minmax(0, 1fr)', 'auto']}

Comment on lines 34 to 38
<Sidebar />
</InferenceProvider>
</SelectedMediaItemProvider>
</Grid>
<StatusBar />

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<Sidebar />
</InferenceProvider>
</SelectedMediaItemProvider>
</Grid>
<StatusBar />
<Sidebar />
<Footer />
</InferenceProvider>
</SelectedMediaItemProvider>
</Grid>

Let's move the statusbar into a Footer component and put it inside of the grid.

Comment on lines 8 to 10
export const StatusBar = () => {
return (
<View gridArea={'statusbar'} backgroundColor={'gray-100'} width={'100%'} height={'30px'} overflow={'hidden'}>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
export const StatusBar = () => {
return (
<View gridArea={'statusbar'} backgroundColor={'gray-100'} width={'100%'} height={'30px'} overflow={'hidden'}>
export const Footer = () => {
return (
<View gridArea={'footer'} backgroundColor={'gray-100'} width={'100%'} height={'size-400'} overflow={'hidden'}>

For spacing we try to adhere to spectrum's styling guidelines. This makes it so that the applications' UI is more consistent.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently this component feels very non-idomatic react code. I will send you a component that should follow more of our guidelines.

Signed-off-by: Ashwin Vaidya <[email protected]>
Copilot AI review requested due to automatic review settings November 4, 2025 10:10
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated 5 comments.

Files not reviewed (1)
  • application/ui/package-lock.json: Language not supported

πŸ’‘ Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

return (
<Flex direction='column' gap='size-25'>
{query.data?.map((line, idx) => <Text key={idx}> {line}</Text>)}
{query.data?.map((line, idx) => <Text key={idx}> {line.text}</Text>)}
Copy link

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code attempts to access line.text but the fetchSSE function now yields the entire parsed JSON object. This will cause a runtime error if the object doesn't have a text property.

Suggested change
{query.data?.map((line, idx) => <Text key={idx}> {line.text}</Text>)}
{query.data?.map((line, idx) => (
<Text key={idx}>
{typeof line === 'object' && line !== null && 'text' in line
? line.text
: JSON.stringify(line)}
</Text>
))}

Copilot uses AI. Check for mistakes.
engine = Engine(
default_root_dir=model.export_path,
logger=[trackio, tensorboard],
devices=[0], # Only single GPU training is supported for now
Copy link

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded device index [0] should be derived from the device parameter or made configurable. This creates inconsistency with the device parameter handling.

Copilot uses AI. Check for mistakes.
Comment on lines +106 to +108
finally:
logger.debug("Syncing progress with db stopped")
synchronization_task.cancel()
Copy link

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The synchronization_task variable is referenced in the finally block but may not be defined if an exception occurs before line 79-83. This will cause an UnboundLocalError.

Copilot uses AI. Check for mistakes.

def on_test_epoch_end(self, trainer: Trainer, pl_module: LightningModule) -> None:
"""Called when a test epoch ends."""
progress = (trainer.current_epoch + 1) / trainer.max_epochs if trainer.max_epochs else 0.5
Copy link

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Magic number 0.5 (50%) is used as fallback progress when max_epochs is not available. This should be documented or made configurable to clarify the intended behavior.

Copilot uses AI. Check for mistakes.

def on_predict_epoch_end(self, trainer: Trainer, pl_module: LightningModule) -> None:
"""Called when a prediction epoch ends."""
progress = (trainer.current_epoch + 1) / trainer.max_epochs if trainer.max_epochs else 0.5
Copy link

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated progress calculation logic with magic number 0.5. This should be extracted to a helper method to avoid code duplication and ensure consistency.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

πŸ“‹ [TASK] Add training progress

3 participants