"Information should not be displayed all at once; let people gradually become familiar with it." - Edward Tufte
Tired of playing "Find the Important Number" in your terminal while your ML model trains? Watching your GPU temperature shouldn't feel like decoding the Matrix.
Transform complex GPU metrics into intuitive visual patterns
Do you find yourself:
- Constantly switching between terminal windows?
- Squinting at rapidly changing numbers?
- Missing critical spikes in GPU usage?
- Wondering if that temperature is actually concerning?
- Spending mental energy parsing dense data when you should be focusing on your work?
We've reimagined GPU monitoring with human-centered design principles. Instead of parsing dense terminal output, our dashboard leverages intuitive visual affordances - color gradients that immediately signal temperature states, progress bars that show memory usage at a glance, and trend indicators that make pattern recognition effortless.
This beautiful, real-time GPU monitoring dashboard transforms complex metrics into an intuitive interface. Built with React and Flask, it reduces cognitive load through thoughtful information hierarchy and visual signifiers, letting you focus on your work while maintaining awareness of your GPU's health.
Key Design Principles:
- Reduced Cognitive Load: Visual patterns over dense numbers
- Intuitive Signifiers: Color-coding that maps to severity levels
- Information Hierarchy: Critical metrics prominently displayed
- Pattern Recognition: Trend visualization for quick analysis
- Attention Management: Alerts that demand attention only when needed
Traditional nvidia-smi command line output - dense numbers requiring constant cognitive processing
Dashboard showing Ollama running the Mistral-Small model - clear resource utilization
Intensive GPU stress testing with gpu-burn - immediate visual alerts
Responsive design automatically adapts to any screen size
- Frontend: React 18, TypeScript, Vite
- Backend: Python, Flask
- System Tools: nvidia-smi, gpu-burn, CUDA Toolkit
- Clone the repository:

  ```bash
  git clone https://github.com/jackccrawford/nvidia-gpu-perf-monitor.git
  ```

- Install backend dependencies:

  ```bash
  cd backend
  pip install -r requirements.txt
  ```

- Install frontend dependencies:

  ```bash
  cd frontend
  npm install
  ```

- Start the application:

  ```bash
  ./restart.sh
  ```
The `restart.sh` script handles:
- Stopping any existing instances of the frontend and backend services
- Starting the Flask backend server
- Starting the React frontend development server
- Ensuring proper startup sequence and port availability
Pro Tip: Always use `restart.sh` to start/restart the application. It ensures a clean startup and proper service coordination.

If you need to start the services manually (not recommended):

```bash
# Backend
cd backend
python gpu_service.py

# Frontend (in a new terminal)
cd frontend
npm run dev
```
The project includes scripts for managing the frontend and backend services:
- `restart.sh`: Gracefully stops and restarts both the frontend and backend services for reliable operation.
- `stop_servers.sh`: Stops both services gracefully, freeing their ports and ensuring no residual processes remain.
These scripts are designed for development use and may require adjustments for production environments.
The frontend application uses a polling mechanism to fetch GPU statistics at regular intervals. Recent improvements include:
- Enhanced logging to track fetch operation timing and identify potential delays.
- Improved error handling to capture detailed error information during data fetching.
Recent efforts focused on resolving an intermittent refresh issue in the frontend. Key actions included:
- Adding detailed console logging to monitor API calls and responses.
- Verifying backend API response consistency and improving state management in the React component.
These changes have stabilized the application's performance, ensuring accurate and timely GPU monitoring.
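As a rough illustration of this polling loop, the sketch below shows one way a React component could fetch GPU statistics on a fixed interval, with fetch-timing logs and error handling. The `/api/gpu-stats` endpoint, the `GpuStats` fields, and the 2-second interval are assumptions for illustration, not the project's actual API.

```typescript
import { useEffect, useState } from 'react';

// Hypothetical response shape - the real backend may expose different fields.
interface GpuStats {
  temperature: number;   // °C
  utilization: number;   // %
  memoryUsed: number;    // MiB
  memoryTotal: number;   // MiB
}

const POLL_INTERVAL_MS = 2000; // assumed refresh rate

export function useGpuStats() {
  const [stats, setStats] = useState<GpuStats | null>(null);
  const [error, setError] = useState<string | null>(null);

  useEffect(() => {
    let cancelled = false;

    const fetchStats = async () => {
      const startedAt = performance.now();
      try {
        const response = await fetch('/api/gpu-stats'); // assumed endpoint
        if (!response.ok) {
          throw new Error(`HTTP ${response.status}`);
        }
        const data: GpuStats = await response.json();
        if (!cancelled) {
          setStats(data);
          setError(null);
        }
        // Log fetch duration to spot slow or delayed polls
        console.debug(`GPU stats fetched in ${Math.round(performance.now() - startedAt)} ms`);
      } catch (err) {
        console.error('GPU stats fetch failed:', err);
        if (!cancelled) {
          setError(err instanceof Error ? err.message : String(err));
        }
      }
    };

    fetchStats();                                       // fetch immediately on mount
    const timer = setInterval(fetchStats, POLL_INTERVAL_MS);
    return () => {                                      // clean up on unmount
      cancelled = true;
      clearInterval(timer);
    };
  }, []);

  return { stats, error };
}
```

A component using such a hook can render `stats` directly and surface `error` when a fetch fails, mirroring the logging and state-management improvements described above.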
Example: monitoring an ML training run.

```bash
# Run your ML training
python train.py --model large --epochs 100
```
Monitor in real-time:
- GPU utilization during training
- Memory consumption patterns
- Temperature trends
- Process-specific metrics
Example: stress testing with gpu-burn.

```bash
# Run gpu-burn
./gpu-burn 60  # 1 minute stress test - watch closely!
```
Monitor in dashboard:
- Temperature peaks
- Error detection
- Duration tracking
- Performance metrics
CAUTION: Never run stress tests for extended periods. This can damage your GPU!
Color-coded temperature ranges for intuitive monitoring:
- Red (≥85°C): Danger zone
- Orange (80-84°C): Warning
- Yellow (70-79°C): Normal gaming temp
- Green (65-69°C): Ideal temperature
- Blue (50-64°C): Cool
- Indigo (<50°C): Very cool
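As a minimal sketch of how these thresholds could drive the color coding in the frontend, assuming a simple helper function (the function name and color tokens below are illustrative, not taken from the project):

```typescript
// Map a GPU temperature (°C) to the severity color documented above.
// Illustrative only - the real dashboard may use different tokens or gradients.
export function temperatureColor(tempC: number): string {
  if (tempC >= 85) return 'red';     // Danger zone
  if (tempC >= 80) return 'orange';  // Warning
  if (tempC >= 70) return 'yellow';  // Normal gaming temp
  if (tempC >= 65) return 'green';   // Ideal temperature
  if (tempC >= 50) return 'blue';    // Cool
  return 'indigo';                   // Very cool
}
```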
- No sensitive data collection
- Local-only operation
- Process information filtered
- Safe subprocess execution
- Error handling for all operations
This project is licensed under the MIT License - see the LICENSE file for details.
- NVIDIA for nvidia-smi toolkit
- gpu-burn for stress testing
- React for frontend framework
- Flask for backend framework
- Icons by Heroicons and Phosphor
Made with ❤️ for the GPU community
Developed with assistance from Codeium Windsurf