An AI-Powered Multimodal Desktop Companion
Showcasing the future of human-AI interaction through voice, vision, and intelligent conversation
ServoSkull is a cutting-edge AI desktop companion that demonstrates the potential of multimodal AI agents. Built by Usual Expat Limited, this project explores advanced AI interaction capabilities through seamless integration of voice recognition, computer vision, and natural language processing.
🚀 Future Vision: ServoSkull is designed with robotics integration in mind - imagine an AI that can see, hear, speak, and eventually control servo motors to move and interact with the physical world based on conversation and visual input.
- Smart Voice Detection: Advanced silence detection with configurable thresholds
- Real-time Processing: Instant voice-to-text using OpenAI Whisper
- Natural Conversations: Context-aware responses with conversation history
- Text-to-Speech: High-quality AI voice responses
- Live Camera Feed: Real-time webcam integration with frame capture
- Visual Context: AI analyzes images sent with each message
- Multi-resolution Support: Adaptive camera resolution handling
- Privacy-First: Local processing with secure cleanup
- Multimodal Processing: Combines text, audio, and visual inputs
- OpenAI Integration: GPT models, Whisper transcription, and TTS
- Session Management: Persistent conversation context
- Real-time Communication: SignalR-based instant messaging
- Microservices: .NET Aspire orchestration for scalability
- Reactive Frontend: Angular 19 with RxJS state management
- Production Ready: Docker containerization and monitoring
- Developer Experience: Hot reload, comprehensive logging, and debugging tools
- .NET 9.0 SDK (for Aspire and backend services)
- Node.js v22+ (for Angular frontend)
- Modern Browser with WebRTC support (Chrome/Edge recommended)
- OpenAI API Key (for AI functionality)
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 8GB | 16GB+ |
| CPU | 4 cores | 8 cores+ |
| Storage | 2GB | 5GB+ |
| Browser | Chrome 90+ | Chrome/Edge Latest |
1. Clone and Navigate
git clone <repository-url>
cd desktop-companion
2. Configure OpenAI API
cd ServoSkull.ApiService
dotnet user-secrets init
dotnet user-secrets set "OpenAI:ApiKey" "<your-api-key-here>"
cd ..
3. Install Dependencies
# Install Angular dependencies
cd ServoSkull.Angular
npm install
cd ..
# Install Tailwind CLI globally (optional)
npm install -g @tailwindcss/cli
cd ServoSkull.Web
npm install
cd ..
4. Start the Application
# Start all services with Aspire orchestration
dotnet run --project ServoSkull.AppHost
5. Access the Application
- Aspire Dashboard: http://localhost:18888 (service monitoring)
- Angular App: http://localhost:4200 (main interface)
- API Service: http://localhost:5000 (backend API)
The application uses .NET Aspire for cloud-ready distributed application development and orchestration.
ServoSkull.AppHost/ # Aspire host application
├── Program.cs # Service orchestration
└── appsettings.json # Host configuration
ServoSkull.ServiceDefaults/ # Shared service configurations
├── Extensions.cs # Service collection extensions
└── OpenTelemetry.cs # Telemetry configuration
ServoSkull.ApiService/ # Backend API service
└── Program.cs # API service entry point
- Start the Aspire Host:
  dotnet run --project ServoSkull.AppHost
  This will:
  - Start the Aspire dashboard
  - Launch the API service
  - Configure service discovery
  - Initialize telemetry collection
- Access the Dashboard:
  - Open the dashboard URL in your browser
  - Monitor service health
  - View logs and telemetry
  - Check service dependencies
- Development Mode:
  # Run with hot reload
  dotnet watch run --project ServoSkull.AppHost
  # Run with detailed logging
  dotnet run --project ServoSkull.AppHost --verbose
For production deployment, Aspire provides:
- Container orchestration
- Environment-specific configurations
- Health check endpoints
- Metrics collection
- Distributed tracing
Configure production settings in appsettings.Production.json:
{
  "Aspire": {
    "Telemetry": {
      "Endpoint": "your-telemetry-endpoint",
      "Protocol": "grpc"
    },
    "Resilience": {
      "CircuitBreaker": {
        "SamplingDuration": "00:00:10"
      }
    }
  }
}
- Access the Aspire dashboard for:
  - Service status and health
  - Log aggregation
  - Performance metrics
  - Dependency mapping
  - Configuration validation
- Integrated logging with structured data:
  logger.LogInformation("Service {ServiceName} started", serviceName);
- Health check endpoints:
  curl http://localhost:18888/health
- AudioService: Handles voice detection and recording
- WebcamService: Manages webcam streams
- Both services follow Angular's dependency injection pattern
- Uses RxJS BehaviorSubjects for state management
- Provides Observable streams for reactive updates
- Maintains clean separation of concerns
- Comprehensive error handling for media devices
- Detailed logging for debugging
- User-friendly error messages
sequenceDiagram
participant User
participant UI
participant AudioService
participant WebcamService
participant SignalRService
participant AspireHost
participant Backend
User->>UI: Open Application
AspireHost->>Backend: Start Services
UI->>SignalRService: Initialize Connection
SignalRService->>Backend: Establish WebSocket
par Audio Stream
User->>UI: Enable Microphone
UI->>AudioService: Start Monitoring
AudioService->>AudioService: Setup Voice Detection
loop Voice Detection
AudioService->>AudioService: Monitor Audio Levels
alt Voice Detected
AudioService->>AudioService: Start Recording
else Silence Detected
AudioService->>AudioService: Stop Recording
AudioService->>SignalRService: Send Audio
SignalRService->>Backend: Process Audio
end
end
and Webcam Stream
User->>UI: Enable Camera
UI->>WebcamService: Start Stream
WebcamService->>WebcamService: Initialize Video
loop Frame Capture
WebcamService->>UI: Update Preview
end
end
flowchart TD
A[Start Audio Monitoring] --> B{Check Audio Level}
B -->|Level > Start Threshold| C[Start Recording]
B -->|Level <= Start Threshold| B
C --> D{Check Audio Level}
D -->|Level > Stop Threshold| D
D -->|Level <= Stop Threshold| E[Increment Silent Frames]
E --> F{Silent Frames > Threshold?}
F -->|No| D
F -->|Yes| G[Stop Recording]
G --> H[Emit Audio Blob]
H --> B
graph TD
subgraph Frontend
A[App Component]
B[Audio Controls]
C[Webcam Controls]
D[Chat UI]
subgraph Services
E[Audio Service]
F[Webcam Service]
G[SignalR Service]
end
A --> B & C & D
B --> E
C --> F
D --> G
end
subgraph Backend
H[Aspire Host]
I[API Service]
J[Service Defaults]
H --> I
H --> J
end
G <--> I
stateDiagram-v2
[*] --> Idle
Idle --> Monitoring: startMonitoring()
Monitoring --> Recording: voiceDetected
Recording --> Monitoring: silenceDetected
Recording --> Processing: stopRecording()
Processing --> Monitoring: processingComplete
Monitoring --> Idle: stopMonitoring()
state Monitoring {
[*] --> CheckingLevels
CheckingLevels --> CheckingLevels: levelBelowThreshold
CheckingLevels --> [*]: levelAboveThreshold
}
state Recording {
[*] --> Active
Active --> CountingSilence: levelBelowThreshold
CountingSilence --> Active: levelAboveThreshold
CountingSilence --> [*]: silenceThresholdReached
}
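The top-level transitions in the state diagram above can be read as a lookup table. The following is a minimal sketch of that reading, not the service's actual code; `RecorderState`, `RecorderEvent`, and `transition` are hypothetical names introduced only for illustration.

```typescript
// Hypothetical types: the names mirror the diagram labels, not real service code.
type RecorderState = "Idle" | "Monitoring" | "Recording" | "Processing";
type RecorderEvent =
  | "startMonitoring"
  | "voiceDetected"
  | "silenceDetected"
  | "stopRecording"
  | "processingComplete"
  | "stopMonitoring";

// Transition table copied from the diagram; a missing entry means the
// diagram defines no transition for that event in that state.
const transitions: Record<RecorderState, Partial<Record<RecorderEvent, RecorderState>>> = {
  Idle: { startMonitoring: "Monitoring" },
  Monitoring: { voiceDetected: "Recording", stopMonitoring: "Idle" },
  Recording: { silenceDetected: "Monitoring", stopRecording: "Processing" },
  Processing: { processingComplete: "Monitoring" },
};

// Returns the next state, or the current state when the event is not defined for it.
function transition(state: RecorderState, event: RecorderEvent): RecorderState {
  return transitions[state][event] ?? state;
}
```

Ignoring undefined events (rather than throwing) is an assumption of this sketch; it keeps stray events, such as a late `voiceDetected` after monitoring has stopped, from corrupting the state.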
flowchart TD
A[Start Operation] --> B{Check Permissions}
B -->|Denied| C[Show Permission Error]
B -->|Granted| D{Initialize Device}
D -->|Success| E[Start Stream]
D -->|Failure| F[Show Device Error]
E --> G{Monitor Stream}
G -->|Error| H{Error Type}
H -->|Recoverable| I[Attempt Recovery]
H -->|Fatal| J[Stop Stream]
I -->|Success| G
I -->|Failure| J
J --> K[Cleanup Resources]
K --> L[Show Error Message]
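The error flow above distinguishes permission errors, device errors, recoverable errors, and fatal errors. One way to sketch that split is to map the `name` of the exception thrown by `getUserMedia` to a category. The exception names below are standard `DOMException` names; the categories, the messages, and the choice of which errors count as recoverable are assumptions made for this example.

```typescript
// Hypothetical categories that follow the branches of the flowchart.
type MediaErrorKind = "permission" | "device" | "recoverable" | "fatal";

interface MediaErrorInfo {
  kind: MediaErrorKind;
  message: string; // User-facing text; the wording is illustrative only.
}

// Classifies an error by its name so the caller can decide whether to retry.
function classifyMediaError(error: { name: string }): MediaErrorInfo {
  switch (error.name) {
    case "NotAllowedError":
    case "SecurityError":
      return { kind: "permission", message: "Camera or microphone access was denied." };
    case "NotFoundError":
      return { kind: "device", message: "No matching camera or microphone was found." };
    case "NotReadableError":
    case "AbortError":
      // Assumption: a busy or briefly failing device is worth one retry.
      return { kind: "recoverable", message: "The device is busy or failed to start." };
    case "OverconstrainedError":
      // Assumption: retrying with looser constraints may succeed.
      return { kind: "recoverable", message: "The requested media settings are not supported." };
    default:
      return { kind: "fatal", message: "An unexpected media error occurred." };
  }
}
```

A caller could wrap `getUserMedia` in `try`/`catch`, pass the caught error to `classifyMediaError`, and attempt recovery only when `kind` is `"recoverable"`.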
The application uses the WebRTC API to access the user's camera through the WebcamService. Here's how it works:
- Stream Initialization:
  const stream = await navigator.mediaDevices.getUserMedia({
    video: {
      width: { ideal: 640 },
      height: { ideal: 480 },
      facingMode: 'user'
    }
  });
- Frame Capture:
  - Primary method uses the modern ImageCapture API:
    const imageCapture = new ImageCapture(videoTrack);
    const blob = await imageCapture.takePhoto();
  - Fallback to Canvas API if ImageCapture is not supported:
    const canvas = document.createElement('canvas');
    // Match the canvas to the video frame; the default canvas size is 300x150
    canvas.width = videoElement.videoWidth;
    canvas.height = videoElement.videoHeight;
    const ctx = canvas.getContext('2d');
    ctx.drawImage(videoElement, 0, 0);
    const dataUrl = canvas.toDataURL('image/png');
- Resource Management:
  - Proper cleanup of video tracks when stopping the stream
  - Automatic resource release when component is destroyed
  - Server-side rendering (SSR) safety checks
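The track cleanup listed under Resource Management might look roughly like the sketch below. `releaseStream`, `TrackLike`, and `StreamLike` are hypothetical names; the structural interfaces exist only so the example runs outside a browser, and a real `MediaStream` with its `MediaStreamTrack` objects satisfies them.

```typescript
// Minimal structural types; browser MediaStreamTrack and MediaStream match these shapes.
interface TrackLike {
  readyState: string; // "live" or "ended" on a real MediaStreamTrack
  stop(): void;
}

interface StreamLike {
  getTracks(): TrackLike[];
}

// Hypothetical helper: stops every live track and returns how many were stopped.
function releaseStream(stream: StreamLike | null): number {
  if (!stream) {
    return 0; // Nothing to release, for example during SSR or before the first start
  }
  let stopped = 0;
  for (const track of stream.getTracks()) {
    if (track.readyState === "live") {
      track.stop(); // Releases the camera so the browser indicator turns off
      stopped++;
    }
  }
  return stopped;
}
```

Calling a helper like this from both the stop handler and the component's destroy hook would cover the first two cleanup bullets with one code path.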
The application uses the Web Audio API and MediaRecorder API for sophisticated audio handling through the AudioService. The service integrates with SignalR for real-time communication and provides reactive state management:
- Audio Configuration:
  interface AudioConfig {
    sampleRate: number;            // Audio sampling rate
    channels: number;              // Number of audio channels
    startThreshold: number;        // Volume threshold to start recording
    stopThreshold: number;         // Lower threshold to maintain recording
    silenceThreshold: number;      // Time in ms to consider silence
    smoothingTimeConstant: number; // Smoothing factor for analysis
  }
  // Default configuration
  const defaultConfig = {
    sampleRate: 16000,        // 16kHz for optimal speech
    channels: 1,              // Mono audio
    startThreshold: 0.24,     // Start at 24% volume
    stopThreshold: 0.15,      // Keep recording until 15%
    silenceThreshold: 2000,   // Stop after 2s silence
    smoothingTimeConstant: 0.8
  };
- State Management:
  interface AudioMonitorState {
    isMonitoring: boolean;   // Audio analysis active
    isRecording: boolean;    // Currently recording
    voiceDetected: boolean;  // Voice detected
    audioLevel: number;      // Current volume (0-1)
  }
  interface AudioPlaybackState {
    isPlaying: boolean;
    duration: number;
    currentTime: number;
  }
  // Service provides observables for state
  audioService.monitorState$: Observable<AudioMonitorState>
  audioService.playbackState$: Observable<AudioPlaybackState>
  audioService.isRecording$: Observable<boolean>
- Audio Stream Setup:
  const constraints: MediaStreamConstraints = {
    audio: {
      sampleRate: config.sampleRate,
      channelCount: config.channels,
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true
    }
  };
- Voice Detection System:
  - Uses Web Audio API's AnalyserNode for real-time analysis
  - Calculates RMS (Root Mean Square) for natural volume measurement
  - Implements dual-threshold approach with hysteresis:
    // Calculate RMS value
    let sum = 0;
    let nonZeroCount = 0;
    for (let i = 0; i < bufferLength; i++) {
      const value = dataArray[i] / 255;
      if (value > 0) {
        sum += value * value;
        nonZeroCount++;
      }
    }
    const normalizedLevel = Math.sqrt(sum / (nonZeroCount || bufferLength));
- Recording Management:
  - Automatic start on voice detection
  - Smart silence detection for auto-stop
  // Start recording when voice detected
  if (voiceDetected && !currentState.isRecording) {
    this.startRecordingInternal();
    lowVolumeFrames = 0;
  }
  // Stop after sustained silence
  if (lowVolumeFrames >= framesToWait) {
    this.stopRecordingInternal();
  }
- Audio Playback:
  - Supports multiple audio formats
  - Handles base64 encoded audio data
  - Provides playback controls and state
  async playAudio(base64Audio: string): Promise<void> {
    const audioBlob = this.base64ToBlob(base64Audio);
    const audioUrl = URL.createObjectURL(audioBlob);
    this.currentAudio = new Audio(audioUrl);
    await this.currentAudio.play();
  }
- Integration with Chat:
  // Chat component integration
  this.audioService.audioRecorded$
    .pipe(takeUntil(this.destroy$))
    .subscribe(async audioBlob => {
      await this.signalRService.sendAudioMessage(audioBlob);
    });
  // Handle playback in chat
  async handleAudioPlayback(message: ChatMessage): Promise<void> {
    if (this.isPlayingAudio(message)) {
      await this.audioService.stopPlayback();
    } else {
      await this.audioService.playAudio(message.audioData!);
    }
  }
- Error Handling:
  - Comprehensive error states
  - Automatic recovery attempts
  - SSR (Server-Side Rendering) safety checks
  if (!this.isBrowser) {
    return throwError(() =>
      new Error('Audio capture not available during SSR')
    );
  }
The system provides:
- Automatic voice-activated recording
- Real-time audio level monitoring
- Smooth playback experience
- Integration with chat interface
- Proper resource management
- Error resilience and recovery
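The `playAudio` snippet earlier in this section calls a `base64ToBlob` helper that is not shown. A minimal sketch of such a helper follows; the implementation and the `audio/mpeg` default MIME type are assumptions, since the actual audio format returned by the backend is not stated here.

```typescript
// Hypothetical implementation of the base64ToBlob helper referenced by playAudio.
// The default MIME type is an assumption; pass the real type if it is known.
function base64ToBlob(base64: string, mimeType: string = "audio/mpeg"): Blob {
  // Drop an optional data-URL prefix such as "data:audio/mpeg;base64,".
  // A comma never appears in the base64 alphabet, so this split is safe.
  const commaIndex = base64.indexOf(",");
  const payload = commaIndex >= 0 ? base64.slice(commaIndex + 1) : base64;

  // atob yields a binary string with one character per decoded byte.
  const binary = atob(payload);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Blob([bytes], { type: mimeType });
}
```

When a blob like this is turned into an object URL for playback, calling `URL.revokeObjectURL` once playback ends releases the memory held by the URL.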
The application implements real-time voice detection using the Web Audio API's AnalyserNode to monitor audio input and automatically manage recording. The following walks through how the system works:
The Web Audio API provides a powerful audio processing pipeline through the AnalyserNode. Here's how we set it up:
private setupAudioAnalysis(stream: MediaStream, config: AudioConfig) {
  // Create audio context and analyzer
  this.audioContext = new AudioContext();
  this.analyzer = this.audioContext.createAnalyser();
  // Configure for optimal voice detection
  this.analyzer.fftSize = 2048; // For detailed frequency analysis
  this.analyzer.smoothingTimeConstant = config.smoothingTimeConstant;
  // Connect stream to analyzer
  const source = this.audioContext.createMediaStreamSource(stream);
  source.connect(this.analyzer);
  // Start monitoring if enabled
  if (this.monitorState.value.isMonitoring) {
    requestAnimationFrame(this.checkAudioLevel);
  }
}
Why these settings?
- fftSize = 2048: This gives us 1024 frequency bins (fftSize/2), providing enough detail to resolve the fundamental frequencies of the human voice (typically 85-255 Hz) while maintaining good performance.
- smoothingTimeConstant: Acts like a low-pass filter, smoothing out rapid fluctuations in the audio signal. A value of 0.8 means each new value is weighted at 20%, preventing false triggers from brief spikes.
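To make the two settings above concrete, the sketch below computes the width of one frequency bin and applies one step of the smoothing formula that the Web Audio specification defines for AnalyserNode. The function names are hypothetical, and the 48 kHz sample rate used in the note afterwards is an assumption: an `AudioContext` created without options runs at the output device's rate, which varies by machine.

```typescript
// Width of one analyser bin in Hz: fftSize / 2 bins cover 0 to sampleRate / 2.
function binWidthHz(sampleRate: number, fftSize: number): number {
  return sampleRate / fftSize;
}

// Index of the bin whose center is nearest to the given frequency.
function nearestBin(frequencyHz: number, sampleRate: number, fftSize: number): number {
  return Math.round(frequencyHz / binWidthHz(sampleRate, fftSize));
}

// One smoothing step: tau is smoothingTimeConstant, so tau = 0.8 keeps 80%
// of the previous value and takes 20% from the new one.
function smooth(previous: number, current: number, tau: number): number {
  return tau * previous + (1 - tau) * current;
}
```

Under the 48 kHz assumption, `binWidthHz(48000, 2048)` is 23.4375 Hz, so the 85-255 Hz range falls roughly in bins 4 through 11. A spike from 0 to 1 with `tau = 0.8` moves the smoothed value only to about 0.2 on the first frame, which is why brief spikes do not reach the start threshold on their own.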
We use RMS (Root Mean Square) calculation for volume measurement. But why RMS instead of a simple average?
// Get frequency data from analyzer
const bufferLength = this.analyzer.frequencyBinCount; // fftSize / 2
const dataArray = new Uint8Array(bufferLength);
this.analyzer.getByteFrequencyData(dataArray);
// Calculate RMS with improved accuracy
let sum = 0;
let nonZeroCount = 0;
for (let i = 0; i < bufferLength; i++) {
  const value = dataArray[i] / 255;
  if (value > 0) {
    sum += value * value;
    nonZeroCount++;
  }
}
const normalizedLevel = Math.sqrt(sum / (nonZeroCount || bufferLength));
Understanding RMS:
- Why RMS? Human perception of sound intensity is logarithmic, not linear. RMS better represents how we perceive loudness because it:
  - Emphasizes larger values (squaring)
  - Handles both positive and negative sound waves
  - Correlates better with perceived volume than simple averaging
- Implementation Details:
  - We normalize values to 0-1 range (/ 255) for consistent thresholds
  - We count non-zero values to handle silence more accurately
  - The square root at the end converts back to a linear scale
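The effect of counting only non-zero values can be seen by comparing the calculation above with a conventional RMS over every bin. The sketch below does that on plain byte arrays; `rmsNonZero` and `rmsAll` are hypothetical names, and the empty-array guard is an addition of this example.

```typescript
// RMS over non-zero bins only, mirroring the snippet above.
function rmsNonZero(data: Uint8Array): number {
  if (data.length === 0) {
    return 0; // Guard added for this sketch to avoid dividing by zero
  }
  let sum = 0;
  let nonZeroCount = 0;
  for (let i = 0; i < data.length; i++) {
    const value = data[i] / 255; // Normalize each byte to the 0-1 range
    if (value > 0) {
      sum += value * value;
      nonZeroCount++;
    }
  }
  return Math.sqrt(sum / (nonZeroCount || data.length));
}

// Conventional RMS over every bin, for comparison.
function rmsAll(data: Uint8Array): number {
  if (data.length === 0) {
    return 0;
  }
  let sum = 0;
  for (let i = 0; i < data.length; i++) {
    const value = data[i] / 255;
    sum += value * value;
  }
  return Math.sqrt(sum / data.length);
}
```

For a spectrum where half the bins are silent and half are at full scale, `rmsNonZero` reports 1 while `rmsAll` reports about 0.707. The non-zero variant therefore reads higher on sparse spectra, which is worth keeping in mind when tuning the 0.24 and 0.15 thresholds.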
Our system uses a dual-threshold approach with frame counting for robust voice detection. This sophisticated approach prevents false triggers while maintaining natural conversation flow:
// Handle frame counting for silence detection
if (currentState.isRecording) {
  if (!voiceDetected) {
    lowVolumeFrames++;
    if (lowVolumeFrames >= framesToWait) {
      this.stopRecordingInternal();
      lowVolumeFrames = 0;
    }
  } else {
    lowVolumeFrames = 0;
  }
}
// Handle recording state
if (voiceDetected && !currentState.isRecording) {
  this.startRecordingInternal();
}
The Dual-Threshold System Explained:
- Start Threshold (0.24 or 24%):
  - Higher threshold for starting recording
  - Prevents false triggers from background noise
  - Chosen based on typical speech volume patterns
- Stop Threshold (0.15 or 15%):
  - Lower threshold for maintaining recording
  - Allows natural pauses in speech
  - Prevents cutting off quiet syllables
- Frame-Based Silence Detection:
  - Counts frames below stop threshold
  - 2-second timeout = 120 frames at 60fps (requestAnimationFrame follows the display refresh rate, so the frame count differs on other displays)
  - Why 2 seconds? It is long enough to cover natural pauses in conversation without ending the recording early
  - Resets counter when voice is detected again
This creates a "hysteresis" effect:
Volume
^
| Start Recording
0.24| ┌───────────────────────────────┐
| │ Keep Recording │
0.15| │ │
| │ └─────
└───┘ Stop recording
Time
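The hysteresis behavior described above can be sketched as a pure step function that takes the current level and returns the next state plus the action to perform. This is an illustration under stated assumptions, not the service's code: `DetectorConfig`, `DetectorState`, `framesToWait`, and `step` are hypothetical names, and a fixed `frameRate` is assumed in place of the real `requestAnimationFrame` cadence.

```typescript
interface DetectorConfig {
  startThreshold: number; // Level that must be exceeded to start recording
  stopThreshold: number;  // Level at or below which a frame counts as silent
  silenceMs: number;      // Sustained silence required before stopping
  frameRate: number;      // Assumed analysis frames per second
}

interface DetectorState {
  isRecording: boolean;
  lowVolumeFrames: number;
}

type DetectorAction = "start" | "stop" | "none";

// Converts the silence timeout into a frame count (2000 ms at 60 fps gives 120).
function framesToWait(config: DetectorConfig): number {
  return Math.ceil((config.silenceMs / 1000) * config.frameRate);
}

// Processes one audio level sample and returns the next state and action.
function step(
  state: DetectorState,
  level: number,
  config: DetectorConfig
): { state: DetectorState; action: DetectorAction } {
  if (!state.isRecording) {
    // Idle: only the higher start threshold can trigger recording.
    if (level > config.startThreshold) {
      return { state: { isRecording: true, lowVolumeFrames: 0 }, action: "start" };
    }
    return { state, action: "none" };
  }
  if (level > config.stopThreshold) {
    // Recording and still above the lower threshold: reset the silence counter.
    return { state: { isRecording: true, lowVolumeFrames: 0 }, action: "none" };
  }
  const lowVolumeFrames = state.lowVolumeFrames + 1;
  if (lowVolumeFrames >= framesToWait(config)) {
    return { state: { isRecording: false, lowVolumeFrames: 0 }, action: "stop" };
  }
  return { state: { isRecording: true, lowVolumeFrames }, action: "none" };
}
```

A level of 0.2 illustrates the hysteresis band: it does not start a recording from idle, but it keeps an active recording alive because it is still above the 0.15 stop threshold. Keeping the logic free of side effects also makes it straightforward to unit test without a microphone.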
The service provides reactive state observables for real-time monitoring:
interface AudioMonitorState {
  isMonitoring: boolean;   // Audio analysis active
  isRecording: boolean;    // Currently recording
  voiceDetected: boolean;  // Voice detected
  audioLevel: number;      // Current volume (0-1)
}
// Components can subscribe to state changes
audioService.monitorState$.subscribe(state => {
  updateUI(state);
});
// Monitor state changes trigger UI updates
private monitorState = new BehaviorSubject<AudioMonitorState>({
  isMonitoring: false,
  isRecording: false,
  voiceDetected: false,
  audioLevel: 0
});
The chat component handles audio recording and playback:
export class ChatComponent {
  constructor(
    private audioService: AudioService,
    private signalRService: SignalRService
  ) {
    // Handle completed recordings
    this.audioService.audioRecorded$
      .pipe(takeUntil(this.destroy$))
      .subscribe(async audioBlob => {
        try {
          // Send audio through SignalR
          await this.signalRService.sendAudioMessage(audioBlob);
        } catch (error) {
          console.error('Error sending audio:', error);
        }
      });
    // Handle audio playback state
    this.audioService.playbackState$
      .pipe(takeUntil(this.destroy$))
      .subscribe(state => {
        this.isAudioPlaying = state.isPlaying;
        this.cdr.markForCheck();
      });
  }
  // Handle message playback
  async handleAudioPlayback(message: ChatMessage): Promise<void> {
    if (this.isPlayingAudio(message)) {
      await this.audioService.stopPlayback();
    } else {
      await this.audioService.playAudio(message.audioData!);
    }
  }
}
The service implements comprehensive resource cleanup:
private cleanupAudioResources(): void {
  // Stop monitoring
  this.monitorState.next({
    isMonitoring: false,
    isRecording: false,
    voiceDetected: false,
    audioLevel: 0
  });
  // Clean up analyzer
  if (this.analyzer) {
    this.analyzer.disconnect();
    this.analyzer = null;
  }
  // Close audio context (guard against a context that was never created)
  if (this.audioContext && this.audioContext.state !== 'closed') {
    this.audioContext.close();
  }
  this.audioContext = null;
  // Stop recording
  if (this.mediaRecorder?.state === 'recording') {
    this.mediaRecorder.stop();
  }
  this.mediaRecorder = null;
  this.audioChunks = [];
  this.checkAudioLevel = null;
}
- Chrome/Edge (recommended)
- Firefox
- Safari (limited support)
- Audio recording requires explicit user permission
- Webcam access requires HTTPS in production
- Some browsers may have limited codec support
Comprehensive documentation is available in the docs/ directory:
- Product Overview - Executive summary and value proposition
- Architecture Guide - Technical deep-dive and system design
- Demo Guide - Interactive demonstration flows
- Development Guide - Complete developer setup and contribution guide
Contributions are welcome! This project follows clean architecture principles and modern development practices. Please see the Development Guide for:
- Development environment setup
- Code architecture patterns
- Testing strategies
- Pull request process
ServoSkull is developed by Usual Expat Limited as a demonstration platform for advanced AI interaction capabilities. This project showcases the potential of multimodal AI agents and serves as a foundation for future robotics integration.
MIT License - see LICENSE file for details.