ServoSkull 🤖

An AI-Powered Multimodal Desktop Companion

Showcasing the future of human-AI interaction through voice, vision, and intelligent conversation

ServoSkull is a cutting-edge AI desktop companion that demonstrates the potential of multimodal AI agents. Built by Usual Expat Limited, this project explores advanced AI interaction capabilities through seamless integration of voice recognition, computer vision, and natural language processing.

🚀 Future Vision: ServoSkull is designed with robotics integration in mind - imagine an AI that can see, hear, speak, and eventually control servo motors to move and interact with the physical world based on conversation and visual input.

✨ Key Features

🎤 Voice-Activated AI Interaction

Smart Voice Detection: Advanced silence detection with configurable thresholds
Real-time Processing: Instant voice-to-text using OpenAI Whisper
Natural Conversations: Context-aware responses with conversation history
Text-to-Speech: High-quality AI voice responses

👁️ Computer Vision Integration

Live Camera Feed: Real-time webcam integration with frame capture
Visual Context: AI analyzes images sent with each message
Multi-resolution Support: Adaptive camera resolution handling
Privacy-First: Local processing with secure cleanup

🧠 Advanced AI Capabilities

Multimodal Processing: Combines text, audio, and visual inputs
OpenAI Integration: GPT models, Whisper transcription, and TTS
Session Management: Persistent conversation context
Real-time Communication: SignalR-based instant messaging

⚡ Modern Architecture

Microservices: .NET Aspire orchestration for scalability
Reactive Frontend: Angular 19 with RxJS state management
Production Ready: Docker containerization and monitoring
Developer Experience: Hot reload, comprehensive logging, and debugging tools

🚀 Quick Start

Prerequisites

.NET 9.0 SDK (for Aspire and backend services)
Node.js v22+ (for Angular frontend)
Modern Browser with WebRTC support (Chrome/Edge recommended)
OpenAI API Key (for AI functionality)

System Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| RAM | 8GB | 16GB+ |
| CPU | 4 cores | 8 cores+ |
| Storage | 2GB | 5GB+ |
| Browser | Chrome 90+ | Chrome/Edge Latest |

Installation & Setup

1. Clone and Navigate

git clone <repository-url>
cd desktop-companion

2. Configure OpenAI API

cd ServoSkull.ApiService
dotnet user-secrets init
dotnet user-secrets set "OpenAI:ApiKey" "<your-api-key-here>"
cd ..

3. Install Dependencies

# Install Angular dependencies
cd ServoSkull.Angular
npm install
cd ..

# Install Tailwind CLI globally (optional)
npm install -g @tailwindcss/cli

cd ServoSkull.Web
npm install
cd ..

4. Start the Application

# Start all services with Aspire orchestration
dotnet run --project ServoSkull.AppHost

5. Access the Application

Aspire Dashboard: http://localhost:18888 (service monitoring)
Angular App: http://localhost:4200 (main interface)
API Service: http://localhost:5000 (backend API)

.NET Aspire Integration

The application uses .NET Aspire for cloud-ready distributed application development and orchestration.

App Host Structure

ServoSkull.AppHost/              # Aspire host application
├── Program.cs                   # Service orchestration
└── appsettings.json            # Host configuration

ServoSkull.ServiceDefaults/      # Shared service configurations
├── Extensions.cs               # Service collection extensions
└── OpenTelemetry.cs           # Telemetry configuration

ServoSkull.ApiService/          # Backend API service
└── Program.cs                 # API service entry point

Development Workflow

Start the Aspire Host:
```
dotnet run --project ServoSkull.AppHost
```
This will:
- Start the Aspire dashboard
- Launch the API service
- Configure service discovery
- Initialize telemetry collection

Access the Dashboard:
- Open the dashboard URL in your browser
- Monitor service health
- View logs and telemetry
- Check service dependencies

Development Mode:

# Run with hot reload
dotnet watch run --project ServoSkull.AppHost

# Run with detailed logging
dotnet run --project ServoSkull.AppHost --verbose

Production Deployment

For production deployment, Aspire provides:

Container orchestration
Environment-specific configurations
Health check endpoints
Metrics collection
Distributed tracing

Configure production settings in appsettings.Production.json:

{
  "Aspire": {
    "Telemetry": {
      "Endpoint": "your-telemetry-endpoint",
      "Protocol": "grpc"
    },
    "Resilience": {
      "CircuitBreaker": {
        "SamplingDuration": "00:00:10"
      }
    }
  }
}

Monitoring and Debugging

Access the Aspire dashboard for:
- Service status and health
- Log aggregation
- Performance metrics
- Dependency mapping
- Configuration validation

Integrated logging with structured data:

logger.LogInformation("Service {ServiceName} started", serviceName);

Health check endpoints:
```
curl http://localhost:5000/health
```

System Design

Frontend Services

AudioService: Handles voice detection and recording
WebcamService: Manages webcam streams
Both services follow Angular's dependency injection pattern

State Management

Uses RxJS BehaviorSubjects for state management
Provides Observable streams for reactive updates
Maintains clean separation of concerns

Error Handling

Comprehensive error handling for media devices
Detailed logging for debugging
User-friendly error messages

Diagrams

sequenceDiagram
    participant User
    participant UI
    participant AudioService
    participant WebcamService
    participant SignalRService
    participant AspireHost
    participant Backend

    User->>UI: Open Application
    AspireHost->>Backend: Start Services
    UI->>SignalRService: Initialize Connection
    SignalRService->>Backend: Establish WebSocket

    par Audio Stream
        User->>UI: Enable Microphone
        UI->>AudioService: Start Monitoring
        AudioService->>AudioService: Setup Voice Detection
        loop Voice Detection
            AudioService->>AudioService: Monitor Audio Levels
            alt Voice Detected
                AudioService->>AudioService: Start Recording
            else Silence Detected
                AudioService->>AudioService: Stop Recording
                AudioService->>SignalRService: Send Audio
                SignalRService->>Backend: Process Audio
            end
        end
    and Webcam Stream
        User->>UI: Enable Camera
        UI->>WebcamService: Start Stream
        WebcamService->>WebcamService: Initialize Video
        loop Frame Capture
            WebcamService->>UI: Update Preview
        end
    end

flowchart TD
    A[Start Audio Monitoring] --> B{Check Audio Level}
    B -->|Level > Start Threshold| C[Start Recording]
    B -->|Level <= Start Threshold| B
    C --> D{Check Audio Level}
    D -->|Level > Stop Threshold| D
    D -->|Level <= Stop Threshold| E[Increment Silent Frames]
    E --> F{Silent Frames > Threshold?}
    F -->|No| D
    F -->|Yes| G[Stop Recording]
    G --> H[Emit Audio Blob]
    H --> B

graph TD
    subgraph Frontend
        A[App Component]
        B[Audio Controls]
        C[Webcam Controls]
        D[Chat UI]
        
        subgraph Services
            E[Audio Service]
            F[Webcam Service]
            G[SignalR Service]
        end
        
        A --> B & C & D
        B --> E
        C --> F
        D --> G
    end
    
    subgraph Backend
        H[Aspire Host]
        I[API Service]
        J[Service Defaults]
        
        H --> I
        H --> J
    end
    
    G <--> I

stateDiagram-v2
    [*] --> Idle
    Idle --> Monitoring: startMonitoring()
    Monitoring --> Recording: voiceDetected
    Recording --> Monitoring: silenceDetected
    Recording --> Processing: stopRecording()
    Processing --> Monitoring: processingComplete
    Monitoring --> Idle: stopMonitoring()
    
    state Monitoring {
        [*] --> CheckingLevels
        CheckingLevels --> CheckingLevels: levelBelowThreshold
        CheckingLevels --> [*]: levelAboveThreshold
    }
    
    state Recording {
        [*] --> Active
        Active --> CountingSilence: levelBelowThreshold
        CountingSilence --> Active: levelAboveThreshold
        CountingSilence --> [*]: silenceThresholdReached
    }
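
The top-level transitions in the state diagram above can be expressed as a small lookup table. This is a minimal sketch, not the AudioService source: the type names, the `transitions` table, and `nextAudioState` are hypothetical and only mirror the diagram labels.

```typescript
// Hypothetical names that mirror the diagram labels, not the real service code.
type AudioState = 'Idle' | 'Monitoring' | 'Recording' | 'Processing';
type AudioEvent =
  | 'startMonitoring'
  | 'stopMonitoring'
  | 'voiceDetected'
  | 'silenceDetected'
  | 'stopRecording'
  | 'processingComplete';

// One entry per arrow in the diagram.
const transitions: Record<AudioState, Partial<Record<AudioEvent, AudioState>>> = {
  Idle: { startMonitoring: 'Monitoring' },
  Monitoring: { voiceDetected: 'Recording', stopMonitoring: 'Idle' },
  Recording: { silenceDetected: 'Monitoring', stopRecording: 'Processing' },
  Processing: { processingComplete: 'Monitoring' },
};

// Returns the next state, or the current state when the event is not allowed.
function nextAudioState(state: AudioState, event: AudioEvent): AudioState {
  return transitions[state][event] ?? state;
}

// Walk one full cycle through the diagram.
const events: AudioEvent[] = [
  'startMonitoring',
  'voiceDetected',
  'stopRecording',
  'processingComplete',
];
let current: AudioState = 'Idle';
const visited: AudioState[] = [current];
for (const event of events) {
  current = nextAudioState(current, event);
  visited.push(current);
}
console.log(visited.join(' -> '));
```

Ignoring events that a state does not list keeps the sketch tolerant of late or duplicate notifications, which is one plausible policy among several.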

flowchart TD
    A[Start Operation] --> B{Check Permissions}
    B -->|Denied| C[Show Permission Error]
    B -->|Granted| D{Initialize Device}
    D -->|Success| E[Start Stream]
    D -->|Failure| F[Show Device Error]
    E --> G{Monitor Stream}
    G -->|Error| H{Error Type}
    H -->|Recoverable| I[Attempt Recovery]
    H -->|Fatal| J[Stop Stream]
    I -->|Success| G
    I -->|Failure| J
    J --> K[Cleanup Resources]
    K --> L[Show Error Message]
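
The "Error Type" decision in the flowchart above could be sketched as a mapping from `getUserMedia` error names to a recovery hint and a user-facing message. The error names are the standard `DOMException` names that `getUserMedia` can reject with; the `describeMediaError` function, the recoverable/fatal split, and the message wording are assumptions for illustration only.

```typescript
interface MediaErrorInfo {
  recoverable: boolean; // true: worth an automatic retry; false: stop and clean up
  message: string;      // text suitable for showing to the user
}

// Hypothetical classifier: the grouping below is an assumption, not project policy.
function describeMediaError(errorName: string): MediaErrorInfo {
  switch (errorName) {
    case 'NotAllowedError':
    case 'SecurityError':
      return { recoverable: false, message: 'Camera or microphone permission was denied.' };
    case 'NotFoundError':
      return { recoverable: false, message: 'No matching camera or microphone was found.' };
    case 'NotReadableError':
    case 'AbortError':
      return { recoverable: true, message: 'The device is busy or failed to start. Retrying...' };
    case 'OverconstrainedError':
      return { recoverable: true, message: 'The requested settings are not supported. Retrying with defaults...' };
    default:
      return { recoverable: false, message: 'An unexpected media error occurred.' };
  }
}

console.log(describeMediaError('NotAllowedError').message);
```

In a browser this would typically be called from a `catch` block with `error.name`.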

Browser API Integration

Camera Integration

The application accesses the user's camera through the WebcamService, which wraps navigator.mediaDevices.getUserMedia (the Media Capture and Streams API, commonly grouped with WebRTC). Here's how it works:

Stream Initialization:

const stream = await navigator.mediaDevices.getUserMedia({
  video: {
    width: { ideal: 640 },
    height: { ideal: 480 },
    facingMode: 'user'
  }
});

Frame Capture:

Primary method uses the modern ImageCapture API:

const imageCapture = new ImageCapture(videoTrack);
const blob = await imageCapture.takePhoto();

Fallback to Canvas API if ImageCapture is not supported:

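const canvas = document.createElement('canvas');
// Match the canvas to the video's native size; a new canvas defaults to 300x150
canvas.width = videoElement.videoWidth;
canvas.height = videoElement.videoHeight;
const ctx = canvas.getContext('2d');
ctx?.drawImage(videoElement, 0, 0, canvas.width, canvas.height);
const dataUrl = canvas.toDataURL('image/png');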

Resource Management:
- Proper cleanup of video tracks when stopping the stream
- Automatic resource release when component is destroyed
- Server-side rendering (SSR) safety checks
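
Track cleanup, the first item in the list above, can be illustrated with a small helper. This is a sketch under assumptions: `releaseStream` and the two interfaces are hypothetical names, declared structurally so that a browser `MediaStream` satisfies them and a plain object can stand in for one outside the browser.

```typescript
// Structural stand-ins for MediaStreamTrack and MediaStream (hypothetical names).
interface StoppableTrack {
  stop(): void;
}
interface TrackSource {
  getTracks(): StoppableTrack[];
}

// Stops every track on the stream and reports how many were stopped.
// Accepts null so callers can invoke it unconditionally during teardown.
function releaseStream(stream: TrackSource | null): number {
  if (!stream) {
    return 0;
  }
  const tracks = stream.getTracks();
  tracks.forEach(track => track.stop());
  return tracks.length;
}

// Usage with a fake stream; in the browser, pass the MediaStream from getUserMedia.
let stopped = 0;
const fakeStream: TrackSource = {
  getTracks: () => [{ stop: () => { stopped++; } }, { stop: () => { stopped++; } }],
};
console.log(releaseStream(fakeStream), stopped);
```

Stopping each track is what releases the camera (and turns off its indicator light); dropping the reference to the stream alone does not.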

Audio Recording and Playback

The application uses the Web Audio API and MediaRecorder API for sophisticated audio handling through the AudioService. The service integrates with SignalR for real-time communication and provides reactive state management:

Audio Configuration:

interface AudioConfig {
  sampleRate: number;      // Audio sampling rate
  channels: number;        // Number of audio channels
  startThreshold: number;  // Volume threshold to start recording
  stopThreshold: number;   // Lower threshold to maintain recording
  silenceThreshold: number;  // Time in ms to consider silence
  smoothingTimeConstant: number; // Smoothing factor for analysis
}

// Default configuration
const defaultConfig = {
  sampleRate: 16000,      // 16kHz for optimal speech
  channels: 1,            // Mono audio
  startThreshold: 0.24,   // Start at 24% volume
  stopThreshold: 0.15,    // Keep recording until 15%
  silenceThreshold: 2000, // Stop after 2s silence
  smoothingTimeConstant: 0.8
};
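
Because the two thresholds only work as a pair when the stop threshold sits below the start threshold, it can help to validate overrides before use. The following is a minimal sketch; `resolveAudioConfig` and its validation rules are assumptions layered on top of the interface and defaults shown above, not part of the documented service.

```typescript
interface AudioConfig {
  sampleRate: number;
  channels: number;
  startThreshold: number;
  stopThreshold: number;
  silenceThreshold: number;
  smoothingTimeConstant: number;
}

const defaultAudioConfig: AudioConfig = {
  sampleRate: 16000,
  channels: 1,
  startThreshold: 0.24,
  stopThreshold: 0.15,
  silenceThreshold: 2000,
  smoothingTimeConstant: 0.8,
};

// Hypothetical helper: merges overrides onto the defaults and rejects
// combinations that would break the hysteresis logic.
function resolveAudioConfig(overrides: Partial<AudioConfig> = {}): AudioConfig {
  const config: AudioConfig = { ...defaultAudioConfig, ...overrides };
  for (const level of [config.startThreshold, config.stopThreshold]) {
    if (level < 0 || level > 1) {
      throw new RangeError('Thresholds must be between 0 and 1');
    }
  }
  if (config.stopThreshold >= config.startThreshold) {
    throw new RangeError('stopThreshold must be lower than startThreshold');
  }
  // AnalyserNode.smoothingTimeConstant only accepts values in [0, 1].
  if (config.smoothingTimeConstant < 0 || config.smoothingTimeConstant > 1) {
    throw new RangeError('smoothingTimeConstant must be between 0 and 1');
  }
  if (config.silenceThreshold <= 0) {
    throw new RangeError('silenceThreshold must be a positive duration in ms');
  }
  return config;
}

const quietRoomConfig = resolveAudioConfig({ startThreshold: 0.18, stopThreshold: 0.1 });
console.log(quietRoomConfig.startThreshold, quietRoomConfig.silenceThreshold);
```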

State Management:

interface AudioMonitorState {
  isMonitoring: boolean;    // Audio analysis active
  isRecording: boolean;     // Currently recording
  voiceDetected: boolean;   // Voice detected
  audioLevel: number;       // Current volume (0-1)
}

interface AudioPlaybackState {
  isPlaying: boolean;
  duration: number;
  currentTime: number;
}

// Service provides observables for state
audioService.monitorState$: Observable<AudioMonitorState>
audioService.playbackState$: Observable<AudioPlaybackState>
audioService.isRecording$: Observable<boolean>

Audio Stream Setup:

const constraints: MediaStreamConstraints = {
  audio: {
    sampleRate: config.sampleRate,
    channelCount: config.channels,
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true
  }
};

Voice Detection System:

Uses Web Audio API's AnalyserNode for real-time analysis
Implements a dual-threshold approach with hysteresis
Calculates RMS (Root Mean Square) for natural volume measurement:

// Calculate RMS value
let sum = 0;
let nonZeroCount = 0;
for (let i = 0; i < bufferLength; i++) {
  const value = dataArray[i] / 255;
  if (value > 0) {
    sum += value * value;
    nonZeroCount++;
  }
}
const normalizedLevel = Math.sqrt(sum / (nonZeroCount || bufferLength));

Recording Management:

Automatic start on voice detection
Smart silence detection for auto-stop

// Start recording when voice detected
if (voiceDetected && !currentState.isRecording) {
  this.startRecordingInternal();
  lowVolumeFrames = 0;
}

// Stop after sustained silence
if (lowVolumeFrames >= framesToWait) {
  this.stopRecordingInternal();
}

Audio Playback:

Supports multiple audio formats
Handles base64 encoded audio data
Provides playback controls and state

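async playAudio(base64Audio: string): Promise<void> {
  const audioBlob = this.base64ToBlob(base64Audio);
  const audioUrl = URL.createObjectURL(audioBlob);
  this.currentAudio = new Audio(audioUrl);
  // Release the object URL once playback finishes so the blob can be freed
  this.currentAudio.onended = () => URL.revokeObjectURL(audioUrl);
  await this.currentAudio.play();
}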
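
The `base64ToBlob` helper called above is not shown in this README. One plausible shape for it is sketched below as a standalone function; the default MIME type of `audio/mpeg` and the data-URL prefix handling are assumptions, since the actual audio format returned by the backend is not stated here.

```typescript
// Hypothetical stand-in for the service's base64ToBlob helper.
function base64ToBlob(base64: string, mimeType: string = 'audio/mpeg'): Blob {
  // Strip an optional data-URL prefix such as "data:audio/mpeg;base64,"
  const payload = base64.includes(',') ? base64.slice(base64.indexOf(',') + 1) : base64;
  // atob yields a binary string with one character per byte
  const binary = atob(payload);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Blob([bytes], { type: mimeType });
}

// "AQID" is the base64 encoding of the three bytes 1, 2, 3.
const sampleBlob = base64ToBlob('AQID');
console.log(sampleBlob.size, sampleBlob.type);
```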

Integration with Chat:

// Chat component integration
this.audioService.audioRecorded$
  .pipe(takeUntil(this.destroy$))
  .subscribe(async audioBlob => {
    await this.signalRService.sendAudioMessage(audioBlob);
  });

// Handle playback in chat
async handleAudioPlayback(message: ChatMessage): Promise<void> {
  if (this.isPlayingAudio(message)) {
    await this.audioService.stopPlayback();
  } else {
    await this.audioService.playAudio(message.audioData!);
  }
}

Error Handling:

Comprehensive error states
Automatic recovery attempts
SSR (Server-Side Rendering) safety checks

if (!this.isBrowser) {
  return throwError(() => 
    new Error('Audio capture not available during SSR')
  );
}

The system provides:

Automatic voice-activated recording
Real-time audio level monitoring
Smooth playback experience
Integration with chat interface
Proper resource management
Error resilience and recovery

Voice Activity Detection

The application implements real-time voice detection using Web Audio API's AnalyserNode to monitor audio input and automatically manage recording. The subsections below walk through each stage:

Audio Analysis Setup

The Web Audio API provides a powerful audio processing pipeline through the AnalyserNode. Here's how we set it up:

private setupAudioAnalysis(stream: MediaStream, config: AudioConfig) {
  // Create audio context and analyzer
  this.audioContext = new AudioContext();
  this.analyzer = this.audioContext.createAnalyser();
  
  // Configure for optimal voice detection
  this.analyzer.fftSize = 2048;  // For detailed frequency analysis
  this.analyzer.smoothingTimeConstant = config.smoothingTimeConstant;
  
  // Connect stream to analyzer
  const source = this.audioContext.createMediaStreamSource(stream);
  source.connect(this.analyzer);

  // Start monitoring if enabled
  if (this.monitorState.value.isMonitoring) {
    requestAnimationFrame(this.checkAudioLevel);
  }
}

Why these settings?

fftSize = 2048: This gives us 1024 frequency bins (fftSize/2). At a typical 48 kHz context sample rate each bin spans about 23 Hz, which is fine enough to resolve the fundamental frequency of human speech (roughly 85-255 Hz) and its harmonics while maintaining good performance.
smoothingTimeConstant: Acts like a low-pass filter, smoothing out rapid fluctuations in the audio signal. A value of 0.8 means each new value is weighted at 20%, preventing false triggers from brief spikes.
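
The arithmetic behind both settings can be checked with two small functions. This is an illustrative sketch: `binWidthHz` and `smoothMagnitude` are hypothetical helpers, and the 48 kHz sample rate is an assumed example value (the real rate comes from the AudioContext). The smoothing formula follows the Web Audio definition, where each bin's new value is the time constant times the previous value plus the remainder times the current magnitude.

```typescript
// Width of one frequency bin in Hz for a given sample rate and FFT size.
function binWidthHz(sampleRate: number, fftSize: number): number {
  return sampleRate / fftSize;
}

// One step of the analyser's exponential smoothing for a single bin.
function smoothMagnitude(previous: number, current: number, timeConstant: number): number {
  return timeConstant * previous + (1 - timeConstant) * current;
}

console.log(binWidthHz(48000, 2048)); // 23.4375 Hz per bin

// A one-frame spike from silence only reaches about 20% of its height...
const afterSpike = smoothMagnitude(0, 1, 0.8);
// ...and then decays by 20% per frame once the input is silent again.
const oneFrameLater = smoothMagnitude(afterSpike, 0, 0.8);
console.log(afterSpike.toFixed(2), oneFrameLater.toFixed(2));
```

With the default start threshold of 0.24, a single-frame spike that peaks at 0.2 after smoothing would stay below the trigger level, which is the false-trigger protection described above.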

Volume Detection System

We use RMS (Root Mean Square) calculation for volume measurement. But why RMS instead of a simple average?

// Get frequency data from analyzer
const dataArray = new Uint8Array(bufferLength);
this.analyzer.getByteFrequencyData(dataArray);

// Calculate RMS with improved accuracy
let sum = 0;
let nonZeroCount = 0;
for (let i = 0; i < bufferLength; i++) {
  const value = dataArray[i] / 255;
  if (value > 0) {
    sum += value * value;
    nonZeroCount++;
  }
}
const normalizedLevel = Math.sqrt(sum / (nonZeroCount || bufferLength));

Understanding RMS:

Why RMS? A plain average treats every bin equally, so a few strong speech components get diluted by many quiet bins. RMS tracks perceived loudness more closely because it:
- Emphasizes larger values (squaring)
- Is insensitive to sign, which matters for time-domain samples that swing positive and negative (the byte frequency data used here is already non-negative)
- Correlates better with perceived volume than simple averaging

Implementation Details:
- We normalize values to 0-1 range (/ 255) for consistent thresholds
- We average over non-zero bins only, so empty bins do not drag the level down; if every bin is zero the level is 0
- The square root at the end brings the squared values back to the original 0-1 amplitude scale
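
To see what counting only non-zero bins changes, the calculation can be pulled into a pure function and compared with a conventional RMS over all bins. This is a sketch for illustration; `rmsOfNonZeroBins` and `rmsOfAllBins` are hypothetical names, and the input array stands in for the output of `getByteFrequencyData`.

```typescript
// RMS over non-zero bins only, matching the loop shown above.
function rmsOfNonZeroBins(data: Uint8Array): number {
  let sum = 0;
  let nonZeroCount = 0;
  for (let i = 0; i < data.length; i++) {
    const value = data[i] / 255;
    if (value > 0) {
      sum += value * value;
      nonZeroCount++;
    }
  }
  if (data.length === 0) {
    return 0; // avoid dividing by zero for an empty buffer
  }
  return Math.sqrt(sum / (nonZeroCount || data.length));
}

// Conventional RMS over every bin, for comparison.
function rmsOfAllBins(data: Uint8Array): number {
  if (data.length === 0) {
    return 0;
  }
  let sum = 0;
  for (let i = 0; i < data.length; i++) {
    const value = data[i] / 255;
    sum += value * value;
  }
  return Math.sqrt(sum / data.length);
}

// One full-scale bin among three empty ones.
const sparse = new Uint8Array([255, 0, 0, 0]);
console.log(rmsOfNonZeroBins(sparse), rmsOfAllBins(sparse)); // 1 and 0.5
```

The non-zero variant reports a higher level for narrow-band input, so thresholds tuned for it would not carry over unchanged to a conventional RMS.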

Smart Voice Detection

Our system combines a dual-threshold check with frame counting, so that brief noise does not start a recording and short pauses do not end one:

// Handle frame counting for silence detection
if (currentState.isRecording) {
  if (!voiceDetected) {
    lowVolumeFrames++;
    if (lowVolumeFrames >= framesToWait) {
      this.stopRecordingInternal();
      lowVolumeFrames = 0;
    }
  } else {
    lowVolumeFrames = 0;
  }
}

// Handle recording state
if (voiceDetected && !currentState.isRecording) {
  this.startRecordingInternal();
}

The Dual-Threshold System Explained:

Start Threshold (0.24 or 24%):
- Higher threshold for starting recording
- Prevents false triggers from background noise
- Chosen based on typical speech volume patterns

Stop Threshold (0.15 or 15%):
- Lower threshold for maintaining recording
- Allows natural pauses in speech
- Prevents cutting off quiet syllables

Frame-Based Silence Detection:
- Counts consecutive frames below the stop threshold
- 2-second timeout = 120 frames, assuming requestAnimationFrame fires at 60fps (the frame count scales with the display refresh rate)
- Why 2 seconds? It is long enough to ride out a natural pause in conversation without adding much delay before the reply
- Resets counter when voice is detected again

This creates a "hysteresis" effect:

Volume
    ^
    |    Start Recording
0.24|   ┌───────────────────────────────┐
    |   │     Keep Recording            │
0.15|   │                               │
    |   │                               └─────
    └───┘                               Stop recording
     Time
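
The hysteresis behaviour can be simulated without any browser APIs by feeding a sequence of levels into a small gate. This is a minimal sketch under assumptions: `createVoiceGate`, `framesForSilence`, and the event names are hypothetical, and the gate assumes that while recording a level above the stop threshold counts as voice.

```typescript
interface GateConfig {
  startThreshold: number; // level that must be exceeded to start recording
  stopThreshold: number;  // level that must be exceeded to keep recording
  framesToWait: number;   // consecutive quiet frames before stopping
}

type GateEvent = 'start' | 'stop' | 'none';

// Converts a silence duration into a frame count for a given frame rate.
function framesForSilence(silenceMs: number, framesPerSecond: number): number {
  return Math.ceil((silenceMs / 1000) * framesPerSecond);
}

// Returns an update function that consumes one audio level per frame.
function createVoiceGate(config: GateConfig): (level: number) => GateEvent {
  let recording = false;
  let lowVolumeFrames = 0;
  return (level: number): GateEvent => {
    if (!recording) {
      if (level > config.startThreshold) {
        recording = true;
        lowVolumeFrames = 0;
        return 'start';
      }
      return 'none';
    }
    if (level > config.stopThreshold) {
      lowVolumeFrames = 0; // voice is back, restart the silence count
      return 'none';
    }
    lowVolumeFrames++;
    if (lowVolumeFrames >= config.framesToWait) {
      recording = false;
      lowVolumeFrames = 0;
      return 'stop';
    }
    return 'none';
  };
}

console.log(framesForSilence(2000, 60)); // 120

// A short timeout of 3 frames keeps the trace readable.
const update = createVoiceGate({ startThreshold: 0.24, stopThreshold: 0.15, framesToWait: 3 });
const trace = [0.1, 0.2, 0.3, 0.2, 0.1, 0.1, 0.1].map(level => update(level));
console.log(trace.join(','));
```

In the trace, the level of 0.2 is ignored before recording starts but keeps the recording alive afterwards, which is the gap between the two thresholds at work.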

State Management and Monitoring

The service provides reactive state observables for real-time monitoring:

interface AudioMonitorState {
  isMonitoring: boolean;    // Audio analysis active
  isRecording: boolean;     // Currently recording
  voiceDetected: boolean;   // Voice detected
  audioLevel: number;       // Current volume (0-1)
}

// Components can subscribe to state changes
audioService.monitorState$.subscribe(state => {
  updateUI(state);
});

// Monitor state changes trigger UI updates
private monitorState = new BehaviorSubject<AudioMonitorState>({
  isMonitoring: false,
  isRecording: false,
  voiceDetected: false,
  audioLevel: 0
});

Integration with Chat

The chat component handles audio recording and playback:

import { ChangeDetectorRef, OnDestroy } from '@angular/core';
import { Subject, takeUntil } from 'rxjs';

export class ChatComponent implements OnDestroy {
  isAudioPlaying = false;

  // Emits once on destroy so every subscription below is torn down
  private destroy$ = new Subject<void>();

  constructor(
    private audioService: AudioService,
    private signalRService: SignalRService,
    private cdr: ChangeDetectorRef
  ) {
    // Handle completed recordings
    this.audioService.audioRecorded$
      .pipe(takeUntil(this.destroy$))
      .subscribe(async audioBlob => {
        try {
          // Send audio through SignalR
          await this.signalRService.sendAudioMessage(audioBlob);
        } catch (error) {
          console.error('Error sending audio:', error);
        }
      });

    // Handle audio playback state
    this.audioService.playbackState$
      .pipe(takeUntil(this.destroy$))
      .subscribe(state => {
        this.isAudioPlaying = state.isPlaying;
        this.cdr.markForCheck();
      });
  }

  ngOnDestroy(): void {
    this.destroy$.next();
    this.destroy$.complete();
  }

  // Handle message playback
  async handleAudioPlayback(message: ChatMessage): Promise<void> {
    if (this.isPlayingAudio(message)) {
      await this.audioService.stopPlayback();
    } else {
      await this.audioService.playAudio(message.audioData!);
    }
  }
}

Resource Management

The service implements comprehensive resource cleanup:

private cleanupAudioResources(): void {
  // Stop monitoring
  this.monitorState.next({
    isMonitoring: false,
    isRecording: false,
    voiceDetected: false,
    audioLevel: 0
  });

  // Clean up analyzer
  if (this.analyzer) {
    this.analyzer.disconnect();
    this.analyzer = null;
  }

  // Close audio context (guard against a context that was never created)
  if (this.audioContext && this.audioContext.state !== 'closed') {
    this.audioContext.close();
  }
  this.audioContext = null;

  // Stop recording
  if (this.mediaRecorder?.state === 'recording') {
    this.mediaRecorder.stop();
  }
  this.mediaRecorder = null;

  this.audioChunks = [];
  this.checkAudioLevel = null;
}

Browser Support

Chrome/Edge (recommended)
Firefox
Safari (limited support)

Known Limitations

Audio recording requires explicit user permission
Webcam access requires HTTPS in production
Some browsers may have limited codec support
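
One common way to cope with uneven codec support is to probe a list of candidate MIME types and record with the first one the browser accepts. The sketch below takes the support check as a parameter so it can run anywhere; the candidate list and the `pickRecordingMimeType` name are assumptions, not the formats this project is known to use.

```typescript
// Hypothetical preference order; adjust to whatever the backend can decode.
const candidateMimeTypes: string[] = [
  'audio/webm;codecs=opus',
  'audio/webm',
  'audio/mp4',
  'audio/ogg;codecs=opus',
];

// Returns the first supported type, or undefined to let the browser choose.
function pickRecordingMimeType(
  isSupported: (mimeType: string) => boolean,
  candidates: string[] = candidateMimeTypes
): string | undefined {
  return candidates.find(mimeType => isSupported(mimeType));
}

// In a browser: pickRecordingMimeType(type => MediaRecorder.isTypeSupported(type))
// Here a fake check simulates a browser that only records MP4 audio.
const chosen = pickRecordingMimeType(type => type === 'audio/mp4');
console.log(chosen);
```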

📚 Documentation

Comprehensive documentation is available in the docs/ directory:

Product Overview - Executive summary and value proposition
Architecture Guide - Technical deep-dive and system design
Demo Guide - Interactive demonstration flows
Development Guide - Complete developer setup and contribution guide

🤝 Contributing

Contributions are welcome! This project follows clean architecture principles and modern development practices. Please see the Development Guide for:

Development environment setup
Code architecture patterns
Testing strategies
Pull request process

🏢 About

ServoSkull is developed by Usual Expat Limited as a demonstration platform for advanced AI interaction capabilities. This project showcases the potential of multimodal AI agents and serves as a foundation for future robotics integration.

📄 License

MIT License - see LICENSE file for details.
