Elevator Pitch
EchoSight is an iOS AR navigation assistant designed for visually impaired users, combining cutting-edge technologies to provide real-time obstacle awareness and intelligent voice interaction. The app pairs the iPhone's LiDAR sensor for precise distance measurement with on-device object recognition (a FastViT Core ML model) and a cloud-based vision-language model (Google's Gemini 2.5 Flash Lite) to deliver comprehensive environmental understanding.
With two specialized modes, Obstacle Avoidance for safety-focused navigation and Voice Interaction for natural scene exploration, EchoSight empowers blind and low-vision users to navigate their surroundings with greater confidence and independence.
Developed by: Bobby Tang & Yuan Gao
- Real-time LiDAR Distance Measurement: Continuously scans for forward obstacles and automatically triggers a danger warning when one is closer than 0.5 m
- Object Recognition: On-device object classification using FastViT Core ML model
- Smart Voice Announcements: Automatically speaks detected object name and distance (e.g., "Chair, 0.8 meters")
- Strong Haptic Feedback: Provides 6 intense vibration pulses plus a continuous rumble when approaching obstacles (see the sketch after this list)
- AI Avoidance Suggestions: Automatically invokes the VLM to provide brief avoidance guidance (10 words or fewer) when a warning triggers
- Continuous Speech Recognition: Real-time voice recognition based on Apple Speech Framework (supports on-device processing)
- Natural Language Conversation: Users can freely ask questions (e.g., "What's ahead?" "Is it safe to walk?")
- Environment Description: Automatically describes the current environment when switching to this mode
- Scene-Aware Q&A: Answers user questions by combining camera feed and VLM (Gemini 2.5 Flash Lite)
- Echo Cancellation: Uses voiceChat audio mode to prevent TTS output from interfering with ASR
- Mode Switching: One-tap toggle between obstacle avoidance and voice interaction modes
- Pause/Resume Mechanism: Automatically pauses speech recognition during TTS playback to prevent echo interference
- Cooldown Control: Prevents overly frequent warnings and voice announcements
- Visual + Audio Dual Output: UI display + TTS speech for multi-sensory feedback
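As a rough illustration of the warning haptics described above (six transient pulses followed by a continuous rumble), a Core Haptics sketch might look like the following; the class and method names are hypothetical, not EchoSight's actual API:

```swift
import CoreHaptics

/// Sketch only: plays six sharp pulses followed by a continuous rumble.
final class WarningHaptics {
    private var engine: CHHapticEngine?

    init() {
        guard CHHapticEngine.capabilitiesForHardware().supportsHaptics else { return }
        engine = try? CHHapticEngine()
        try? engine?.start()
    }

    func playObstacleWarning() {
        guard let engine else { return }
        var events: [CHHapticEvent] = []

        // Six intense transient pulses, 0.1 s apart.
        for i in 0..<6 {
            events.append(CHHapticEvent(
                eventType: .hapticTransient,
                parameters: [
                    CHHapticEventParameter(parameterID: .hapticIntensity, value: 1.0),
                    CHHapticEventParameter(parameterID: .hapticSharpness, value: 1.0)
                ],
                relativeTime: Double(i) * 0.1
            ))
        }

        // Continuous rumble after the pulses.
        events.append(CHHapticEvent(
            eventType: .hapticContinuous,
            parameters: [CHHapticEventParameter(parameterID: .hapticIntensity, value: 0.8)],
            relativeTime: 0.7,
            duration: 1.0
        ))

        if let pattern = try? CHHapticPattern(events: events, parameters: []),
           let player = try? engine.makePlayer(with: pattern) {
            try? player.start(atTime: CHHapticTimeImmediate)
        }
    }
}
```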
EchoSight/
├── EchoSightApp.swift              # App entry point
├── ContentView.swift               # Main UI (SwiftUI)
│                                   #   - Mode switching logic
│                                   #   - Speech recognition integration
│                                   #   - VLM interaction handling
│                                   #   - Haptic feedback
├── ARManager.swift                 # AR session management
│                                   #   - LiDAR scene reconstruction
│                                   #   - Multi-point raycast distance measurement
│                                   #   - FastViT object recognition
│                                   #   - Camera frame image extraction
├── VLMService.swift                # VLM service client
│                                   #   - Obstacle avoidance advice
│                                   #   - Natural language scene Q&A
│                                   #   - Environment description
│                                   #   - OpenRouter API integration
├── SpeechRecognitionService.swift  # Speech recognition service
│                                   #   - Continuous voice listening
│                                   #   - Silence detection (3 s timeout)
│                                   #   - Pause/resume mechanism
│                                   #   - Echo cancellation configuration
├── Config.swift                    # Configuration management
│                                   #   - API key loading
│                                   #   - Reads from Secrets.plist
└── FastViTMA36F16.mlpackage/       # FastViT Core ML model
                                    #   - On-device object recognition
                                    #   - 1000+ category support
- Device: iPhone 12 Pro or newer (LiDAR required)
- OS: iOS 15.0+
- Development: Xcode 14.0+
- Language: Swift 5.9+
- Permissions:
- Camera access
- Microphone access
- Speech recognition
git clone https://github.com/sakura657/EchoSight.git
cd EchoSight
open EchoSight.xcodeproj

The project has pre-configured permission descriptions in Info.plist:
- `NSCameraUsageDescription` - For AR scene reconstruction and object recognition
- `NSMicrophoneUsageDescription` - For speech recognition
- `NSSpeechRecognitionUsageDescription` - For the speech recognition service
Create an `EchoSight/Secrets.plist` file (NOT committed to Git):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>OPENROUTER_API_KEY</key>
<string>your_openrouter_api_key_here</string>
</dict>
</plist>

Tip: The app searches for API keys in this priority order:
`Secrets.plist` → `Info.plist` → Environment variables
See `Config.swift` for implementation details.
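As a rough illustration of that lookup order, a minimal sketch of the key loading might look like this; the actual `Config.swift` may differ in naming and detail:

```swift
import Foundation

/// Sketch only: Secrets.plist → Info.plist → environment variable.
enum ConfigSketch {
    static var openRouterAPIKey: String? {
        // 1. Secrets.plist bundled with the app (not committed to Git).
        if let url = Bundle.main.url(forResource: "Secrets", withExtension: "plist"),
           let data = try? Data(contentsOf: url),
           let dict = (try? PropertyListSerialization.propertyList(from: data, format: nil)) as? [String: Any],
           let key = dict["OPENROUTER_API_KEY"] as? String, !key.isEmpty {
            return key
        }
        // 2. Info.plist entry.
        if let key = Bundle.main.object(forInfoDictionaryKey: "OPENROUTER_API_KEY") as? String, !key.isEmpty {
            return key
        }
        // 3. Environment variable (handy when running from Xcode).
        return ProcessInfo.processInfo.environment["OPENROUTER_API_KEY"]
    }
}
```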
- Visit OpenRouter
- Sign up and obtain an API key
- Add the key to the `Secrets.plist` file above
- Select a LiDAR-capable device (iPhone 12 Pro or newer)
- Press `Cmd+R` to build and run
- Grant all permissions on first launch
- Launch App: The app automatically starts AR + LiDAR scanning and voice listening
- Mode Switching: Tap the top toggle button to switch between two modes
- Orange Icon: Obstacle Avoidance Mode
- Blue Icon: Voice Interaction Mode
- Real-time Distance: Bottom banner shows the current distance (in meters)
- Danger Warning: When distance < 0.5 m, displays a red warning plus strong vibration
- Object Recognition: Top area shows the recognized object name
- Voice Announcements: Automatically speaks the object and distance (5 s cooldown)
- AI Suggestions: Automatically provides brief avoidance guidance when a warning triggers
- Free Questioning: Simply speak your question, no button needed (auto-sends after 3 s of silence)
- Environment Description: Automatically describes the environment when switching to this mode
- Conversation Examples:
- "What's ahead?"
- "Is it safe to walk?"
- "Describe my surroundings"
- "How far is the obstacle?"
- Real-time Feedback:
- Green box shows speech being recognized
- Blue box shows AI response
- AI response is automatically read aloud
- Battery Consumption: LiDAR and continuous speech recognition consume significant power
- Network Required: Voice Interaction mode requires an internet connection (for the VLM API)
- Best Usage: Works best in well-lit environments with rich visual features
- Echo Handling: Speech recognition automatically pauses during TTS playback to avoid interference
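As a rough illustration of that echo handling, recognition can be paused around TTS playback via the synthesizer delegate, with the audio session in `voiceChat` mode; `pauseListening()` and `resumeListening()` are hypothetical stand-ins for the app's actual service calls:

```swift
import AVFoundation

/// Sketch only: pause ASR while TTS speaks, resume when it finishes.
final class EchoGuard: NSObject, AVSpeechSynthesizerDelegate {
    let synthesizer = AVSpeechSynthesizer()

    override init() {
        super.init()
        synthesizer.delegate = self
        // .voiceChat enables system-level echo cancellation on the audio path.
        try? AVAudioSession.sharedInstance().setCategory(.playAndRecord,
                                                         mode: .voiceChat,
                                                         options: [.defaultToSpeaker])
        try? AVAudioSession.sharedInstance().setActive(true)
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           didStart utterance: AVSpeechUtterance) {
        pauseListening()   // stop feeding microphone audio to the recognizer
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           didFinish utterance: AVSpeechUtterance) {
        resumeListening()  // restart recognition once TTS output has ended
    }

    private func pauseListening()  { /* hypothetical: SpeechRecognitionService.pause() */ }
    private func resumeListening() { /* hypothetical: SpeechRecognitionService.resume() */ }
}
```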
Do NOT commit secrets: `.gitignore` is configured to exclude:
- `EchoSight/Secrets.plist`
- `.env*`
- Large model weight files
- Core ML compiled products
- Store all API keys in `Secrets.plist`
- Never hardcode sensitive information in code
- Rotate API keys regularly
- Avoid committing large binary files (model weights should be downloaded externally or use Git LFS)
- FastViT Throttling: Runs every 1 second or 60 frames to avoid impacting speech recognition performance (see the sketch after this list)
- Image Update Limiting: VLM snapshots update every 0.5 seconds to reduce CPU load
- CIContext Reuse: Uses hardware-accelerated CIContext uniformly for image processing
- Background Thread Processing: Vision requests and image conversions execute on background threads
- Autorelease Pools: Uses `autoreleasepool` to prevent memory spikes from ARFrame retention
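A minimal sketch of the throttling and autorelease-pool handling described above, assuming an `ARSessionDelegate` and a hypothetical `classify(_:)` helper that runs the Vision/Core ML request:

```swift
import ARKit

/// Sketch only: classify at most ~once per second and release frame buffers promptly.
final class FrameThrottler: NSObject, ARSessionDelegate {
    private var frameCount = 0
    private var lastRun = Date.distantPast
    private let visionQueue = DispatchQueue(label: "echosight.vision", qos: .userInitiated)

    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        frameCount += 1
        // Run at most once per second (roughly every 60 frames at 60 fps).
        guard frameCount % 60 == 0 || Date().timeIntervalSince(lastRun) >= 1.0 else { return }
        lastRun = Date()

        let pixelBuffer = frame.capturedImage
        visionQueue.async {
            // autoreleasepool keeps ARFrame-derived buffers from piling up.
            autoreleasepool {
                self.classify(pixelBuffer)
            }
        }
    }

    private func classify(_ pixelBuffer: CVPixelBuffer) {
        // Hypothetical placeholder: run the FastViT VNCoreMLRequest here.
    }
}
```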
- On-Device First: iOS 13+ prioritizes on-device speech recognition to reduce latency
- Echo Cancellation: Uses the `voiceChat` audio mode for system-level echo handling
- Silence Detection: Auto-sends the query after 3 seconds of silence
- Minimum Word Filter: Requires at least 2 words to process, reducing false recognitions
- Contextual Hints: Provides common phrase list to improve recognition accuracy
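A minimal sketch of these recognition settings (on-device processing where supported, contextual hints, the 3-second silence window, and the 2-word filter); the names and phrases are illustrative and the actual `SpeechRecognitionService.swift` may differ:

```swift
import Speech

/// Sketch only: request configuration and silence-window handling.
final class RecognitionSketch {
    private let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
    private var silenceTimer: Timer?

    func makeRequest() -> SFSpeechAudioBufferRecognitionRequest {
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true
        if recognizer?.supportsOnDeviceRecognition == true {
            request.requiresOnDeviceRecognition = true // lower latency, works offline (iOS 13+)
        }
        // Common phrases bias the recognizer toward expected questions.
        request.contextualStrings = ["What's ahead", "Is it safe to walk", "Describe my surroundings"]
        return request
    }

    /// Call with each partial transcription; restarts the 3-second silence window.
    func handlePartial(_ transcription: String, onFinal: @escaping (String) -> Void) {
        silenceTimer?.invalidate()
        silenceTimer = Timer.scheduledTimer(withTimeInterval: 3.0, repeats: false) { _ in
            // Require at least 2 words to filter out spurious recognitions.
            if transcription.split(separator: " ").count >= 2 {
                onFinal(transcription)
            }
        }
    }
}
```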
- Model: Defaults to `google/gemini-2.5-flash-lite-preview-09-2025`
- Image Compression: Auto-compresses images to under 15 MB, balancing quality and speed
- Prompt Engineering: Optimized prompts for visually impaired users, requesting concise replies (10-20 words)
- Custom Model: Modify the `model` constant in `VLMService.swift`
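A minimal sketch of a scene question sent through OpenRouter's OpenAI-compatible chat completions endpoint with an attached camera frame; the prompt text, compression quality, and function name are illustrative, and `VLMService.swift` is the authoritative implementation:

```swift
import Foundation
import UIKit

/// Sketch only: send a question plus a JPEG frame to OpenRouter and return the raw response.
func askScene(question: String, frame: UIImage, apiKey: String) async throws -> Data {
    // Compress the frame to keep the payload small.
    let jpeg = frame.jpegData(compressionQuality: 0.6) ?? Data()
    let body: [String: Any] = [
        "model": "google/gemini-2.5-flash-lite-preview-09-2025",
        "messages": [[
            "role": "user",
            "content": [
                ["type": "text",
                 "text": "Answer in 10-20 words for a visually impaired user: \(question)"],
                ["type": "image_url",
                 "image_url": ["url": "data:image/jpeg;base64,\(jpeg.base64EncodedString())"]]
            ]
        ]]
    ]

    var request = URLRequest(url: URL(string: "https://openrouter.ai/api/v1/chat/completions")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: body)

    let (data, _) = try await URLSession.shared.data(for: request)
    return data // decode choices[0].message.content from the JSON response
}
```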
- Multi-Point Raycasting: Uses 3 screen points for raycasting to improve detection reliability (see the sketch after this list)
- Ground Detection: Detects when the device is pointed down at the ground to avoid false alarms
- Mesh Reconstruction: Uses LiDAR scene mesh reconstruction for improved accuracy
- Plane Detection: Enables both horizontal and vertical plane detection
- Depth Semantics: Enables sceneDepth on supported devices for enhanced scene understanding
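A minimal sketch of the multi-point raycast against the reconstructed scene, assuming a RealityKit `ARView`; the screen-point layout and the floor-filtering heuristic are illustrative, not the exact logic in `ARManager.swift`:

```swift
import ARKit
import RealityKit
import simd

/// Sketch only: raycast from three screen points and return the nearest hit distance.
func nearestObstacleDistance(in arView: ARView) -> Float? {
    guard let cameraTransform = arView.session.currentFrame?.camera.transform else { return nil }
    let cameraPosition = SIMD3<Float>(cameraTransform.columns.3.x,
                                      cameraTransform.columns.3.y,
                                      cameraTransform.columns.3.z)
    // Center point plus left/right offsets, to tolerate narrow obstacles.
    let size = arView.bounds.size
    let points = [CGPoint(x: size.width * 0.5, y: size.height * 0.5),
                  CGPoint(x: size.width * 0.3, y: size.height * 0.5),
                  CGPoint(x: size.width * 0.7, y: size.height * 0.5)]

    var distances: [Float] = []
    for point in points {
        // .estimatedPlane with .any alignment hits detected planes and estimated geometry.
        if let result = arView.raycast(from: point, allowing: .estimatedPlane, alignment: .any).first {
            let hit = SIMD3<Float>(result.worldTransform.columns.3.x,
                                   result.worldTransform.columns.3.y,
                                   result.worldTransform.columns.3.z)
            // Skip hits well below the camera, which are usually the floor.
            if hit.y > cameraPosition.y - 1.0 {
                distances.append(simd_distance(cameraPosition, hit))
            }
        }
    }
    return distances.min()
}
```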
| Component | Technology | Purpose |
|---|---|---|
| UI Framework | SwiftUI | Declarative interface building |
| AR Engine | ARKit + RealityKit | LiDAR scanning and scene reconstruction |
| Object Recognition | Core ML (FastViT) | On-device real-time object classification |
| Speech Recognition | Speech Framework | Continuous speech-to-text |
| Speech Synthesis | AVSpeechSynthesizer | Text-to-speech announcements |
| Haptic Feedback | Core Haptics | Strong vibration warnings |
| VLM API | OpenRouter (Gemini 2.5 Flash Lite) | Visual Q&A and environment understanding |
| Configuration | PropertyList | Secure API key storage |
Q: Speech recognition not working?
- Check if microphone permission is granted
- Ensure system speech recognition permission is enabled
- Try restarting the app
- Check if TTS is currently playing (ASR pauses automatically)
Q: VLM response errors?
- Verify the API key in `Secrets.plist` is correct
- Ensure the network connection is stable
- Check OpenRouter account balance
- Review Xcode console logs for detailed errors
Q: LiDAR distance measurements inaccurate?
- Confirm device supports LiDAR (iPhone 12 Pro+)
- Avoid use in extremely bright or completely dark environments
- Ensure camera lens is clean
- Wait for AR tracking state to reach "Tracking OK"
Q: Object recognition inaccurate?
- The FastViT model is trained on 1000+ categories; some objects may be recognized as a similar category
- Ensure objects are in appropriate lighting and distance range
- Avoid rapid device movement (it triggers ARKit's excessiveMotion tracking limitation)
For questions or suggestions, please open an issue.
Committed to making technology more inclusive and empowering visually impaired users to explore the world with greater confidence.

