A fully autonomous voice-controlled robot powered by AI. Say a command, and the robot uses computer vision to locate and chase the target in real time.
- 🎙️ Wake Word Detection — Say "Hey Robo, track the cat" to activate (no button press needed).
- 🛑 Global Stop Command — Say "Stop" at any time to immediately halt the robot.
- 👁️ Real-time Object Tracking — Powered by YOLOv8n running on a Mac with Apple Silicon (MPS).
- 🏎️ Proportional Pursuit Controller — The robot steers and drives simultaneously, following curved paths to chase moving targets.
- 🔊 Two-Way Voice — The robot speaks back using text-to-speech via the on-board speaker.
- 🌐 ROS 2 Backbone — All inter-device communication runs over ROS 2 via `rosbridge`.
```
┌──────────────────── RASPBERRY PI 5 ─────────────────────┐
│                                                         │
│  pi_camera.py     → publishes /image_raw/compressed     │
│  pi_audio_node.py → publishes /audio_raw                │
│                   → subscribes /robot_voice (speaker)   │
│  arduino_bridge   → subscribes /robot_commands          │
│        │                                                │
│  Arduino Nano ─── Servo Motors (L/R wheels)             │
└─────────────────────────────────────────────────────────┘
                            │  WiFi / ROS 2 rosbridge (port 9090)
┌──────────────────────── MAC ────────────────────────────┐
│                                                         │
│  robot_agent.py                                         │
│   ├── VAD + Whisper (STT) — listens for wake word       │
│   ├── Ollama / gemma3:1b  — extracts target object      │
│   ├── YOLOv8n + OpenCV    — tracks object in frame      │
│   ├── P-Controller        — sends motor commands        │
│   └── macOS TTS (say)     — speaks responses to Pi      │
└─────────────────────────────────────────────────────────┘
```
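All Mac → Pi traffic in the diagram crosses the rosbridge websocket on port 9090, so the Mac side can reach the Pi with any rosbridge client. A minimal sketch using the `roslibpy` client; note the `std_msgs/String` payload format for `/robot_commands` is an assumption here, and the real format is whatever `arduino_bridge.py` expects:

```python
# Sketch of the Mac -> Pi command path over rosbridge (port 9090).
# The topic name /robot_commands comes from the diagram above; using
# std_msgs/String with raw motor-command payloads is an assumption.

def drive_message(command: str) -> dict:
    """Wrap a raw motor command as a std_msgs/String payload."""
    return {"data": command}

def connect_and_publish(pi_ip: str, command: str) -> None:
    """Connect to rosbridge on the Pi and publish one motor command."""
    import roslibpy  # pip install roslibpy

    ros = roslibpy.Ros(host=pi_ip, port=9090)
    ros.run()  # blocks until the websocket connection is up
    topic = roslibpy.Topic(ros, "/robot_commands", "std_msgs/String")
    topic.publish(roslibpy.Message(drive_message(command)))
    ros.terminate()
```

For example, `connect_and_publish("192.168.x.x", "S")` would issue an emergency stop from any machine on the same network.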
```
Autonomous-robot/
│
├── ai_agent/
│   ├── robot_agent.py         # Main AI agent (run on Mac)
│   ├── vision_test.py         # Standalone vision & motor debug script
│   └── requirements.txt
│
├── raspberry_pi/
│   ├── pi_camera.py           # ROS 2 camera node (GStreamer → /image_raw)
│   ├── pi_audio_node.py       # ROS 2 audio node (Mic → /audio_raw, /robot_voice → Speaker)
│   ├── requirements.txt
│   └── ros2_nodes/
│       └── motor_control/     # ROS 2 Python package
│           ├── motor_control/
│           │   └── arduino_bridge.py   # /robot_commands → Serial → Arduino
│           ├── package.xml
│           ├── setup.py
│           └── setup.cfg
│
├── arduino/
│   └── motor_firmware/
│       └── motor_firmware.ino # Arduino Nano servo controller firmware
│
└── assets/
    └── cat_chaser_2.mp4       # Demo video: robot chasing a cat
```
Hardware:
- Raspberry Pi 5
- Raspberry Pi Camera Module (libcamera compatible)
- I2S Microphone Array (e.g. Google Voice HAT, `googlevoicehat-soundcard` overlay)
- I2S Amplifier + Speaker (e.g. MAX98357A)
- Arduino Nano + 2× Continuous Rotation Servos
Software:
- ROS 2 Jazzy (on Pi) + `rosbridge_suite`
- Python 3.11+
- Ollama with `gemma3:1b` model (on Mac)
Open `arduino/motor_firmware/motor_firmware.ino` in the Arduino IDE and upload it to your Nano.
Motor control protocol over serial (115200 baud):
| Command  | Meaning                 |
|---|---|
| `L<val>` | Set left servo (0–180)  |
| `R<val>` | Set right servo (0–180) |
| `S`      | Stop both motors        |
```bash
# Install system dependencies
sudo apt install ros-jazzy-rosbridge-suite python3-pyaudio

# Install Python dependencies
pip install -r raspberry_pi/requirements.txt

# Enable the I2S sound card: add to /boot/firmware/config.txt
# dtoverlay=googlevoicehat-soundcard
```
```bash
# Start rosbridge
ros2 launch rosbridge_server rosbridge_websocket_launch.xml

# Start camera node
python3 raspberry_pi/pi_camera.py

# Start audio node
python3 raspberry_pi/pi_audio_node.py

# Start motor bridge (in your ROS 2 workspace)
ros2 run motor_control arduino_bridge
```

```bash
# Create a virtual environment
python3 -m venv venv && source venv/bin/activate

# Install dependencies
pip install -r ai_agent/requirements.txt

# Pull the LLM model
ollama pull gemma3:1b

# Set the Pi's IP address in robot_agent.py
# PI_IP = '192.168.x.x'

# Run!
python ai_agent/robot_agent.py
```

| Say... | Effect |
|---|---|
| "Hey Robo, track the cat" | Robot starts chasing the cat |
| "Hey Robo, track the person" | Robot starts following a person |
| "Stop" | Robot immediately halts (always active) |
Trackable objects: person, cat, dog, bottle, cup, backpack, laptop, phone, ball, and more.
| Parameter | Default | Description |
|---|---|---|
| `PI_IP` | `'192.168.x.x'` | Raspberry Pi's IP address |
| `OLLAMA_MODEL` | `'gemma3:1b'` | Ollama model for command parsing |
| `ENERGY_THRESH` | `0.02` | Microphone sensitivity for VAD |
| `FAR_PX` | `800` | Bounding box width (px) to start moving forward |
| `CLOSE_PX` | `1500` | Bounding box width (px) to stop/back up |
| `CONFIDENCE` | `0.4` | YOLO detection confidence threshold |
The robot tracking and chasing a cat around the room: