A fully local, enterprise-grade AI coding assistant running on HP ZGX Nano hardware with NVIDIA's Nemotron 3 Nano model.
This repository contains setup scripts and demo materials for deploying a local AI coding assistant using:
| Component | Role | Key Specs |
|---|---|---|
| HP ZGX Nano | Hardware Platform | NVIDIA GB10 Grace Blackwell, ~120GB unified VRAM, CUDA 13.0 |
| Nemotron 3 Nano | LLM Backend | 30B params (3B active), 128K context, hybrid Mamba-Transformer MoE |
| vLLM (Docker) | Inference Server | High-throughput serving, OpenAI-compatible API |
| OpenCode | Coding Interface | Open-source TUI, agentic coding, tool calling |
Value Proposition: Zero cloud dependency, zero per-token costs, full data sovereignty, enterprise-grade performance.
| Metric | Value |
|---|---|
| Prompt Throughput | 1,000-3,500 tokens/s |
| Generation Throughput | 25-57 tokens/s |
| VRAM Usage | ~11-12GB (FP4 quantization) |
| Context Window | 128K tokens |
| Time to First Token | <1s |
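Throughput varies with prompt length and reasoning mode. For a quick local sanity check, you can time a short completion against the OpenAI-compatible endpoint. A minimal sketch; the model id below is a placeholder, so substitute whatever `curl http://localhost:8000/v1/models` reports on your machine:

```bash
# Rough latency sanity check against the local vLLM endpoint.
# The model id is a placeholder -- replace it with the id reported by /v1/models.
time curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B",
       "prompt": "Write a one-line docstring for a function that reverses a string.",
       "max_tokens": 64}'
```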
```bash
chmod +x prereq-check-and-setup.sh
./prereq-check-and-setup.sh
```

This script will:
- Verify NVIDIA GPU and drivers
- Install Docker if not present
- Add current user to docker group
- Install NVIDIA Container Toolkit if not present
- Install OpenCode if not present
- Create the OpenCode configuration file (an illustrative sketch of this file follows below)
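For reference, the generated configuration points OpenCode at the local vLLM endpoint. The sketch below is illustrative only: the provider and model names are assumptions based on OpenCode's custom-provider format, and the file the script actually writes is authoritative.

```bash
# Illustrative only -- the setup script writes the real file.
# Provider/model ids below are assumptions; match them to what
# `curl http://localhost:8000/v1/models` reports.
mkdir -p ~/.config/opencode
cat > ~/.config/opencode/opencode.json <<'EOF'
{
  "provider": {
    "vllm": {
      "npm": "@ai-sdk/openai-compatible",
      "options": { "baseURL": "http://localhost:8000/v1" },
      "models": {
        "nvidia/NVIDIA-Nemotron-3-Nano-30B": {
          "name": "NVIDIA Nemotron 3 Nano 30B"
        }
      }
    }
  }
}
EOF
```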
```bash
chmod +x start-nemotron-server.sh
./start-nemotron-server.sh
```

Wait for the message: `Uvicorn running on http://0.0.0.0:8000`
Note: First startup downloads the model (~6-8GB) and takes a few minutes. Subsequent startups take ~1-2 minutes.
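If you script around the server (CI, demos), a small readiness loop can block until vLLM answers. A minimal sketch, assuming the default port 8000:

```bash
# Poll the OpenAI-compatible endpoint until the server responds.
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "Waiting for vLLM server..."
  sleep 5
done
echo "Server is up."
```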
In a new terminal:
```bash
cd ~/your-project-directory
opencode
```

Type `/models` and select NVIDIA Nemotron 3 Nano 30B.
```
├── prereq-check-and-setup.sh       # Prerequisites and installation script
├── start-nemotron-server.sh        # vLLM server startup script
├── quick-test-opencode-prompts.txt # Test prompts for verification
├── multifile-analysis-prompt.txt   # Demo scenario for multi-file analysis
└── README.md
```
After setup, verify the system is working with these prompts in OpenCode:
```
# Test 1: Basic code generation
Create a Python class for managing a simple todo list with add, remove, and list methods.

# Test 2: File operations
Read the contents of README.md and summarize it.

# Test 3: Reasoning mode
Analyze this codebase and suggest three improvements for maintainability.
```
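You can also drive Test 1 without the TUI by calling the chat endpoint directly. In this sketch the model id is a placeholder for whatever `/v1/models` reports:

```bash
# Test 1 without OpenCode: ask the local server directly.
# Replace the model id with the one your server reports.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B",
       "messages": [{"role": "user", "content": "Create a Python class for managing a simple todo list with add, remove, and list methods."}],
       "max_tokens": 512}'
```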
```
┌─────────────────────────────────────────────────────────────┐
│                         HP ZGX Nano                         │
│                   (NVIDIA GB10 Superchip)                   │
│                                                             │
│  ┌─────────────────┐      ┌──────────────────────────────┐  │
│  │    OpenCode     │─────▶│     vLLM Server (Docker)     │  │
│  │  (TUI Client)   │      │          Port 8000           │  │
│  │                 │◀─────│                              │  │
│  └─────────────────┘      │  ┌────────────────────────┐  │  │
│                           │  │    Nemotron 3 Nano     │  │  │
│                           │  │   (FP4 Quantization)   │  │  │
│                           │  └────────────────────────┘  │  │
│                           └──────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```
```bash
# Start the server
./start-nemotron-server.sh

# Stop the server (note: this stops every running container)
sudo docker stop $(sudo docker ps -q)

# View server logs
sudo docker logs -f $(sudo docker ps -q)

# Monitor GPU usage
watch -n 1 nvidia-smi

# Check model endpoint
curl http://localhost:8000/v1/models
```

```bash
# Verify server is running
curl http://localhost:8000/v1/models
```
```bash
# Check config file
cat ~/.config/opencode/opencode.json
```

The model name in the OpenCode config must exactly match what vLLM reports. Verify with:

```bash
curl http://localhost:8000/v1/models
```

A slow first response is normal; the first request warms up the cache. Subsequent requests are faster (2-20 seconds with reasoning).
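One way to compare the two directly is to print the id(s) vLLM serves and grep the config for the entries OpenCode will use. A sketch; python3 is used instead of jq in case jq is not installed:

```bash
# Print the model id(s) the server is serving...
curl -s http://localhost:8000/v1/models | \
  python3 -c 'import json,sys; [print(m["id"]) for m in json.load(sys.stdin)["data"]]'

# ...and compare against the ids referenced in the OpenCode config.
grep -i nemotron ~/.config/opencode/opencode.json
```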
```bash
# Either use sudo, or ensure you're in the docker group
sudo docker ps

# If not in docker group, add yourself and re-login
sudo usermod -aG docker $USER
```

This project uses:
- Nemotron 3 Nano: NVIDIA Open Model License
- vLLM: Apache 2.0
- OpenCode: MIT
Curtis Burkhalter, Ph.D., Technical Product Marketing Manager — AI Developer Tools & HP ZGX Nano; curtis.burkhalter@hp.com or curtisburkhalter@gmail.com