A beautiful, native macOS application for generating AI videos with synchronized audio from text prompts, powered by the LTX-2 model running on Apple Silicon with MLX.
- Native macOS App - Built with SwiftUI for a seamless Mac experience
- Apple Silicon Native - Uses MLX framework for optimal performance on M-series chips
- Text-to-Video Generation - Transform text prompts into video clips
- Image-to-Video - Animate images into videos
- Built-in Audio Generation - Available model variants generate synchronized audio with video automatically
- Voiceover Narration - Add TTS voiceover using ElevenLabs (cloud) or MLX-Audio (local)
- Background Music - Generate instrumental music with 54 genre presets via ElevenLabs Music API
- Auto Package Installer - Missing Python packages are detected and can be installed with one click
- Generation Queue - Queue multiple generations with real-time progress tracking
- History Management - Browse, preview, and manage all your generated videos
- Presets - Save and load generation parameter presets
- Customizable Parameters - Fine-tune resolution, frames, steps, guidance scale, and more
- macOS 14.0 or later
- Apple Silicon Mac (M1, M2, M3, M4 series)
- 32GB RAM minimum (64GB+ recommended for higher resolutions)
- Python 3.10+ installed (via Homebrew, pyenv, or system)
- ~20-42GB disk space for model weights (depends on selected model)
Download the latest release from the Releases page.
- Open LTX Video Generator
- Go to Preferences (⌘,)
- Click Auto Detect to find your Python installation, or manually set the path
- Click Validate Setup - the app will check for required packages
If packages are missing, the app will show an "Install Missing Packages" button. Click it to automatically install:

```
mlx mlx-vlm mlx-video-with-audio transformers safetensors huggingface_hub numpy opencv-python tqdm
```

Or install manually:

```shell
pip install mlx mlx-vlm mlx-video-with-audio transformers safetensors huggingface_hub numpy opencv-python tqdm
```

The mlx-video-with-audio package is available on PyPI and provides the unified audio-video generation.
Important: On first generation, the app downloads your selected model from Hugging Face. This is a one-time download that may take 15-30 minutes depending on model size and internet connection.
The model is cached in ~/.cache/huggingface/ and will not be re-downloaded on subsequent runs.
Progress is shown in the app during download.
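If you want to check from a script whether the weights are already cached, this minimal sketch assumes the standard Hugging Face hub cache layout, where each repo lives under `models--{org}--{name}` inside `~/.cache/huggingface/hub`:

```python
from pathlib import Path


def hf_cache_dir(repo_id: str, cache_root: str = "~/.cache/huggingface/hub") -> Path:
    """Return the expected local cache directory for a Hugging Face model repo.

    The hub cache stores each repo under models--{org}--{name}.
    """
    return Path(cache_root).expanduser() / ("models--" + repo_id.replace("/", "--"))


# Example: where the LTX-2 unified weights would be cached
model_dir = hf_cache_dir("notapalindrome/ltx2-mlx-av")
print(model_dir.name)      # models--notapalindrome--ltx2-mlx-av
print(model_dir.exists())  # True once the first generation has downloaded the weights
```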
Available models:
- LTX-2 Unified (`notapalindrome/ltx2-mlx-av`, ~42GB)
- LTX-2.3 Distilled Q4 (`dgrauet/ltx-2.3-mlx-distilled-q4`, ~19.4GB)
- Enter a descriptive prompt in the text field
- Adjust parameters using presets or manual controls
- Click Generate to start
- Watch progress in the Queue sidebar
- Find completed videos in your configured output directory (default: Application Support)
When enabled in Settings > Generation, Gemma rewrites your prompt before generation—expanding short descriptions into detailed, LTX-2–optimized prompts with visuals, audio, camera movement, and style. Use the Preview enhanced prompt button to see the rewritten prompt before generating.
Note: This enhancer is optional.
The core text encoder used for generation embeddings is still required even when prompt enhancement is off.
- Go to Settings > Generation
- Turn on Enable Gemma Prompt Enhancement
- First run downloads the Gemma enhancer (~7GB)
- In the prompt view, expand Prompt Enhancement (Gemma) and adjust sliders (Repetition Penalty, Top-P) if desired
- Click Preview enhanced prompt to see the enhanced version before generating
- Generate as usual—the enhanced prompt is used automatically
If enhancement fails for any reason, generation automatically falls back to your original prompt.
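That fallback behavior can be pictured as a small wrapper; this is an illustrative sketch, where `enhance` is a hypothetical stand-in for the Gemma call, not the app's actual API:

```python
from typing import Callable, Optional


def resolve_prompt(prompt: str, enhance: Optional[Callable[[str], str]]) -> str:
    """Use the enhanced prompt when enhancement succeeds, else the original."""
    if enhance is None:
        return prompt
    try:
        enhanced = enhance(prompt)
        # An empty result is treated as a failure too.
        return enhanced if enhanced.strip() else prompt
    except Exception:
        # Any enhancement failure falls back to the user's original prompt.
        return prompt


def broken_enhancer(prompt: str) -> str:
    raise RuntimeError("model not loaded")


print(resolve_prompt("a river at dawn", broken_enhancer))  # a river at dawn
```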
- Be descriptive: "A river flowing through a misty forest at dawn" works better than "river forest"
- Use camera directions: "The camera slowly pans across..."
- Specify lighting: "golden hour lighting", "dramatic shadows"
- Include motion: "waves crashing", "leaves falling"
For more detailed, copy-paste-ready prompts, see Example Prompts.
Both available model variants generate synchronized audio alongside video automatically. No additional configuration is needed - just generate, and your video will have audio.
For best speech/lip-sync alignment, use 24 FPS.
You can still layer additional voiceover or background music on top of the built-in audio if desired.
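As a quick rule of thumb, clip duration is simply frame count divided by frame rate. This illustrative snippet uses the frame counts recommended later in this README:

```python
def clip_duration(frames: int, fps: int = 24) -> float:
    """Duration in seconds of a clip with the given frame count and frame rate."""
    return frames / fps


# Recommended frame counts at the 24 FPS suggested for speech alignment
for frames in (25, 33, 49):
    print(f"{frames} frames @ 24 FPS = {clip_duration(frames):.2f}s")
```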
Add text-to-speech voiceover to your videos:
- Expand Voiceover / Narration in the generation view
- Choose your source: MLX-Audio (local, free) or ElevenLabs (cloud, requires API key)
- Select a voice from the dropdown
- Enter your narration text
- Audio generates with your video or can be added later from History
Add AI-generated instrumental music (requires ElevenLabs API key):
- Expand Background Music in the generation view
- Toggle Generate background music
- Choose from 54 genre presets:
- Electronic: EDM, House, Techno, Ambient, Synthwave, etc.
- Hip-Hop/R&B: Trap, Lo-Fi, Boom Bap, Soul, etc.
- Rock: Classic, Alternative, Indie, Metal, etc.
- Pop: Modern, Indie, Dance, Acoustic
- Jazz/Blues: Smooth Jazz, Bebop, Lounge, Blues
- Classical/Cinematic: Orchestral, Piano, Epic, Tense, Uplifting
- World: Latin, Reggae, Afrobeat, Middle Eastern, Asian
- Country/Folk: Modern, Classic, Acoustic, Indie
- Functional: Corporate, Motivational, Relaxing, Suspense, Action, Romantic, etc.
Music automatically matches your video length and is mixed at background volume (30%) or ducked further (20%) when combined with voiceover.
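The mix levels above reduce to a simple rule. A minimal sketch of that logic (the 0.30 and 0.20 gains come from this README; the function names are illustrative, not the app's code):

```python
def music_gain(has_voiceover: bool) -> float:
    """Background-music gain: 30% normally, ducked to 20% under voiceover."""
    return 0.20 if has_voiceover else 0.30


def mix_sample(video_audio: float, music: float, has_voiceover: bool) -> float:
    """Mix one sample of the video's built-in audio with background music."""
    return video_audio + music * music_gain(has_voiceover)


print(music_gain(False))  # 0.3
print(music_gain(True))   # 0.2
```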
Right-click any video thumbnail in Video Archive and select Add Audio to add voiceover, music, or both to previously generated videos.
Here's an example video generated with LTX Video Generator:
E0B09876-A6BE-4A70-A1F3-09D7152F5003.mp4
Prompt used:
Create a 15-second cinematic product commercial for a sleek, premium TIME MACHINE device called "ChronoShift One."
Overall style: glossy tech product ad, filmed in 4K, smooth dolly and slider shots, soft studio lighting, subtle retro‑futuristic aesthetic (think brushed aluminum, glowing rings, clean UI). The time machine looks like a compact desktop appliance about the size of a toaster: brushed metal body, circular time dial with glowing blue light, small display, and a single illuminated control knob.
And a second run produced this one:
1BB3F818-F52F-4E5F-B52F-2DB9C18717FE.mp4
Prompt used:
Scene tone: quiet, reflective, fragmented memory. Cinematic realism, muted natural colors. Overcast but DRY weather. No rain, no raindrops, no wet falling precipitation.
START FRAME (0-2.5s)
Extreme close-up (85mm) of the elderly man's face. He breathes slowly. A tiny tremor in the lower eyelid. Strands of white hair drift gently in a light breeze.
Dialogue (man, barely above a whisper):
"I remember."
Motion: micro push-in only.
JUMP CUT 1 (2.5-5s)
Hard cut to an extreme close-up of his hands: weathered fingers rubbing a small object (a coin / pebble / ring) in his palm.
Dialogue (man):
"Not the day..."
Motion: hands move slowly, deliberately.
JUMP CUT 2 (5-7.5s)
Hard cut to close-up (50-85mm) of his boots stepping into soft mud at the lake edge. The movement is careful, almost hesitant. No splashing, just a quiet press into wet ground.
Dialogue (man):
"The feeling."
Motion: one slow step, then stillness.
JUMP CUT 3 (7.5-10s)
Hard cut to close-up of the lake surface: perfectly still water with faint ripples spreading outward (from a dropped pebble or a gentle touch).
Dialogue (man):
"It stayed."
```shell
# Clone the repository
git clone https://github.com/james-see/ltx-video-mac.git
cd ltx-video-mac

# Open in Xcode
open LTXVideoGenerator/LTXVideoGenerator.xcodeproj

# Or build from command line
./scripts/build-local.sh
```

- Frontend: SwiftUI
- Python Bridge: Subprocess execution with progress streaming
- ML Framework: MLX (Apple's machine learning framework)
- Models:
- LTX-2 Unified (~42GB, synchronized audio+video)
- LTX-2.3 Distilled Q4 (~19.4GB, synchronized audio+video)
- Precision: bfloat16
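One common way such a subprocess bridge works is for the Python side to print machine-readable progress lines that the host app parses from stdout. This is an illustrative sketch of the pattern, not the app's actual protocol:

```python
import re


def report_progress(step: int, total: int) -> None:
    """Emit a progress line a host app can parse from stdout."""
    print(f"PROGRESS {100 * step // total}", flush=True)


def parse_progress(line: str):
    """Parse a progress line; returns the percentage, or None for other output."""
    m = re.match(r"PROGRESS (\d+)$", line.strip())
    return int(m.group(1)) if m else None


report_progress(12, 48)               # prints "PROGRESS 25"
print(parse_progress("PROGRESS 25"))  # 25
print(parse_progress("loading VAE"))  # None
```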
Generation uses a 2-stage pipeline:
- Stage 1: Generate at half resolution
- Stage 2: Upsample and refine to full resolution
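Concretely, stage 1 renders at half the target resolution. This sketch assumes dimensions are kept at multiples of 32 (an illustrative rounding rule, not confirmed from the pipeline code):

```python
def stage1_resolution(width: int, height: int, multiple: int = 32):
    """Half the target resolution, rounded down to a safe multiple."""
    def snap(x: int) -> int:
        return max(multiple, (x // 2) // multiple * multiple)
    return snap(width), snap(height)


print(stage1_resolution(1024, 640))  # (512, 320)
```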
The download progress updates every 1%. Download time depends on selected model size (~19.4GB or ~42GB). Be patient.
- Reduce resolution (512x320 is fastest)
- Reduce frame count (25/33/49 recommended)
- Use 24 FPS
- Set VAE tiling to aggressive
- Close other applications
- 32GB RAM minimum, 64GB recommended
- Install Python via Homebrew: `brew install python@3.12`
- Or use pyenv: `pyenv install 3.12`
- Then click "Auto Detect" in Preferences
- This app supports multiple AV model repos, including `notapalindrome/ltx2-mlx-av` and `dgrauet/ltx-2.3-mlx-distilled-q4`.
- Converting additional upstream checkpoints can require package-level updates in `mlx-video-with-audio` before they run reliably here.
- Standard LTX LoRA workflows are not guaranteed to transfer directly to the MLX-converted AV path without conversion tooling support.
MIT License - see LICENSE for details.
- Lightricks for the LTX-2 model
- mlx-video-with-audio for unified audio-video generation
- MLX Community for the MLX-converted weights
- Blaizzy/mlx-video for the original MLX video generation code
- Hugging Face for model hosting
