This folder contains the deployment files for running Sesame CSM (Conversational Speech Model) on Cerebrium's serverless GPU platform.
This project is licensed under the MIT License - see the LICENSE file for details.
You are free to:
- Use this software for any purpose
- Copy, modify, and distribute it
- Use it commercially and privately
The only requirement is to include a copy of the license and copyright notice.
- Cerebrium Account: Sign up at dashboard.cerebrium.ai
- HuggingFace Access: You need access to these gated models:
  - sesame/csm-1b - Request access
  - meta-llama/Llama-3.2-1B - Request access
- HuggingFace Token: Generate at huggingface.co/settings/tokens
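Before deploying, you can check that your token has actually been granted access to both gated repos. A minimal sketch using the `huggingface_hub` client (reading the token from an `HF_TOKEN` environment variable is just for illustration):

```python
# Sanity check: confirm the token can reach both gated models.
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])  # the same token you will store as a Cerebrium secret

for repo_id in ("sesame/csm-1b", "meta-llama/Llama-3.2-1B"):
    api.model_info(repo_id)  # raises an error if access has not been granted
    print(f"access OK: {repo_id}")
```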
Install the Cerebrium CLI and log in:

```bash
pip install cerebrium --upgrade
cerebrium login
```

In your Cerebrium dashboard, go to Secrets and add:

- `HF_TOKEN`: Your HuggingFace token
- `HF_HUB_ENABLE_HF_TRANSFER`: 1
- `HF_HOME`: /persistent-storage/.cache/huggingface/hub
From this directory:
```bash
cerebrium deploy
```

The first deployment takes 5-10 minutes to download models and build the container.
After deployment, you'll receive an endpoint URL like:
https://api.cortex.cerebrium.ai/v4/YOUR_PROJECT_ID/sesame-csm-streaming
Use this URL in the main application's setup page.
Example request (non-streaming):

```python
import requests
import base64

response = requests.post(
    "https://api.cortex.cerebrium.ai/v4/YOUR_PROJECT/sesame-csm-streaming/predict",
    json={
        "text": "Hello, this is a test.",
        "speaker": 0,
        "stream": False
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)

result = response.json()
audio_bytes = base64.b64decode(result["result"]["audio_data"])
```
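The response carries the generated audio as base64. Its exact encoding is defined by generator.py in this deployment; assuming it is a complete WAV file, writing the decoded bytes to disk is enough to play it back:

```python
# Follow-up sketch: save the decoded audio. Assumes the endpoint returns a
# complete WAV file; if generator.py emits raw PCM instead, wrap the bytes
# with the `wave` module at the model's sample rate.
with open("output.wav", "wb") as f:
    f.write(audio_bytes)
```

To receive audio incrementally instead, set `"stream": True` and read the response as Server-Sent Events: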
```python
import requests
import json
import base64

response = requests.post(
    "https://api.cortex.cerebrium.ai/v4/YOUR_PROJECT/sesame-csm-streaming/predict",
    json={
        "text": "Hello, this is a streaming test.",
        "speaker": 0,
        "stream": True
    },
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Accept": "text/event-stream"
    },
    stream=True
)

for line in response.iter_lines():
    if line.startswith(b"data: "):
        data = json.loads(line[6:])
        if data.get("done"):
            break
        audio_chunk = base64.b64decode(data["audio"])
        # Process audio chunk...
```
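How you process each chunk depends on how generator.py encodes it. As a sketch, assuming each chunk is raw 16-bit mono PCM at 24 kHz (check generator.py for the actual chunk encoding and sample rate), you could replace the loop above with one that collects the chunks and writes a playable file:

```python
# Sketch: collect streamed chunks into a single WAV file.
# Assumes each chunk is raw 16-bit mono PCM at 24 kHz; verify against
# generator.py. Reuses `response`, `json`, and `base64` from the
# streaming example above.
import wave

pcm = bytearray()
for line in response.iter_lines():
    if line.startswith(b"data: "):
        data = json.loads(line[6:])
        if data.get("done"):
            break
        pcm.extend(base64.b64decode(data["audio"]))

with wave.open("streamed.wav", "wb") as wav:
    wav.setnchannels(1)      # mono
    wav.setsampwidth(2)      # 16-bit samples
    wav.setframerate(24000)  # assumed sample rate
    wav.writeframes(bytes(pcm))
```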
Edit cerebrium.toml to adjust:

- Hardware: Change `compute` for different GPUs (A10, A100, etc.)
- Scaling: Adjust `min_replicas` and `max_replicas` for autoscaling
- Memory: Increase `memory` if needed
- `cerebrium.toml` - Deployment configuration
- `main.py` - API endpoint handlers
- `generator.py` - CSM model wrapper with streaming
- `models.py` - Model architecture definitions
- `requirements.txt` - Python dependencies
Approximate costs on Cerebrium:
- A10 GPU: ~$0.0001/second
- Cold start: ~30-60 seconds (first request after idle)
- Warm inference: ~1-2 seconds for first audio chunk
Set `min_replicas = 1` in cerebrium.toml to avoid cold starts (this incurs idle costs).
"Model not found" error: Ensure your HF_TOKEN has access to the gated models.
Out of memory: Increase memory in cerebrium.toml or use a larger GPU.
Slow first response: This is the cold start. Set min_replicas = 1 for faster responses.