A Python script that splits large audio files into smaller chunks suitable for OpenAI's transcription API, which has a 25MB file size limit.
- Splits audio files based on file size limits (default 20MB)
- Supports multiple input formats: MP3, WAV, FLAC, OGG, M4A, WMA, and more
- Outputs to M4A format by default for optimal OpenAI compatibility and compression
- Configurable output format (MP3, WAV, M4A) and quality settings
- Maintains n8n workflow compatibility with structured output
- Smart chunk size calculation to prevent oversized files
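The chunk-size calculation can be sketched roughly as follows. This is a simplified illustration, not the script's actual implementation; `plan_chunks` and the 5% safety margin are assumptions for the example:

```python
import math

def plan_chunks(file_size_mb: float, max_mb: float = 20.0, safety: float = 0.95) -> int:
    """Estimate how many chunks are needed so each stays safely under max_mb.

    The safety factor aims each chunk below the hard limit, so re-encoded
    chunks don't end up oversized if the output bitrate drifts slightly.
    """
    target = max_mb * safety  # e.g. 19 MB for a 20 MB limit
    return max(1, math.ceil(file_size_mb / target))

# A 100 MB recording with a 20 MB limit needs 6 chunks (100 / 19 = 5.26 -> 6)
print(plan_chunks(100, 20))  # → 6
```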
- Python 3.6+
- FFmpeg installed on your system
- Ubuntu/Debian: `sudo apt-get install ffmpeg`
- macOS: `brew install ffmpeg`
- Windows: Download from ffmpeg.org
- Clone this repository:
  ```bash
  git clone <your-repo-url>
  cd scripts
  ```
- Create a virtual environment (recommended):
  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```
- Install Python dependencies:
  ```bash
  pip install -r requirements.txt
  ```
Basic usage:

```bash
python split_audio.py --input /path/to/audio.wma --output /path/to/output_dir
```

With options:

```bash
python split_audio.py --input audio.wma --output chunks/ --maxmb 20 --format m4a --quality medium --verbose
```

- `--input`: Path to input audio file (required)
- `--output`: Output directory for chunks (required)
- `--maxmb`: Maximum size in MB per chunk (default: 20; max recommended: 25)
- `--format`: Output format: mp3, wav, or m4a (default: m4a)
- `--quality`: Audio quality: high, medium, or low (default: medium)
- `--verbose`: Show detailed processing information
- `--no-log`: Disable logging to file (logging is enabled by default)
The script outputs "Exporting /path/to/chunk" messages to stdout for easy parsing in n8n workflows. Use the included `n8n_parser_improved.js` for robust parsing:
```javascript
const filePaths = $('Execute Audio Splitter Script').first().json.stdout
  .split("\n")
  .filter(line => line.includes("Exporting "))
  .map(line => line.replace("Exporting ", "").trim());
```

- High: 192 kbps, 44.1 kHz (best quality, larger files)
- Medium: 128 kbps, 44.1 kHz (recommended - good quality/size balance)
- Low: 96 kbps, 22.05 kHz (smallest files, may affect transcription quality)
Note: M4A format is ~30% more efficient than MP3, so M4A at medium quality often provides better results than MP3 at the same bitrate.
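Bitrate directly determines how much audio fits under the size limit; some back-of-the-envelope arithmetic (illustrative only, ignoring container overhead):

```python
def max_minutes_per_chunk(chunk_mb: float, bitrate_kbps: int) -> float:
    """How many minutes of audio fit in a chunk at a given constant bitrate."""
    bits = chunk_mb * 1024 * 1024 * 8        # chunk size in bits
    seconds = bits / (bitrate_kbps * 1000)   # bitrate is kilobits per second
    return round(seconds / 60, 1)

print(max_minutes_per_chunk(20, 128))  # medium quality → 21.8 minutes per chunk
print(max_minutes_per_chunk(20, 192))  # high quality → 14.6 minutes per chunk
```

So at medium quality a 20 MB chunk holds roughly 22 minutes of audio, versus about 15 minutes at high quality.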
The script automatically creates detailed logs in the `logs/` directory with timestamps. Each execution creates a new log file named `audio_splitter_YYYYMMDD_HHMMSS.log`.
Log contents include:
- Script execution details and parameters
- Input file analysis results
- Processing progress for each chunk
- Error messages and warnings
- Processing summary with file sizes
- Performance metrics
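Timestamped log file names like these are typically produced along the following lines; a minimal sketch, not the script's actual code (`make_log_filename` is a hypothetical helper):

```python
from datetime import datetime
import logging

def make_log_filename(now: datetime) -> str:
    """Build a file name matching the audio_splitter_YYYYMMDD_HHMMSS.log pattern."""
    return f"audio_splitter_{now:%Y%m%d_%H%M%S}.log"

# Typical wiring, one file per execution:
# logging.basicConfig(filename=f"logs/{make_log_filename(datetime.now())}", level=logging.INFO)
print(make_log_filename(datetime(2025, 7, 29, 11, 59, 40)))  # → audio_splitter_20250729_115940.log
```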
Managing log files:

```bash
# Keep logs from last 30 days
python cleanup_logs.py --days 30

# Keep only the 10 most recent log files
python cleanup_logs.py --count 10
```

To disable logging, use the `--no-log` flag:
```bash
python split_audio.py --input audio.wma --output chunks/ --no-log
```

Format Recommendations:
- M4A (default): Best compression efficiency, smaller files, fully compatible with OpenAI
- MP3: Universal compatibility, slightly larger files than M4A
- WAV: Uncompressed, largest files but highest quality
- "No audio stream found in the file": The input file may be corrupted or not contain audio
- Chunks exceed 25MB: Try using
--quality mediumor--quality low - FFmpeg not found: Ensure FFmpeg is installed and in your system PATH
- Check the logs: Look in the
logs/directory for detailed error information
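To diagnose the "No audio stream found" case yourself, one option is to inspect the file with ffprobe's JSON output. A sketch (`has_audio_stream` and `streams_contain_audio` are hypothetical helpers, not part of the script):

```python
import json
import subprocess

def streams_contain_audio(probe: dict) -> bool:
    """Check ffprobe's parsed JSON for at least one audio stream."""
    return any(s.get("codec_type") == "audio" for s in probe.get("streams", []))

def has_audio_stream(path: str) -> bool:
    """Probe a media file with ffprobe and report whether it contains audio."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json", "-show_streams", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return streams_contain_audio(json.loads(out))
```

A file where `has_audio_stream` returns False will fail splitting, so this check can be run up front in a pipeline.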
Deploy the audio splitter as a serverless API on Google Cloud Run for scalable, on-demand processing.
- RESTful API with FastAPI
- Google Cloud Storage integration for file storage
- Signed URLs for secure file downloads
- Webhook notifications for async processing
- Auto-scaling with Cloud Run
- Prerequisites:
  ```bash
  # Install Google Cloud SDK
  curl https://sdk.cloud.google.com | bash

  # Authenticate
  gcloud auth login
  gcloud config set project YOUR_PROJECT_ID
  ```
- Update deployment script:
  ```bash
  # Edit deploy.sh and set your project ID
  sed -i 's/your-project-id/YOUR_PROJECT_ID/g' deploy.sh
  ```
- Deploy to Cloud Run:
  ```bash
  ./deploy.sh
  ```
```bash
curl -X POST \
  -F "file=@audio.mp3" \
  -F "max_size_mb=20" \
  -F "output_format=m4a" \
  https://your-service.run.app/split
```

Response:

```json
{
  "job_id": "20250729115940_audio.mp3",
  "status": "completed",
  "total_chunks": 5,
  "chunks": [
    {
      "chunk_number": 1,
      "filename": "chunk_001.m4a",
      "size_mb": 19.8,
      "download_url": "https://storage.googleapis.com/..."
    }
  ]
}
```

```bash
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "gcs_path": "gs://my-bucket/audio.mp3",
    "max_size_mb": 20,
    "webhook_url": "https://myapp.com/webhook"
  }' \
  https://your-service.run.app/split-from-gcs
```

Replace your Execute Command node with an HTTP Request node:
- Method: POST
- URL: `https://your-service.run.app/split`
- Send Binary Data: Enable
- Binary Property: Select your audio file
- Options > Query Parameters: `max_size_mb: 20`, `output_format: m4a`
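Outside n8n, the same endpoints can be called from plain Python. A minimal sketch using only the standard library, here against the `/split-from-gcs` endpoint (the service URL is a placeholder, and `split_from_gcs`/`chunk_urls` are illustrative helper names):

```python
import json
from urllib import request

def split_from_gcs(service_url, gcs_path, max_size_mb=20, webhook_url=None):
    """POST a JSON splitting job to /split-from-gcs and return the parsed response."""
    payload = {"gcs_path": gcs_path, "max_size_mb": max_size_mb}
    if webhook_url:
        payload["webhook_url"] = webhook_url
    req = request.Request(
        f"{service_url}/split-from-gcs",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=600) as resp:  # splitting large files can take a while
        return json.load(resp)

def chunk_urls(job):
    """Extract the signed download URLs from a job response."""
    return [c["download_url"] for c in job.get("chunks", [])]
```

With a webhook URL supplied, the response can instead arrive asynchronously at your endpoint; `chunk_urls` works the same on either payload.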
- Basic API (`audio_splitter_api.py`):
  - Simple file upload/download
  - Temporary local storage
  - Good for testing
- GCS-Integrated API (`audio_splitter_gcs.py`):
  - Direct Google Cloud Storage integration
  - Signed URLs for secure access
  - Production-ready with webhooks
Environment variables for Cloud Run:
- `GCS_BUCKET_NAME`: Storage bucket for chunks
- `SIGNED_URL_EXPIRY_HOURS`: URL expiration time (default: 24)
- `PORT`: Server port (default: 8080)
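Reading these variables with the documented defaults might look like this inside the service (a sketch; `load_config` is a hypothetical helper, not the API's actual code):

```python
import os

def load_config(environ=os.environ):
    """Read service configuration, falling back to the documented defaults."""
    return {
        "bucket": environ.get("GCS_BUCKET_NAME", ""),
        "url_expiry_hours": int(environ.get("SIGNED_URL_EXPIRY_HOURS", "24")),
        "port": int(environ.get("PORT", "8080")),
    }

print(load_config({}))  # → {'bucket': '', 'url_expiry_hours': 24, 'port': 8080}
```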
- Memory: 2GB (configurable in cloudbuild.yaml)
- CPU: 2 vCPUs
- Timeout: 10 minutes
- Concurrency: 10 requests per instance
- Auto-scaling: 0-100 instances
View logs and metrics:

```bash
# View logs
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=audio-splitter" --limit 50

# View metrics in Cloud Console
gcloud run services describe audio-splitter --region=us-central1
```

- Files are automatically deleted after 7 days
- Cloud Run scales to zero when not in use
- Use `--max-instances` to control costs
The service uses the Google Cloud Storage bucket `audio-splitter-chunks-duhworks` with the following structure:
```
audio-splitter-chunks-duhworks/
├── chunks/                 # Audio file chunks
│   └── {job_id}/
│       ├── chunk_001.m4a
│       ├── chunk_002.m4a
│       └── ...
└── transcriptions/         # Transcription results
    └── {job_id}/
        └── full_transcript.txt
```
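Object paths following this layout can be built with small helpers like these (illustrative names, not part of the service code):

```python
def chunk_blob_path(job_id, n, ext="m4a"):
    """Object path for chunk n of a job, following the bucket layout above."""
    return f"chunks/{job_id}/chunk_{n:03d}.{ext}"

def transcript_blob_path(job_id):
    """Object path for a job's assembled transcript."""
    return f"transcriptions/{job_id}/full_transcript.txt"

print(chunk_blob_path("20250729115940_audio.mp3", 1))
# → chunks/20250729115940_audio.mp3/chunk_001.m4a
```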
Default: 7 days - All files (chunks and transcriptions) are automatically deleted after 7 days.
```bash
# View current lifecycle policy
gsutil lifecycle get gs://audio-splitter-chunks-duhworks

# Change to 30 days retention
cat > lifecycle.json << EOF
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {"age": 30}
      }
    ]
  }
}
EOF
gsutil lifecycle set lifecycle.json gs://audio-splitter-chunks-duhworks
rm lifecycle.json

# Verify the change
gsutil lifecycle get gs://audio-splitter-chunks-duhworks
```

```bash
# Keep chunks for 7 days, transcriptions for 90 days
cat > lifecycle.json << EOF
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {
          "age": 7,
          "matchesPrefix": ["chunks/"]
        }
      },
      {
        "action": {"type": "Delete"},
        "condition": {
          "age": 90,
          "matchesPrefix": ["transcriptions/"]
        }
      }
    ]
  }
}
EOF
gsutil lifecycle set lifecycle.json gs://audio-splitter-chunks-duhworks
rm lifecycle.json
```

```bash
# Move to Nearline storage after 30 days, delete after 365 days
cat > lifecycle.json << EOF
{
  "lifecycle": {
    "rule": [
      {
        "action": {
          "type": "SetStorageClass",
          "storageClass": "NEARLINE"
        },
        "condition": {
          "age": 30,
          "matchesStorageClass": ["STANDARD"]
        }
      },
      {
        "action": {"type": "Delete"},
        "condition": {"age": 365}
      }
    ]
  }
}
EOF
gsutil lifecycle set lifecycle.json gs://audio-splitter-chunks-duhworks
rm lifecycle.json
```

```bash
# Update Cloud Run environment variable (default: 24 hours)
gcloud run services update audio-splitter-drive \
  --update-env-vars SIGNED_URL_EXPIRY_HOURS=48 \
  --region us-central1
```

```bash
# Create new bucket
gsutil mb -p duhworks -c STANDARD -l us-central1 gs://new-bucket-name

# Update Cloud Run service
gcloud run services update audio-splitter-drive \
  --update-env-vars GCS_BUCKET_NAME=new-bucket-name \
  --region us-central1
```

```bash
gcloud run services update audio-splitter-drive \
  --update-env-vars GCS_CHUNK_PREFIX=audio-chunks/ \
  --region us-central1
```

```bash
# View bucket size
gsutil du -sh gs://audio-splitter-chunks-duhworks

# View detailed usage by folder
gsutil du -h gs://audio-splitter-chunks-duhworks/*

# List old files that will be deleted soon
gsutil ls -L gs://audio-splitter-chunks-duhworks/** | grep -B1 "Creation time" | grep -B1 "$(date -d '6 days ago' '+%Y-%m-%d')"
```

- For Production: Use 30-90 day retention for transcriptions, 7-14 days for audio chunks
- For Compliance: Implement separate buckets with different retention policies
- For Cost Optimization:
- Use lifecycle rules to transition to cheaper storage classes
- Monitor bucket size regularly
- Consider shorter retention for large audio files
With current 7-day retention:
- Standard storage: ~$0.020/GB/month
- Effective cost with 7-day retention: ~$0.0047/GB
- Example: Processing 100GB/month costs ~$0.47 in storage
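That estimate follows from prorating the monthly price by how long files actually live; a quick check of the arithmetic (the 30-day month is a simplifying assumption):

```python
def monthly_storage_cost(gb_per_month, retention_days, price_per_gb_month=0.020):
    """Approximate storage cost when files are deleted after retention_days."""
    # Files live only retention_days out of a ~30-day month, so the
    # effective price per GB is prorated accordingly.
    return round(gb_per_month * price_per_gb_month * retention_days / 30, 2)

print(monthly_storage_cost(100, 7))  # → 0.47
```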
This project is licensed under the MIT License.