Audio File Splitter for OpenAI Transcription

A Python script that splits large audio files into smaller chunks suitable for OpenAI's transcription API, which has a 25MB file size limit.

Features

Splits audio files based on file size limits (default 20MB)
Supports multiple input formats: MP3, WAV, FLAC, OGG, M4A, WMA, and more
Outputs to M4A format by default for optimal OpenAI compatibility and compression
Configurable output format (MP3, WAV, M4A) and quality settings
Maintains n8n workflow compatibility with structured output
Smart chunk size calculation to prevent oversized files

Prerequisites

Python 3.6+
FFmpeg installed on your system
- Ubuntu/Debian: sudo apt-get install ffmpeg
- macOS: brew install ffmpeg
- Windows: Download from ffmpeg.org

Installation

Clone this repository:
```
git clone <your-repo-url>
cd scripts
```

Create a virtual environment (recommended):

```
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

Install Python dependencies:
```
pip install -r requirements.txt
```

Usage

Basic usage:

```
python split_audio.py --input /path/to/audio.wma --output /path/to/output_dir
```

With options:

```
python split_audio.py --input audio.wma --output chunks/ --maxmb 20 --format m4a --quality medium --verbose
```

Options

--input: Path to input audio file (required)
--output: Output directory for chunks (required)
--maxmb: Maximum size in MB per chunk (default: 20, max recommended: 25)
--format: Output format - mp3, wav, or m4a (default: m4a)
--quality: Audio quality - high, medium, or low (default: medium)
--verbose: Show detailed processing information
--no-log: Disable logging to file (logs are enabled by default)
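
For readers who want to wrap or test the command line, the options above could be declared roughly as follows. This is a minimal sketch built only from the documented flags and defaults; `build_parser` is a hypothetical name and the real split_audio.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented options (illustrative, not the real script)."""
    parser = argparse.ArgumentParser(description="Split audio into size-limited chunks")
    parser.add_argument("--input", required=True, help="Path to input audio file")
    parser.add_argument("--output", required=True, help="Output directory for chunks")
    parser.add_argument("--maxmb", type=float, default=20, help="Maximum size in MB per chunk")
    parser.add_argument("--format", choices=["mp3", "wav", "m4a"], default="m4a")
    parser.add_argument("--quality", choices=["high", "medium", "low"], default="medium")
    parser.add_argument("--verbose", action="store_true", help="Show detailed processing information")
    # Assumption: --no-log is a simple boolean switch stored as args.no_log.
    parser.add_argument("--no-log", action="store_true", dest="no_log")
    return parser


# Only the two required options are given, so every other value is a default.
args = build_parser().parse_args(["--input", "audio.wma", "--output", "chunks/"])
print(args.maxmb, args.format, args.quality, args.no_log)
```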

n8n Integration

The script outputs "Exporting /path/to/chunk" messages to stdout for easy parsing in n8n workflows. Use the included n8n_parser_improved.js for robust parsing:

```
const filePaths = $('Execute Audio Splitter Script').first().json.stdout
    .split("\n")
    .filter(line => line.includes("Exporting "))
    .map(line => line.replace("Exporting ", "").trim());
```
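
When testing the script outside n8n, the same parsing can be done in Python. The sketch below assumes only the "Exporting /path/to/chunk" line format described above; `parse_exported_paths` is a hypothetical helper name. Unlike the JavaScript snippet, it keeps only the text after the marker, so a log prefix in front of "Exporting " would be dropped.

```python
from typing import List


def parse_exported_paths(stdout: str) -> List[str]:
    """Return the chunk paths announced on 'Exporting <path>' lines."""
    marker = "Exporting "
    paths = []
    for line in stdout.splitlines():
        if marker in line:
            # Keep everything after the first marker and trim whitespace.
            paths.append(line.split(marker, 1)[1].strip())
    return paths


sample = "Analyzing input\nExporting /tmp/chunks/chunk_001.m4a\nExporting /tmp/chunks/chunk_002.m4a\n"
print(parse_exported_paths(sample))
```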

Output Quality Settings

High: 192 kbps, 44.1 kHz (best quality, larger files)
Medium: 128 kbps, 44.1 kHz (recommended - good quality/size balance)
Low: 96 kbps, 22.05 kHz (smallest files, may affect transcription quality)

Note: M4A output (typically AAC-encoded) is roughly 30% more efficient than MP3, so M4A at medium quality often gives better results than MP3 at the same bitrate.
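
The bitrates above determine how much audio fits in one chunk. The following sketch shows the arithmetic under stated assumptions: a constant bitrate, 1 MB counted as 1024 * 1024 bytes, and a 5% safety margin for container overhead. `max_chunk_seconds` is a hypothetical helper, and the script's own chunk size calculation may differ.

```python
def max_chunk_seconds(max_mb: float, bitrate_kbps: int, safety: float = 0.95) -> float:
    """Longest chunk duration (seconds) expected to stay under max_mb."""
    max_bits = max_mb * 1024 * 1024 * 8           # size budget in bits
    return max_bits * safety / (bitrate_kbps * 1000)  # kbps means 1000 bits per second


for name, kbps in [("high", 192), ("medium", 128), ("low", 96)]:
    minutes = max_chunk_seconds(20, kbps) / 60
    print(f"{name}: about {minutes:.1f} minutes per 20 MB chunk")
```

With these assumptions, a 20 MB chunk at medium quality holds a little over 20 minutes of audio.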

Logging

The script automatically creates detailed logs in the logs/ directory with timestamps. Each execution creates a new log file named audio_splitter_YYYYMMDD_HHMMSS.log.

Log contents include:

Script execution details and parameters
Input file analysis results
Processing progress for each chunk
Error messages and warnings
Processing summary with file sizes
Performance metrics
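
The log file naming scheme described above can be reproduced with `strftime`. This is a small sketch for illustration; `log_filename` is a hypothetical helper, not necessarily how the script builds the name.

```python
from datetime import datetime
from typing import Optional


def log_filename(now: Optional[datetime] = None) -> str:
    """Build a name of the form audio_splitter_YYYYMMDD_HHMMSS.log."""
    now = now or datetime.now()
    return now.strftime("audio_splitter_%Y%m%d_%H%M%S.log")


print(log_filename(datetime(2025, 7, 29, 11, 59, 40)))
```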

Managing log files:

```
# Keep logs from last 30 days
python cleanup_logs.py --days 30

# Keep only the 10 most recent log files
python cleanup_logs.py --count 10
```

To disable logging, use the --no-log flag:

```
python split_audio.py --input audio.wma --output chunks/ --no-log
```

Format Recommendations

M4A (default): Best compression efficiency, smaller files, fully compatible with OpenAI
MP3: Universal compatibility, slightly larger files than M4A
WAV: Uncompressed, largest files but highest quality

Troubleshooting

"No audio stream found in the file": The input file may be corrupted or not contain audio
Chunks exceed 25MB: Lower --maxmb, or try --quality medium or --quality low
FFmpeg not found: Ensure FFmpeg is installed and in your system PATH
Check the logs: Look in the logs/ directory for detailed error information
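
A quick way to confirm the "FFmpeg not found" case before running the splitter is to look the binary up on PATH from Python. This is a minimal sketch using the standard library; `ffmpeg_available` is a hypothetical helper name.

```python
import shutil


def ffmpeg_available(binary: str = "ffmpeg") -> bool:
    """Return True if the given executable can be found on the system PATH."""
    return shutil.which(binary) is not None


if not ffmpeg_available():
    print("FFmpeg not found: install it and make sure it is on your PATH")
```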

Serverless Deployment (Google Cloud Run)

Deploy the audio splitter as a serverless API on Google Cloud Run for scalable, on-demand processing.

Features

RESTful API with FastAPI
Google Cloud Storage integration for file storage
Signed URLs for secure file downloads
Webhook notifications for async processing
Auto-scaling with Cloud Run

Quick Deploy

Prerequisites:

```
# Install Google Cloud SDK
curl https://sdk.cloud.google.com | bash

# Authenticate
gcloud auth login
gcloud config set project YOUR_PROJECT_ID
```

Update deployment script:

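```
# Edit deploy.sh and set your project ID
# (GNU sed syntax; on macOS use: sed -i '' 's/your-project-id/YOUR_PROJECT_ID/g' deploy.sh)
sed -i 's/your-project-id/YOUR_PROJECT_ID/g' deploy.sh
```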

Deploy to Cloud Run:
```
./deploy.sh
```

API Endpoints

Upload and Split Audio

```
curl -X POST \
  -F "file=@audio.mp3" \
  -F "max_size_mb=20" \
  -F "output_format=m4a" \
  https://your-service.run.app/split
```

Response:

```
{
  "job_id": "20250729115940_audio.mp3",
  "status": "completed",
  "total_chunks": 5,
  "chunks": [
    {
      "chunk_number": 1,
      "filename": "chunk_001.m4a",
      "size_mb": 19.8,
      "download_url": "https://storage.googleapis.com/..."
    }
  ]
}
```
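
A client consuming this response might pull out the download URLs as sketched below. The field names come from the example response above; the error handling and the `chunk_downloads` helper name are assumptions, and the sample keeps the elided URL exactly as shown.

```python
import json
from typing import List, Tuple


def chunk_downloads(response_text: str) -> List[Tuple[str, str]]:
    """Return (filename, download_url) pairs ordered by chunk_number."""
    payload = json.loads(response_text)
    if payload.get("status") != "completed":
        # Assumption: any status other than "completed" means chunks are not ready.
        raise ValueError(f"job not completed: {payload.get('status')!r}")
    chunks = sorted(payload.get("chunks", []), key=lambda c: c["chunk_number"])
    return [(c["filename"], c["download_url"]) for c in chunks]


sample = """
{
  "job_id": "20250729115940_audio.mp3",
  "status": "completed",
  "total_chunks": 5,
  "chunks": [
    {"chunk_number": 1, "filename": "chunk_001.m4a", "size_mb": 19.8,
     "download_url": "https://storage.googleapis.com/..."}
  ]
}
"""
print(chunk_downloads(sample))
```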

Process File from GCS

```
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "gcs_path": "gs://my-bucket/audio.mp3",
    "max_size_mb": 20,
    "webhook_url": "https://myapp.com/webhook"
  }' \
  https://your-service.run.app/split-from-gcs
```
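
The same request can be prepared from Python with the standard library. The sketch below only builds the request object; sending it with `urllib.request.urlopen` is left to the caller. It assumes the JSON fields shown in the curl example and treats `webhook_url` as optional, which the service may or may not allow. `build_split_from_gcs_request` is a hypothetical helper name.

```python
import json
import urllib.request
from typing import Optional


def build_split_from_gcs_request(
    base_url: str,
    gcs_path: str,
    max_size_mb: int = 20,
    webhook_url: Optional[str] = None,
) -> urllib.request.Request:
    """Prepare a POST to /split-from-gcs with a JSON body."""
    body = {"gcs_path": gcs_path, "max_size_mb": max_size_mb}
    if webhook_url:
        body["webhook_url"] = webhook_url
    return urllib.request.Request(
        base_url.rstrip("/") + "/split-from-gcs",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_split_from_gcs_request(
    "https://your-service.run.app",
    "gs://my-bucket/audio.mp3",
    webhook_url="https://myapp.com/webhook",
)
print(req.get_method(), req.full_url)
```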

n8n Integration

Replace your Execute Command node with an HTTP Request node:

Method: POST
URL: https://your-service.run.app/split
Send Binary Data: Enable
Binary Property: Select your audio file
Options > Query Parameters:
- max_size_mb: 20
- output_format: m4a

Architecture Options

Basic API (audio_splitter_api.py):
- Simple file upload/download
- Temporary local storage
- Good for testing

GCS-Integrated API (audio_splitter_gcs.py):
- Direct Google Cloud Storage integration
- Signed URLs for secure access
- Production-ready with webhooks

Configuration

Environment variables for Cloud Run:

GCS_BUCKET_NAME: Storage bucket for chunks
SIGNED_URL_EXPIRY_HOURS: URL expiration time (default: 24)
PORT: Server port (default: 8080)
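
A service could read these variables with the documented defaults along the following lines. This is a sketch, not the deployed code: `Settings` and `load_settings` are hypothetical names, and treating GCS_BUCKET_NAME as required is an assumption.

```python
import os
from typing import Mapping, NamedTuple


class Settings(NamedTuple):
    bucket_name: str
    signed_url_expiry_hours: int
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from environment variables, applying documented defaults."""
    bucket = env.get("GCS_BUCKET_NAME")
    if not bucket:
        # Assumption: the bucket has no default and must be configured explicitly.
        raise RuntimeError("GCS_BUCKET_NAME must be set")
    return Settings(
        bucket_name=bucket,
        signed_url_expiry_hours=int(env.get("SIGNED_URL_EXPIRY_HOURS", "24")),
        port=int(env.get("PORT", "8080")),
    )


print(load_settings({"GCS_BUCKET_NAME": "my-bucket", "SIGNED_URL_EXPIRY_HOURS": "48"}))
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test.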

Performance

Memory: 2GB (configurable in cloudbuild.yaml)
CPU: 2 vCPUs
Timeout: 10 minutes
Concurrency: 10 requests per instance
Auto-scaling: 0-100 instances

Monitoring

View logs and metrics:

```
# View logs
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=audio-splitter" --limit 50

# Show service configuration and status (metrics are available in the Cloud Console)
gcloud run services describe audio-splitter --region=us-central1
```

Cost Optimization

Files are automatically deleted after 7 days
Cloud Run scales to zero when not in use
Use --max-instances to control costs

Storage Configuration & Retention

Current Setup

The service uses Google Cloud Storage bucket audio-splitter-chunks-duhworks with the following structure:

```
audio-splitter-chunks-duhworks/
├── chunks/                    # Audio file chunks
│   └── {job_id}/
│       ├── chunk_001.m4a
│       ├── chunk_002.m4a
│       └── ...
└── transcriptions/           # Transcription results
    └── {job_id}/
        └── full_transcript.txt
```
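
The layout above maps to object names that can be built with two small helpers. This is an illustrative sketch; the function names are hypothetical, and the zero-padded three-digit chunk index is inferred from the example file names.

```python
def chunk_object_name(job_id: str, index: int, ext: str = "m4a", prefix: str = "chunks/") -> str:
    """Object name for one chunk, e.g. chunks/{job_id}/chunk_001.m4a."""
    return f"{prefix}{job_id}/chunk_{index:03d}.{ext}"


def transcript_object_name(job_id: str) -> str:
    """Object name for the combined transcript of a job."""
    return f"transcriptions/{job_id}/full_transcript.txt"


job = "20250729115940_audio.mp3"
print(chunk_object_name(job, 1))
print(transcript_object_name(job))
```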

Retention Policy

Default: 7 days - All files (chunks and transcriptions) are automatically deleted after 7 days.

Changing Retention Without Downtime

Method 1: Quick Update via gsutil (No rebuild required)

```
# View current lifecycle policy
gsutil lifecycle get gs://audio-splitter-chunks-duhworks

# Change to 30 days retention
cat > lifecycle.json << EOF
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {"age": 30}
      }
    ]
  }
}
EOF
gsutil lifecycle set lifecycle.json gs://audio-splitter-chunks-duhworks
rm lifecycle.json

# Verify the change
gsutil lifecycle get gs://audio-splitter-chunks-duhworks
```

Method 2: Different Retention for Chunks vs Transcriptions

```
# Keep chunks for 7 days, transcriptions for 90 days
cat > lifecycle.json << EOF
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {
          "age": 7,
          "matchesPrefix": ["chunks/"]
        }
      },
      {
        "action": {"type": "Delete"},
        "condition": {
          "age": 90,
          "matchesPrefix": ["transcriptions/"]
        }
      }
    ]
  }
}
EOF
gsutil lifecycle set lifecycle.json gs://audio-splitter-chunks-duhworks
rm lifecycle.json
```
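
If retention periods change often, the per-prefix lifecycle file can be generated rather than hand-edited. The sketch below produces the same JSON shape as Method 2; `lifecycle_config` is a hypothetical helper, and the output would still be applied with gsutil lifecycle set as shown above.

```python
import json
from typing import Dict


def lifecycle_config(retention_days_by_prefix: Dict[str, int]) -> dict:
    """Build a lifecycle policy with one Delete rule per object prefix."""
    rules = [
        {
            "action": {"type": "Delete"},
            "condition": {"age": days, "matchesPrefix": [prefix]},
        }
        for prefix, days in retention_days_by_prefix.items()
    ]
    return {"lifecycle": {"rule": rules}}


config = lifecycle_config({"chunks/": 7, "transcriptions/": 90})
print(json.dumps(config, indent=2))
```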

Method 3: Archive to Cheaper Storage

```
# Move to Nearline storage after 30 days, delete after 365 days
cat > lifecycle.json << EOF
{
  "lifecycle": {
    "rule": [
      {
        "action": {
          "type": "SetStorageClass",
          "storageClass": "NEARLINE"
        },
        "condition": {
          "age": 30,
          "matchesStorageClass": ["STANDARD"]
        }
      },
      {
        "action": {"type": "Delete"},
        "condition": {"age": 365}
      }
    ]
  }
}
EOF
gsutil lifecycle set lifecycle.json gs://audio-splitter-chunks-duhworks
rm lifecycle.json
```

Other Runtime Configuration Changes (No Rebuild)

1. Change Signed URL Expiry Time

```
# Update Cloud Run environment variable (default: 24 hours)
gcloud run services update audio-splitter-drive \
  --update-env-vars SIGNED_URL_EXPIRY_HOURS=48 \
  --region us-central1
```

2. Update Bucket Name

```
# Create new bucket
gsutil mb -p duhworks -c STANDARD -l us-central1 gs://new-bucket-name

# Update Cloud Run service
gcloud run services update audio-splitter-drive \
  --update-env-vars GCS_BUCKET_NAME=new-bucket-name \
  --region us-central1
```

3. Change Chunk Storage Prefix

```
gcloud run services update audio-splitter-drive \
  --update-env-vars GCS_CHUNK_PREFIX=audio-chunks/ \
  --region us-central1
```

Monitoring Storage Usage

```
# View bucket size
gsutil du -sh gs://audio-splitter-chunks-duhworks

# View detailed usage by folder
gsutil du -h gs://audio-splitter-chunks-duhworks/*

# List old files that will be deleted soon (requires GNU date)
# gsutil prints creation times like "Tue, 29 Jul 2025 11:59:40 GMT", so match on "DD Mon YYYY"
gsutil ls -L gs://audio-splitter-chunks-duhworks/** | grep -B1 "Creation time" | grep -B1 "$(LC_ALL=C date -u -d '6 days ago' '+%d %b %Y')"
```

Best Practices

For Production: Use 30-90 day retention for transcriptions, 7-14 days for audio chunks
For Compliance: Implement separate buckets with different retention policies
For Cost Optimization:
- Use lifecycle rules to transition to cheaper storage classes
- Monitor bucket size regularly
- Consider shorter retention for large audio files

Storage Costs

With current 7-day retention:

Standard storage: ~$0.020/GB/month
Effective cost with 7-day retention: ~$0.0047/GB
Example: Processing 100GB/month costs ~$0.47 in storage
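
The figures above follow from prorating the monthly price over the retention window. The sketch below reproduces that arithmetic, assuming a 30-day month and that every uploaded gigabyte is stored for exactly the retention period; `effective_storage_cost` is a hypothetical helper name.

```python
def effective_storage_cost(
    gb: float,
    retention_days: int,
    price_per_gb_month: float = 0.020,
    days_per_month: int = 30,
) -> float:
    """Storage cost in USD for data kept only for retention_days."""
    return gb * price_per_gb_month * retention_days / days_per_month


print(f"per GB: ${effective_storage_cost(1, 7):.4f}")
print(f"100 GB/month: ${effective_storage_cost(100, 7):.2f}")
```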

License

This project is licensed under the MIT License.
