Semantic audio embeddings + PCA/UMAP visualization using MERT
This project builds a searchable, analyzable vector-space representation of popular music. Given a list of song titles, the system:
- Resolves each song into canonical metadata (artist, album, year, etc.)
- Obtains audio from a legal source
- Extracts semantic music embeddings using MERT-v1-95M
- Stores metadata and embeddings in a database (Postgres + pgvector recommended)
- Deletes audio files after embedding
- Provides tools for PCA/UMAP visualization and similarity search
The goal is to produce a "map of music" where similar songs cluster naturally based on genre, mood, timbre, instrumentation, and production style.
Key features:
- Music-specific semantic embeddings using MERT (Music Encoder Representations from Transformers)
- Robust audio ingestion → chunking → processing workflow
- Metadata resolution via external music APIs (e.g., Spotify / MusicBrainz)
- Clean, modular ETL design (ingest → process → embed → store → clean)
- Vector database support using pgvector
- Tools for PCA, UMAP, and similarity search
- Track-level robustness: retries, status flags, resumable processing
- Scales to thousands of songs on a small GPU, or CPU-only (slower)
```
src/
  ingest/
    song_list_ingest.py       # Import initial list of songs into DB
    metadata_resolver.py      # Match titles to canonical track metadata
  audio/
    fetcher.py                # Fetch audio from a legal source
    preprocess.py             # Load, resample, and chunk audio
  models/
    mert_embedder.py          # MERT embedding pipeline
  db/
    schema.sql                # Track + embedding tables (Postgres/pgvector)
    repository.py             # Database interaction layer
  workers/
    process_track_worker.py   # Main track-processing workflow
  analysis/
    pca_umap.py               # Dimensionality reduction tools
    visualize.ipynb           # Notebook for plots & clustering
README.md
```
Input: list of song titles
Output: (track_metadata, embedding_vector) saved in DB
Pipeline:
- Ingest: read titles → resolve canonical metadata → create DB rows
- Audio Fetch: retrieve audio from a legal source
- Chunk: split track into fixed-length segments (e.g., 10–15 sec)
- Embed: compute MERT embeddings per chunk → pool into one track vector
- Store: save vector + metadata → mark status = "done"
- Clean: delete audio file and temporary data
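The chunk-and-pool steps of this pipeline can be sketched as follows. This is a minimal outline, not the repo's actual code: the 12 s chunk length and the fake per-chunk vectors are illustrative stand-ins, and MERT-v1 models consume 24 kHz mono audio.

```python
import numpy as np

CHUNK_SEC = 12          # within the 10-15 s range suggested above
SAMPLE_RATE = 24_000    # MERT-v1-95M expects 24 kHz mono audio

def chunk_waveform(wave: np.ndarray, sr: int = SAMPLE_RATE,
                   chunk_sec: int = CHUNK_SEC) -> list[np.ndarray]:
    """Split a mono waveform into fixed-length chunks, dropping the tail."""
    n = sr * chunk_sec
    return [wave[i:i + n] for i in range(0, len(wave) - n + 1, n)]

def pool_embeddings(chunk_vecs: list[np.ndarray]) -> np.ndarray:
    """Mean-pool per-chunk vectors into a single track-level vector."""
    return np.stack(chunk_vecs).mean(axis=0)

# Toy run: 30 s of silence -> two full 12 s chunks -> one 768-d track vector
wave = np.zeros(30 * SAMPLE_RATE, dtype=np.float32)
chunks = chunk_waveform(wave)
chunk_vecs = [np.full(768, float(i)) for i, _ in enumerate(chunks)]
track_vec = pool_embeddings(chunk_vecs)
```

Mean pooling is the simplest choice here; max pooling or attention-weighted pooling are common alternatives.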
MERT is a transformer trained on large music datasets. Its embeddings capture high-level musical properties such as:
- Genre
- Mood / emotional tone
- Timbre & instrumentation
- Production style
- Acoustic vs electronic characteristics
MERT-v1-95M is chosen for:
- Good semantic quality
- Fast embedding
- Low hardware requirements
- Easy deployment (HuggingFace Transformers)
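A sketch of chunk embedding with HuggingFace Transformers. The `embed_chunk` body shows the model call for orientation (it needs `torch` and `transformers` installed and downloads the checkpoint on first use); only the pure layer/time pooling helper is exercised with fake layer outputs below.

```python
import numpy as np

def pool_hidden_states(hidden_states: list[np.ndarray]) -> np.ndarray:
    """Average a stack of (time, dim) layer outputs over layers and time,
    yielding one fixed-size vector per chunk."""
    return np.stack(hidden_states).mean(axis=(0, 1))

def embed_chunk(wave_24k: np.ndarray) -> np.ndarray:
    """Sketch of the actual MERT call (not run here)."""
    import torch
    from transformers import AutoModel, Wav2Vec2FeatureExtractor
    model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M",
                                      trust_remote_code=True)
    processor = Wav2Vec2FeatureExtractor.from_pretrained(
        "m-a-p/MERT-v1-95M", trust_remote_code=True)
    inputs = processor(wave_24k, sampling_rate=24_000, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    layers = [h.squeeze(0).numpy() for h in out.hidden_states]
    return pool_hidden_states(layers)

# Offline check of the pooling step with 13 fake layer outputs (50 frames each)
fake_layers = [np.full((50, 768), float(i)) for i in range(13)]
vec = pool_hidden_states(fake_layers)
```

Averaging across all hidden layers is one reasonable default; the MERT authors also discuss per-layer or learned-weight combinations.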
Recommended backend: Postgres + pgvector.
```sql
CREATE TABLE tracks (
    id SERIAL PRIMARY KEY,
    title TEXT,
    artist TEXT,
    album TEXT,
    release_year INT,
    genres TEXT[],
    isrc TEXT,
    source_id TEXT,
    duration_ms INT,
    embedding VECTOR(768),
    status TEXT DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
```

Setup:

```bash
git clone <your_repo_url>
cd music-embedding
pip install -r requirements.txt
```

Requirements include:
- torch
- torchaudio
- transformers
- librosa
- psycopg2 or asyncpg
- pgvector (server extension)
- numpy, pandas
- umap-learn (optional)
```bash
python src/ingest/song_list_ingest.py data/songs.csv
```

CSV format:

```csv
title,artist
Blinding Lights,The Weeknd
Shape of You,Ed Sheeran
...
```

Resolve metadata:

```bash
python src/ingest/metadata_resolver.py
```

Then run the main worker:

```bash
python src/workers/process_track_worker.py
```

This:
- Fetches audio
- Chunks it
- Runs MERT
- Stores embeddings + metadata
- Deletes audio
```bash
python src/analysis/pca_umap.py
```

Or use the notebook:

```bash
jupyter notebook src/analysis/visualize.ipynb
```

This project includes tools to produce:
- 2D PCA plots
- UMAP embedding maps
- Clustering diagrams
- Genre or artist-overlaid scatterplots
These help validate that the embedding space captures meaningful musical structure.
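As an illustration, a dependency-light 2-D PCA over a matrix of track embeddings, using a plain NumPy SVD rather than scikit-learn (the random matrix stands in for real MERT vectors):

```python
import numpy as np

def pca_2d(X: np.ndarray) -> np.ndarray:
    """Project rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)                      # center each dimension
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                         # (n_tracks, 2) coordinates

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))         # stand-in for MERT vectors
coords = pca_2d(embeddings)
```

`coords` can be scattered directly with matplotlib or Plotly, colored by genre or artist.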
You must ensure you have the legal right to download, store, or process audio. This project does not include code to download copyrighted audio from YouTube or other services in violation of their Terms of Service.
Safe options include:
- Legally purchased files
- Public-domain audio
- Licensed datasets
- Short previews (if allowed under applicable API terms)
This architecture supports incremental scaling through:
- Job queues (Redis, RQ, Celery, etc.)
- Parallel workers
- Docker-based batch jobs
- Cloud GPU instances (optional; not required for a 3k–5k track catalog)
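A toy sketch of the parallel-worker idea using a thread pool; `process_track` here is a hypothetical placeholder for the real fetch → chunk → embed → store → clean routine:

```python
from concurrent.futures import ThreadPoolExecutor

def process_track(track_id: int) -> tuple[int, str]:
    """Placeholder worker body. The real one would flip the DB status flag
    ('pending' -> 'processing' -> 'done') so interrupted runs are resumable."""
    return track_id, "done"

# Process ten track IDs with four concurrent workers
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_track, range(10)))
```

For GPU-bound embedding, separate processes (or a Redis-backed queue like RQ or Celery) would replace threads, since the model call dominates each job.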
The system is designed to be modular so you can swap:
- Audio sources
- Embedding models
- Metadata providers
- Database backends
without rewriting core logic.
Possible extensions:
- Add MuLan or larger MERT models for richer embeddings
- Integrate pgvector similarity search (kNN neighbors)
- Build an interactive music map frontend (Plotly, Streamlit, or web)
- Analyze playlists, charts, decades, genres
- Build a recommendation engine using nearest neighbors
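A brute-force cosine nearest-neighbour lookup illustrating the recommendation idea; in production the same query would run inside Postgres via pgvector's distance operators (e.g. `ORDER BY embedding <=> $1 LIMIT k`). The four orthogonal "tracks" are toy data.

```python
import numpy as np

def nearest_tracks(query: np.ndarray, catalog: np.ndarray,
                   k: int = 3) -> np.ndarray:
    """Return indices of the k most cosine-similar rows of `catalog`."""
    q = query / np.linalg.norm(query)
    c = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]

catalog = np.eye(4)                         # four orthogonal toy "tracks"
neighbours = nearest_tracks(catalog[2] + 0.1, catalog, k=2)
```
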
One natural extension of this project is to go beyond track-level embeddings and analyze music at the level of song sections. Because the pipeline already processes each track in short fixed-length chunks (e.g., 10–15 s), we can attach additional structure and semantics to these chunks, enabling several powerful analytical and visualization capabilities.
Each chunk can be labeled as:
- Intro
- Verse
- Chorus
- Bridge
- Outro
- (Optional: "Pre-chorus", "Drop", etc.)
Section identification could be implemented using:
- simple heuristics (e.g., first 20–30 seconds = intro)
- audio pattern analysis
- supervised models trained for structure segmentation
- alignment to externally provided timestamps (e.g., if metadata exists)
These labels allow the database to store both entire-track embeddings and section-specific embeddings.
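The simple positional heuristic could look like this. It is purely illustrative (first chunk = intro, last = outro, interior alternating verse/chorus); a real deployment would use one of the segmentation approaches listed above.

```python
def heuristic_labels(n_chunks: int) -> list[str]:
    """Crude positional labeling: a stand-in for a structure-segmentation
    model, usable only as a first baseline."""
    if n_chunks <= 1:
        return ["intro"][:n_chunks]
    labels = ["intro"]
    for i in range(1, n_chunks - 1):
        labels.append("verse" if i % 2 == 1 else "chorus")
    labels.append("outro")
    return labels

labels = heuristic_labels(5)
```
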
With section labels, we can generate specialized semantic maps:
- Chorus-only PCA
- Verse-only PCA
- Intro-only embeddings
- Bridge clustering maps
Since choruses tend to encode a song's "core identity," analyzing only these chunks might produce even cleaner genre and mood clusters than full-track embeddings.
This unlocks experiments like:
- Comparing verse vs. chorus semantic spaces
- Seeing whether choruses cluster more tightly across genres
- Studying how intros set up (or mislead) listener expectations
For any song, we can visualize where each of its chunks sits in embedding space.
Example:
- Chorus chunks cluster tightly together
- Verse chunks form their own local neighborhood
- Bridges may sit farther away if the musical texture changes
- Outros may drift outwards due to reduced instrumentation or fading energy
This produces a micro-map for each song, showing how different sections relate to the overall music space and to each other.
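One way to quantify such a micro-map: compute a centroid per section label and measure how far each section sits from the whole-track centroid. The three-dimensional toy vectors below are illustrative only.

```python
import numpy as np

def section_distances(chunk_vecs: list[np.ndarray],
                      labels: list[str]) -> dict[str, float]:
    """Distance of each section's centroid from the overall track centroid."""
    vecs = np.stack(chunk_vecs)
    track_centroid = vecs.mean(axis=0)
    dists = {}
    for lab in set(labels):
        idx = [i for i, l in enumerate(labels) if l == lab]
        centroid = vecs[idx].mean(axis=0)
        dists[lab] = float(np.linalg.norm(centroid - track_centroid))
    return dists

# Toy track: two identical chorus chunks plus one far-away bridge chunk
chunks = [np.zeros(3), np.zeros(3), np.array([3.0, 0.0, 0.0])]
dists = section_distances(chunks, ["chorus", "chorus", "bridge"])
```

Here the bridge lands farther from the track centroid than the chorus, matching the intuition that texture changes push sections outward.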
Potential interfaces include:
- scatterplots of a song's chunks overlaid on the global PCA
- animated trajectories through UMAP space (a "semantic timeline")
- radial maps of section distances
- comparison across songs or artists
Section-level embedding analysis enables several new angles:
- Song structure visualization
- Comparing how different artists structure choruses
- Detecting "genre hybrids" where sections differ drastically
- Studying production techniques (e.g., bright chorus vs dark verse)
- Creating playlists using only specific song sections
- Analyzing emotional arcs or dynamic shifts inside a track
This adds a rich new dimension to the project: not just what songs sound like, but how songs evolve over time.
Section-level embeddings extend the project from static, track-level representations to a much more dynamic and musicologically interesting space. By labeling and embedding each section independently, the pipeline can reveal how songs are structured, why they feel the way they do, and how different parts contribute to a song's identity.
While MERT provides a powerful representation of how a song sounds, it does not capture what a song is about—the meaning and sentiment carried by its lyrics. Adding a parallel pipeline for lyric embeddings enables an entirely new dimension of analysis and produces a richer, more complete semantic map of music.
The idea is to process each song's lyrics using a state-of-the-art text embedding model, such as:
- OpenAI text embedding models
- Sentence-BERT / MPNet
- Cohere embeddings
- Instructor embeddings
- Other transformer-based text encoders
These models produce dense vectors that encode:
- themes (love, heartbreak, rebellion, etc.)
- narrative style
- sentiment (positive/negative/bittersweet)
- imagery and metaphors
- language properties (abstract vs concrete, simple vs poetic)
Each song would therefore have two parallel embeddings:
- Audio embedding (MERT)
- Lyric embedding (text model)
Just like audio embeddings, lyric vectors can be visualized in 2D or 3D:
- Clusters of songs with similar themes
- "Sad" vs "happy" axes
- Pop vs hip-hop vs folk storytelling styles
- Genre-independent semantic groupings
This often produces a radically different map from the audio PCA, revealing how songs relate conceptually even when they sound nothing alike.
Once both embeddings exist, they can be combined in multiple ways:
- Concatenation: [audio || lyrics]
- Weighted sum: α * audio + β * lyrics
- Learned projection: small neural network mapping the two spaces into a shared space
This produces a multi-modal music embedding that captures both:
- What a song sounds like
- What a song means
Such hybrid embeddings can surface relationships invisible to single-modality models.
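The first two fusion options can be sketched directly. The dimensions and the α/β weights below are arbitrary examples; per-modality L2 normalization is applied first so neither modality dominates by scale.

```python
import numpy as np

def l2n(v: np.ndarray) -> np.ndarray:
    """Normalize a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

def concat_fusion(audio: np.ndarray, lyrics: np.ndarray) -> np.ndarray:
    """[audio || lyrics] after per-modality normalization."""
    return np.concatenate([l2n(audio), l2n(lyrics)])

def weighted_fusion(audio: np.ndarray, lyrics: np.ndarray,
                    alpha: float = 0.6, beta: float = 0.4) -> np.ndarray:
    """alpha * audio + beta * lyrics; requires equal dimensionality."""
    return alpha * l2n(audio) + beta * l2n(lyrics)

audio_vec = np.ones(768)    # stand-in for a MERT track vector
lyric_vec = np.ones(768)    # stand-in for a text-embedding vector
fused = concat_fusion(audio_vec, lyric_vec)
blended = weighted_fusion(audio_vec, lyric_vec)
```

The learned-projection option would instead train a small network mapping both spaces into a shared one, which needs paired training data.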
With dual embeddings, we can study fascinating patterns:
- Songs with happy sound + sad lyrics
- Genre clusters vs. thematic clusters
- Artists whose sound rarely matches their lyrical themes
- Decade-level trends in sound vs meaning
- Emotion contrast inside albums or genres
- Cross-cultural lyrical similarities for musically different songs
This opens up musicological research directions and uniquely rich visualizations.
The system could support:
- thematic playlist generation (e.g., "heartbreak songs", "uplifting songs")
- similarity search based on lyrics only
- identifying lyrical twins of sonically different tracks
- building maps colored by lyrical sentiment or topic distributions
These features complement audio-based maps and recommendations.
Adding lyric embeddings transforms the project into a multi-modal semantic analysis tool. With both sound-based and meaning-based vectors, you gain the ability to explore music from two orthogonal perspectives and combine them into a unified representation. This extension dramatically increases the interpretive power and creative applications of the system.
(Insert your chosen license here.)