nprime06/MERT-embedding

🎵 Music Embedding Pipeline

Semantic audio embeddings + PCA/UMAP visualization using MERT

This project builds a searchable, analyzable vector-space representation of popular music. Given a list of song titles, the system:

  1. Resolves each song into canonical metadata (artist, album, year, etc.)
  2. Obtains audio from a legal source
  3. Extracts semantic music embeddings using MERT-v1-95M
  4. Stores metadata and embeddings in a database (Postgres + pgvector recommended)
  5. Deletes audio files after embedding
  6. Provides tools for PCA/UMAP visualization and similarity search

The goal is to produce a "map of music" where similar songs cluster naturally based on genre, mood, timbre, instrumentation, and production style.


✨ Features

  • Music-specific semantic embeddings using MERT, an acoustic music understanding model trained with large-scale self-supervision
  • Robust audio ingestion → chunking → processing workflow
  • Metadata resolution via external music APIs (e.g., Spotify / MusicBrainz)
  • Clean, modular ETL design (ingest → process → embed → store → clean)
  • Vector database support using pgvector
  • Tools for PCA, UMAP, and similarity search
  • Track-level robustness: retries, status flags, resumable processing
  • Scales to thousands of songs on a small GPU, or CPU-only (slower)

📦 Project Structure

src/
  ingest/
    song_list_ingest.py        # Import initial list of songs into DB
    metadata_resolver.py       # Match titles to canonical track metadata
  audio/
    fetcher.py                 # Fetch audio from a legal source
    preprocess.py              # Load, resample, and chunk audio
  models/
    mert_embedder.py           # MERT embedding pipeline
  db/
    schema.sql                 # Track + embedding tables (Postgres/pgvector)
    repository.py              # Database interaction layer
  workers/
    process_track_worker.py    # Main track-processing workflow
  analysis/
    pca_umap.py                # Dimensionality reduction tools
    visualize.ipynb            # Notebook for plots & clustering
README.md

🎼 Data Flow Overview

Input: list of song titles
Output: (track_metadata, embedding_vector) saved in DB

Pipeline:

  1. Ingest: read titles → resolve canonical metadata → create DB rows
  2. Audio Fetch: retrieve audio from a legal source
  3. Chunk: split track into fixed-length segments (e.g., 10–15 sec)
  4. Embed: compute MERT embeddings per chunk → pool into one track vector
  5. Store: save vector + metadata → mark status = "done"
  6. Clean: delete audio file and temporary data
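The chunking step (3) can be sketched as a pure function over a resampled mono waveform. This is an illustrative sketch, not the project's actual `preprocess.py`: the `chunk_audio` name, the tail-dropping policy, and the 24 kHz rate (MERT-v1's expected sample rate) are all assumptions.

```python
import numpy as np

def chunk_audio(waveform: np.ndarray, sr: int, chunk_sec: float = 10.0) -> list[np.ndarray]:
    """Split a mono waveform into fixed-length segments, dropping any short tail."""
    chunk_len = int(sr * chunk_sec)
    n_chunks = len(waveform) // chunk_len
    return [waveform[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]

# Example: a 35-second signal at 24 kHz yields three 10-second chunks
# (the final 5 seconds are dropped).
sr = 24_000
audio = np.zeros(35 * sr, dtype=np.float32)
chunks = chunk_audio(audio, sr)
print(len(chunks), len(chunks[0]))  # 3 240000
```

In practice the waveform would come from `librosa.load(path, sr=24_000, mono=True)` before chunking.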

🧠 Embeddings (MERT-v1-95M)

MERT is a transformer trained on large music datasets. Its embeddings capture high-level musical properties such as:

  • Genre
  • Mood / emotional tone
  • Timbre & instrumentation
  • Production style
  • Acoustic vs electronic characteristics

MERT-v1-95M is chosen for:

  • Good semantic quality
  • Fast embedding
  • Low hardware requirements
  • Easy deployment (HuggingFace Transformers)
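A minimal embedding sketch via HuggingFace Transformers, following the `m-a-p/MERT-v1-95M` model card (which loads the model with `trust_remote_code=True` and a `Wav2Vec2FeatureExtractor`). The mean-pooling strategy over time and over chunks is one reasonable choice here, not the only one; the function names are illustrative.

```python
import numpy as np

def load_mert():
    """Load MERT-v1-95M from the HuggingFace Hub (downloads weights on first call)."""
    from transformers import AutoModel, Wav2Vec2FeatureExtractor
    model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)
    processor = Wav2Vec2FeatureExtractor.from_pretrained(
        "m-a-p/MERT-v1-95M", trust_remote_code=True
    )
    return model, processor

def embed_chunk(model, processor, chunk: np.ndarray, sr: int = 24_000) -> np.ndarray:
    """Return one 768-dim vector for a chunk by time-averaging the last hidden states."""
    import torch
    inputs = processor(chunk, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state.mean(dim=1).squeeze(0).numpy()

def pool_track(chunk_vectors: list[np.ndarray]) -> np.ndarray:
    """Pool per-chunk vectors into a single track-level vector (simple mean)."""
    return np.stack(chunk_vectors).mean(axis=0)
```

The 768-dim output matches the `VECTOR(768)` column in the schema below; averaging across chunks is the simplest track-level pooling, and weighted or attention-based pooling would slot in at `pool_track` without touching the rest of the pipeline.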

🗄️ Database Schema

Recommended backend: Postgres + pgvector.

Tracks Table

CREATE TABLE tracks (
    id SERIAL PRIMARY KEY,
    title TEXT,
    artist TEXT,
    album TEXT,
    release_year INT,
    genres TEXT[],
    isrc TEXT,
    source_id TEXT,
    duration_ms INT,
    embedding VECTOR(768),
    status TEXT DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
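With the `embedding` column in place, pgvector's distance operators support similarity search directly in SQL (`<->` is Euclidean distance, `<=>` is cosine distance). A sketch of a nearest-neighbor query against this schema; the seed track id is arbitrary:

```sql
-- Ten nearest tracks to track 1's embedding, by L2 distance.
SELECT t.id, t.title, t.artist,
       t.embedding <-> q.embedding AS distance
FROM tracks t,
     (SELECT embedding FROM tracks WHERE id = 1) AS q
WHERE t.status = 'done' AND t.id <> 1
ORDER BY distance
LIMIT 10;
```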

⚙️ Installation

git clone <your_repo_url>
cd music-embedding
pip install -r requirements.txt

Requirements include:

  • torch
  • torchaudio
  • transformers
  • librosa
  • psycopg2 or asyncpg
  • pgvector (server extension)
  • numpy, pandas
  • umap-learn (optional)

▶️ Usage

1. Import song list

python src/ingest/song_list_ingest.py data/songs.csv

CSV format:

title,artist
Blinding Lights,The Weeknd
Shape of You,Ed Sheeran
...

2. Resolve metadata

python src/ingest/metadata_resolver.py

3. Process songs (the embedding pipeline)

python src/workers/process_track_worker.py

This worker:

  • Fetches audio
  • Chunks it
  • Runs MERT
  • Stores embeddings + metadata
  • Deletes audio

4. Run PCA or UMAP

python src/analysis/pca_umap.py

Or use the notebook:

jupyter notebook src/analysis/visualize.ipynb

📊 Visualizations

This project includes tools to produce:

  • 2D PCA plots
  • UMAP embedding maps
  • Clustering diagrams
  • Genre or artist-overlaid scatterplots

These help validate that the embedding space captures meaningful musical structure.
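The 2D PCA projection can be sketched in plain NumPy (scikit-learn's `PCA` or `umap-learn` would be the usual tools in the notebook; the embedding matrix below is random stand-in data, where in practice it would be loaded from the `tracks` table):

```python
import numpy as np

def pca_2d(X: np.ndarray) -> np.ndarray:
    """Project rows of X onto their top two principal components."""
    Xc = X - X.mean(axis=0)
    # Right singular vectors of the centered matrix are the principal axes.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))  # stand-in for 200 stored track vectors
coords = pca_2d(embeddings)               # (200, 2) points for a scatterplot
print(coords.shape)  # (200, 2)
```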


⚖️ Important Notes on Audio Sources

You must ensure you have the legal right to download, store, or process audio. This project does not include code to download copyrighted audio from YouTube or other services in violation of their Terms of Service.

Safe options include:

  • Legally purchased files
  • Public-domain audio
  • Licensed datasets
  • Short previews (if allowed under applicable API terms)

🚀 Scaling

This architecture supports incremental scaling through:

  • Job queues (Redis, RQ, Celery, etc.)
  • Parallel workers
  • Docker-based batch jobs
  • Cloud GPU instances (optional; not required for a 3k–5k track catalog)

The system is designed to be modular so you can swap:

  • Audio sources
  • Embedding models
  • Metadata providers
  • Database backends

without rewriting core logic.


🧩 Future Extensions

  • Add MuLan or larger MERT models for richer embeddings
  • Integrate pgvector similarity search (kNN neighbors)
  • Build an interactive music map frontend (Plotly, Streamlit, or web)
  • Analyze playlists, charts, decades, genres
  • Build a recommendation engine using nearest neighbors

🔮 Future Work: Section-Level Embeddings (Intro / Verse / Chorus / Bridge / Outro)

One natural extension of this project is to go beyond track-level embeddings and instead analyze music at the level of song sections. Because the pipeline already processes each track in 10-second chunks, we can attach additional structure and semantics to these chunks, enabling several powerful analytical and visualization capabilities.

🧩 1. Automatic or semi-automatic section labeling

Each chunk can be labeled as:

  • Intro
  • Verse
  • Chorus
  • Bridge
  • Outro
  • (Optional: "Pre-chorus", "Drop", etc.)

Section identification could be implemented using:

  • simple heuristics (e.g., first 20–30 seconds = intro)
  • audio pattern analysis
  • supervised models trained for structure segmentation
  • alignment to externally provided timestamps (e.g., if metadata exists)

These labels allow the database to store both entire-track embeddings and section-specific embeddings.
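A toy version of the first option, using position-based heuristics only. The function name, thresholds, and the catch-all "body" label (standing in for verse/chorus, which would need real structure analysis) are illustrative assumptions:

```python
def label_chunks(n_chunks: int, chunk_sec: float = 10.0,
                 intro_sec: float = 30.0, outro_sec: float = 30.0) -> list[str]:
    """Assign coarse section labels to fixed-length chunks by position alone."""
    labels = []
    total = n_chunks * chunk_sec
    for i in range(n_chunks):
        start = i * chunk_sec
        if start < intro_sec:
            labels.append("intro")
        elif start >= total - outro_sec:
            labels.append("outro")
        else:
            labels.append("body")  # verse/chorus require real segmentation
    return labels

# A 100-second track in 10-second chunks: 3 intro, 4 body, 3 outro labels.
print(label_chunks(10))
```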

🎼 2. Section-specific PCA/UMAP

With section labels, we can generate specialized semantic maps:

  • Chorus-only PCA
  • Verse-only PCA
  • Intro-only embeddings
  • Bridge clustering maps

Since choruses tend to encode a song's "core identity," analyzing only these chunks might produce even cleaner genre and mood clusters than full-track embeddings.

This unlocks experiments like:

  • Comparing verse vs. chorus semantic spaces
  • Seeing whether choruses cluster more tightly across genres
  • Studying how intros set up (or mislead) listener expectations

🔍 3. "Click-through" song inspection

For any song, we can visualize where each of its chunks sits in embedding space.

Example:

  • Chorus chunks cluster tightly together
  • Verse chunks form their own local neighborhood
  • Bridges may sit farther away if the musical texture changes
  • Outros may drift outwards due to reduced instrumentation or fading energy

This produces a micro-map for each song, showing how different sections relate to the overall music space and to each other.

Potential interfaces include:

  • scatterplots of a song's chunks overlaid on the global PCA
  • animated trajectories through UMAP space (a "semantic timeline")
  • radial maps of section distances
  • comparison across songs or artists

🎨 4. Applications & insights

Section-level embedding analysis enables several new angles:

  • Song structure visualization
  • Comparing how different artists structure choruses
  • Detecting "genre hybrids" where sections differ drastically
  • Studying production techniques (e.g., bright chorus vs dark verse)
  • Creating playlists using only specific song sections
  • Analyzing emotional arcs or dynamic shifts inside a track

This adds a rich new dimension to the project: not just what songs sound like, but how songs evolve over time.

🚀 Summary

Section-level embeddings extend the project from static, track-level representations to a much more dynamic and musicologically interesting space. By labeling and embedding each section independently, the pipeline can reveal how songs are structured, why they feel the way they do, and how different parts contribute to a song's identity.


🔮 Future Work: Lyric Embeddings & Semantic Meaning Mapping

While MERT provides a powerful representation of how a song sounds, it does not capture what a song is about—the meaning and sentiment carried by its lyrics. Adding a parallel pipeline for lyric embeddings enables an entirely new dimension of analysis and produces a richer, more complete semantic map of music.

✍️ 1. Lyric embeddings (text-based semantic vectors)

The idea is to process each song's lyrics using a state-of-the-art text embedding model, such as:

  • OpenAI text embedding models
  • Sentence-BERT / MPNet
  • Cohere embeddings
  • Instructor embeddings
  • Other transformer-based text encoders

These models produce dense vectors that encode:

  • themes (love, heartbreak, rebellion, etc.)
  • narrative style
  • sentiment (positive/negative/bittersweet)
  • imagery and metaphors
  • language properties (abstract vs concrete, simple vs poetic)

Each song would therefore have two parallel embeddings:

  1. Audio embedding (MERT)
  2. Lyric embedding (text model)
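A sketch of the text side, assuming the `sentence-transformers` package and its `all-mpnet-base-v2` model (one of the Sentence-BERT / MPNet options above); the cosine helper is plain NumPy and works on audio vectors too:

```python
import numpy as np

def embed_lyrics(lyrics: list[str]) -> np.ndarray:
    """Encode lyric texts into dense vectors (downloads the model on first call)."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-mpnet-base-v2")
    return model.encode(lyrics, normalize_embeddings=True)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cosine_similarity(np.array([1.0, 0.0]),
                              np.array([1.0, 1.0])), 3))  # 0.707
```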

🧭 2. Lyric-only PCA / UMAP

Just like audio embeddings, lyric vectors can be visualized in 2D or 3D:

  • Clusters of songs with similar themes
  • "Sad" vs "happy" axes
  • Pop vs hip-hop vs folk storytelling styles
  • Genre-independent semantic groupings

This often produces a radically different map from the audio PCA, revealing how songs relate conceptually even when they sound nothing alike.

🔗 3. Combining audio + lyric embeddings

Once both embeddings exist, they can be combined in multiple ways:

  • Concatenation: [audio || lyrics]
  • Weighted sum: α * audio + β * lyrics
  • Learned projection: small neural network mapping the two spaces into a shared space

This produces a multi-modal music embedding that captures both:

  • What a song sounds like
  • What a song means

Such hybrid embeddings can surface relationships invisible to single-modality models.
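The first two combination strategies can be sketched directly in NumPy. L2-normalizing each modality first (so neither dominates on scale alone) is an assumed design choice here, and α/β are free parameters to tune:

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

def concat_embedding(audio: np.ndarray, lyrics: np.ndarray) -> np.ndarray:
    """[audio || lyrics]: dimensions add (e.g. 768 + 768 = 1536)."""
    return np.concatenate([normalize(audio), normalize(lyrics)])

def weighted_embedding(audio: np.ndarray, lyrics: np.ndarray,
                       alpha: float = 0.5, beta: float = 0.5) -> np.ndarray:
    """alpha * audio + beta * lyrics (requires equal dimensionality)."""
    return alpha * normalize(audio) + beta * normalize(lyrics)

a = np.ones(768)   # stand-in audio vector
l = np.ones(768)   # stand-in lyric vector
print(concat_embedding(a, l).shape)    # (1536,)
print(weighted_embedding(a, l).shape)  # (768,)
```

Note that concatenation changes the vector dimensionality, so a combined index would need a separate `VECTOR(1536)` column, whereas the weighted sum stays compatible with the existing `VECTOR(768)` schema.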

🔍 4. Audio–lyric interplay analysis

With dual embeddings, we can study fascinating patterns:

  • Songs with happy sound + sad lyrics
  • Genre clusters vs. thematic clusters
  • Artists whose sound rarely matches their lyrical themes
  • Decade-level trends in sound vs meaning
  • Emotion contrast inside albums or genres
  • Cross-cultural lyrical similarities for musically different songs

This opens up musicological research directions and uniquely rich visualizations.

🎧 5. Lyric-based filtering, clustering, and search

The system could support:

  • thematic playlist generation (e.g., "heartbreak songs", "uplifting songs")
  • similarity search based on lyrics only
  • identifying lyrical twins of sonically different tracks
  • building maps colored by lyrical sentiment or topic distributions

These features complement audio-based maps and recommendations.

🚀 Summary

Adding lyric embeddings transforms the project into a multi-modal semantic analysis tool. With both sound-based and meaning-based vectors, you gain the ability to explore music from two orthogonal perspectives and combine them into a unified representation. This extension dramatically increases the interpretive power and creative applications of the system.


📄 License

(Insert your chosen license here.)
