Semantic audio embeddings + PCA/UMAP visualization using MERT
This project builds a searchable, analyzable vector-space representation of popular music. Given a list of song titles, the system:
- Resolves each song into canonical metadata (artist, album, year, etc.)
- Obtains audio from a legal source
- Extracts semantic music embeddings using MERT-v1-95M
- Stores metadata and embeddings in a database (Postgres + pgvector recommended)
- Deletes audio files after embedding
- Provides tools for PCA/UMAP visualization and similarity search
The goal is to produce a "map of music" where similar songs cluster naturally based on genre, mood, timbre, instrumentation, and production style.
Key features:
- Music-specific semantic embeddings using MERT (Music Encoder Representations from Transformers)
- Robust audio ingestion → chunking → processing workflow
- Metadata resolution via external music APIs (e.g., Spotify / MusicBrainz)
- Clean, modular ETL design (ingest → process → embed → store → clean)
- Vector database support using pgvector
- Tools for PCA, UMAP, and similarity search
- Track-level robustness: retries, status flags, resumable processing
- Scales to thousands of songs on a small GPU, or CPU-only (slower)
```
src/
  ingest/
    song_list_ingest.py       # Import initial list of songs into DB
    metadata_resolver.py      # Match titles to canonical track metadata
  audio/
    fetcher.py                # Fetch audio from a legal source
    preprocess.py             # Load, resample, and chunk audio
  models/
    mert_embedder.py          # MERT embedding pipeline
  db/
    schema.sql                # Track + embedding tables (Postgres/pgvector)
    repository.py             # Database interaction layer
  workers/
    process_track_worker.py   # Main track-processing workflow
  analysis/
    pca_umap.py               # Dimensionality reduction tools
    visualize.ipynb           # Notebook for plots & clustering
README.md
```
Input: list of song titles
Output: (track_metadata, embedding_vector) saved in DB
Pipeline:
- Ingest: read titles → resolve canonical metadata → create DB rows
- Audio Fetch: retrieve audio from a legal source
- Chunk: split track into fixed-length segments (e.g., 10–15 sec)
- Embed: compute MERT embeddings per chunk → pool into one track vector
- Store: save vector + metadata → mark status = "done"
- Clean: delete audio file and temporary data
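The chunk-and-pool steps of this pipeline can be sketched as follows. This is a minimal outline, not the repo's actual code: the 12 s chunk length and the fake per-chunk vectors are illustrative stand-ins, and MERT-v1 models consume 24 kHz mono audio.

```python
import numpy as np

CHUNK_SEC = 12          # within the 10-15 s range suggested above
SAMPLE_RATE = 24_000    # MERT-v1-95M expects 24 kHz mono audio

def chunk_waveform(wave: np.ndarray, sr: int = SAMPLE_RATE,
                   chunk_sec: int = CHUNK_SEC) -> list[np.ndarray]:
    """Split a mono waveform into fixed-length chunks, dropping the tail."""
    n = sr * chunk_sec
    return [wave[i:i + n] for i in range(0, len(wave) - n + 1, n)]

def pool_embeddings(chunk_vecs: list[np.ndarray]) -> np.ndarray:
    """Mean-pool per-chunk vectors into a single track-level vector."""
    return np.stack(chunk_vecs).mean(axis=0)

# Toy run: 30 s of silence -> two full 12 s chunks -> one 768-d track vector
wave = np.zeros(30 * SAMPLE_RATE, dtype=np.float32)
chunks = chunk_waveform(wave)
chunk_vecs = [np.full(768, float(i)) for i, _ in enumerate(chunks)]
track_vec = pool_embeddings(chunk_vecs)
```

Mean pooling is the simplest choice here; max pooling or attention-weighted pooling are common alternatives.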
MERT is a transformer trained on large music datasets. Its embeddings capture high-level musical properties such as:
- Genre
- Mood / emotional tone
- Timbre & instrumentation
- Production style
- Acoustic vs electronic characteristics
MERT-v1-95M is chosen for:
- Good semantic quality
- Fast embedding
- Low hardware requirements
- Easy deployment (HuggingFace Transformers)
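A sketch of chunk embedding with HuggingFace Transformers. The `embed_chunk` body shows the model call for orientation (it needs `torch` and `transformers` installed and downloads the checkpoint on first use); only the pure layer/time pooling helper is exercised with fake layer outputs below.

```python
import numpy as np

def pool_hidden_states(hidden_states: list[np.ndarray]) -> np.ndarray:
    """Average a stack of (time, dim) layer outputs over layers and time,
    yielding one fixed-size vector per chunk."""
    return np.stack(hidden_states).mean(axis=(0, 1))

def embed_chunk(wave_24k: np.ndarray) -> np.ndarray:
    """Sketch of the actual MERT call (not run here)."""
    import torch
    from transformers import AutoModel, Wav2Vec2FeatureExtractor
    model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M",
                                      trust_remote_code=True)
    processor = Wav2Vec2FeatureExtractor.from_pretrained(
        "m-a-p/MERT-v1-95M", trust_remote_code=True)
    inputs = processor(wave_24k, sampling_rate=24_000, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    layers = [h.squeeze(0).numpy() for h in out.hidden_states]
    return pool_hidden_states(layers)

# Offline check of the pooling step with 13 fake layer outputs (50 frames each)
fake_layers = [np.full((50, 768), float(i)) for i in range(13)]
vec = pool_hidden_states(fake_layers)
```

Averaging across all hidden layers is one reasonable default; the MERT authors also discuss per-layer or learned-weight combinations.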
Recommended backend: Postgres + pgvector.
```sql
CREATE TABLE tracks (
    id SERIAL PRIMARY KEY,
    title TEXT,
    artist TEXT,
    album TEXT,
    release_year INT,
    genres TEXT[],
    isrc TEXT,
    source_id TEXT,
    duration_ms INT,
    embedding VECTOR(768),
    status TEXT DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
```

Setup:

```bash
git clone <your_repo_url>
cd music-embedding
pip install -r requirements.txt
```

Requirements include:
- torch
- torchaudio
- transformers
- librosa
- psycopg2 or asyncpg
- pgvector (server extension)
- numpy, pandas
- umap-learn (optional)
```bash
python src/ingest/song_list_ingest.py data/songs.csv
```

CSV format:

```csv
title,artist
Blinding Lights,The Weeknd
Shape of You,Ed Sheeran
...
```

Resolve metadata:

```bash
python src/ingest/metadata_resolver.py
```

Then run the main worker:

```bash
python src/workers/process_track_worker.py
```

This:
- Fetches audio
- Chunks it
- Runs MERT
- Stores embeddings + metadata
- Deletes audio
```bash
python src/analysis/pca_umap.py
```

Or use the notebook:

```bash
jupyter notebook src/analysis/visualize.ipynb
```

This project includes tools to produce:
- 2D PCA plots
- UMAP embedding maps
- Clustering diagrams
- Genre or artist-overlaid scatterplots
These help validate that the embedding space captures meaningful musical structure.
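As an illustration, a dependency-light 2-D PCA over a matrix of track embeddings, using a plain NumPy SVD rather than scikit-learn (the random matrix stands in for real MERT vectors):

```python
import numpy as np

def pca_2d(X: np.ndarray) -> np.ndarray:
    """Project rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)                      # center each dimension
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                         # (n_tracks, 2) coordinates

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))         # stand-in for MERT vectors
coords = pca_2d(embeddings)
```

`coords` can be scattered directly with matplotlib or Plotly, colored by genre or artist.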
You must ensure you have the legal right to download, store, or process audio. This project does not include code to download copyrighted audio from YouTube or other services in violation of their Terms of Service.
Safe options include:
- Legally purchased files
- Public-domain audio
- Licensed datasets
- Short previews (if allowed under applicable API terms)
This architecture supports incremental scaling through:
- Job queues (Redis, RQ, Celery, etc.)
- Parallel workers
- Docker-based batch jobs
- Cloud GPU instances (optional; not required for a 3k–5k track catalog)
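A toy sketch of the parallel-worker idea using a thread pool; `process_track` here is a hypothetical placeholder for the real fetch → chunk → embed → store → clean routine:

```python
from concurrent.futures import ThreadPoolExecutor

def process_track(track_id: int) -> tuple[int, str]:
    """Placeholder worker body. The real one would flip the DB status flag
    ('pending' -> 'processing' -> 'done') so interrupted runs are resumable."""
    return track_id, "done"

# Process ten track IDs with four concurrent workers
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_track, range(10)))
```

For GPU-bound embedding, separate processes (or a Redis-backed queue like RQ or Celery) would replace threads, since the model call dominates each job.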
The system is designed to be modular so you can swap:
- Audio sources
- Embedding models
- Metadata providers
- Database backends
without rewriting core logic.
Possible extensions:
- Add MuLan or larger MERT models for richer embeddings
- Integrate pgvector similarity search (kNN neighbors)
- Build an interactive music map frontend (Plotly, Streamlit, or web)
- Analyze playlists, charts, decades, genres
- Build a recommendation engine using nearest neighbors
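A brute-force cosine nearest-neighbour lookup illustrating the recommendation idea; in production the same query would run inside Postgres via pgvector's distance operators (e.g. `ORDER BY embedding <=> $1 LIMIT k`). The four orthogonal "tracks" are toy data.

```python
import numpy as np

def nearest_tracks(query: np.ndarray, catalog: np.ndarray,
                   k: int = 3) -> np.ndarray:
    """Return indices of the k most cosine-similar rows of `catalog`."""
    q = query / np.linalg.norm(query)
    c = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]

catalog = np.eye(4)                         # four orthogonal toy "tracks"
neighbours = nearest_tracks(catalog[2] + 0.1, catalog, k=2)
```
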
One natural extension of this project is to go beyond track-level embeddings and analyze music at the level of song sections. Because the pipeline already processes each track in short fixed-length chunks (e.g., 10–15 s), we can attach additional structure and semantics to these chunks, enabling several powerful analytical and visualization capabilities.
Each chunk can be labeled as:
- Intro
- Verse
- Chorus
- Bridge
- Outro
- (Optional: "Pre-chorus", "Drop", etc.)
Section identification could be implemented using:
- simple heuristics (e.g., first 20–30 seconds = intro)
- audio pattern analysis
- supervised models trained for structure segmentation
- alignment to externally provided timestamps (e.g., if metadata exists)
These labels allow the database to store both entire-track embeddings and section-specific embeddings.
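The simple positional heuristic could look like this. It is purely illustrative (first chunk = intro, last = outro, interior alternating verse/chorus); a real deployment would use one of the segmentation approaches listed above.

```python
def heuristic_labels(n_chunks: int) -> list[str]:
    """Crude positional labeling: a stand-in for a structure-segmentation
    model, usable only as a first baseline."""
    if n_chunks <= 1:
        return ["intro"][:n_chunks]
    labels = ["intro"]
    for i in range(1, n_chunks - 1):
        labels.append("verse" if i % 2 == 1 else "chorus")
    labels.append("outro")
    return labels

labels = heuristic_labels(5)
```
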
With section labels, we can generate specialized semantic maps:
- Chorus-only PCA
- Verse-only PCA
- Intro-only embeddings
- Bridge clustering maps
Since choruses tend to encode a song's "core identity," analyzing only these chunks might produce even cleaner genre and mood clusters than full-track embeddings.
This unlocks experiments like:
- Comparing verse vs. chorus semantic spaces
- Seeing whether choruses cluster more tightly across genres
- Studying how intros set up (or mislead) listener expectations
For any song, we can visualize where each of its chunks sits in embedding space.
Example:
- Chorus chunks cluster tightly together
- Verse chunks form their own local neighborhood
- Bridges may sit farther away if the musical texture changes
- Outros may drift outwards due to reduced instrumentation or fading energy
This produces a micro-map for each song, showing how different sections relate to the overall music space and to each other.
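One way to quantify such a micro-map: compute a centroid per section label and measure how far each section sits from the whole-track centroid. The three-dimensional toy vectors below are illustrative only.

```python
import numpy as np

def section_distances(chunk_vecs: list[np.ndarray],
                      labels: list[str]) -> dict[str, float]:
    """Distance of each section's centroid from the overall track centroid."""
    vecs = np.stack(chunk_vecs)
    track_centroid = vecs.mean(axis=0)
    dists = {}
    for lab in set(labels):
        idx = [i for i, l in enumerate(labels) if l == lab]
        centroid = vecs[idx].mean(axis=0)
        dists[lab] = float(np.linalg.norm(centroid - track_centroid))
    return dists

# Toy track: two identical chorus chunks plus one far-away bridge chunk
chunks = [np.zeros(3), np.zeros(3), np.array([3.0, 0.0, 0.0])]
dists = section_distances(chunks, ["chorus", "chorus", "bridge"])
```

Here the bridge lands farther from the track centroid than the chorus, matching the intuition that texture changes push sections outward.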
Potential interfaces include:
- scatterplots of a song's chunks overlaid on the global PCA
- animated trajectories through UMAP space (a "semantic timeline")
- radial maps of section distances
- comparison across songs or artists
Section-level embedding analysis enables several new angles:
- Song structure visualization
- Comparing how different artists structure choruses
- Detecting "genre hybrids" where sections differ drastically
- Studying production techniques (e.g., bright chorus vs dark verse)
- Creating playlists using only specific song sections
- Analyzing emotional arcs or dynamic shifts inside a track
This adds a rich new dimension to the project: not just what songs sound like, but how songs evolve over time.
Section-level embeddings extend the project from static, track-level representations to a much more dynamic and musicologically interesting space. By labeling and embedding each section independently, the pipeline can reveal how songs are structured, why they feel the way they do, and how different parts contribute to a song's identity.
While MERT provides a powerful representation of how a song sounds, it does not capture what a song is about—the meaning and sentiment carried by its lyrics. Adding a parallel pipeline for lyric embeddings enables an entirely new dimension of analysis and produces a richer, more complete semantic map of music.
The idea is to process each song's lyrics using a state-of-the-art text embedding model, such as:
- OpenAI text embedding models
- Sentence-BERT / MPNet
- Cohere embeddings
- Instructor embeddings
- Other transformer-based text encoders
These models produce dense vectors that encode:
- themes (love, heartbreak, rebellion, etc.)
- narrative style
- sentiment (positive/negative/bittersweet)
- imagery and metaphors
- language properties (abstract vs concrete, simple vs poetic)
Each song would therefore have two parallel embeddings:
- Audio embedding (MERT)
- Lyric embedding (text model)
Just like audio embeddings, lyric vectors can be visualized in 2D or 3D:
- Clusters of songs with similar themes
- "Sad" vs "happy" axes
- Pop vs hip-hop vs folk storytelling styles
- Genre-independent semantic groupings
This often produces a radically different map from the audio PCA, revealing how songs relate conceptually even when they sound nothing alike.
Once both embeddings exist, they can be combined in multiple ways:
- Concatenation: [audio || lyrics]
- Weighted sum: α * audio + β * lyrics
- Learned projection: small neural network mapping the two spaces into a shared space
This produces a multi-modal music embedding that captures both:
- What a song sounds like
- What a song means
Such hybrid embeddings can surface relationships invisible to single-modality models.
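The first two fusion options can be sketched directly. The dimensions and the α/β weights below are arbitrary examples; per-modality L2 normalization is applied first so neither modality dominates by scale.

```python
import numpy as np

def l2n(v: np.ndarray) -> np.ndarray:
    """Normalize a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

def concat_fusion(audio: np.ndarray, lyrics: np.ndarray) -> np.ndarray:
    """[audio || lyrics] after per-modality normalization."""
    return np.concatenate([l2n(audio), l2n(lyrics)])

def weighted_fusion(audio: np.ndarray, lyrics: np.ndarray,
                    alpha: float = 0.6, beta: float = 0.4) -> np.ndarray:
    """alpha * audio + beta * lyrics; requires equal dimensionality."""
    return alpha * l2n(audio) + beta * l2n(lyrics)

audio_vec = np.ones(768)    # stand-in for a MERT track vector
lyric_vec = np.ones(768)    # stand-in for a text-embedding vector
fused = concat_fusion(audio_vec, lyric_vec)
blended = weighted_fusion(audio_vec, lyric_vec)
```

The learned-projection option would instead train a small network mapping both spaces into a shared one, which needs paired training data.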
With dual embeddings, we can study fascinating patterns:
- Songs with happy sound + sad lyrics
- Genre clusters vs. thematic clusters
- Artists whose sound rarely matches their lyrical themes
- Decade-level trends in sound vs meaning
- Emotion contrast inside albums or genres
- Cross-cultural lyrical similarities for musically different songs
This opens up musicological research directions and uniquely rich visualizations.
The system could support:
- thematic playlist generation (e.g., "heartbreak songs", "uplifting songs")
- similarity search based on lyrics only
- identifying lyrical twins of sonically different tracks
- building maps colored by lyrical sentiment or topic distributions
These features complement audio-based maps and recommendations.
Adding lyric embeddings transforms the project into a multi-modal semantic analysis tool. With both sound-based and meaning-based vectors, you gain the ability to explore music from two orthogonal perspectives and combine them into a unified representation. This extension dramatically increases the interpretive power and creative applications of the system.
(Insert your chosen license here.)