Skip to content

mjacobs/jamz

Repository files navigation

Jamz

Personal music listening history database consolidating Spotify and Last.fm data into AlloyDB Omni (PostgreSQL) for analysis.

Data Sources

  • Spotify: Extended streaming history export (2012-2020, ~89K plays)
  • Last.fm: Scrobble archive (2007-2016, ~31K scrobbles)

Setup

# Start AlloyDB Omni container
docker compose up -d

# Jamz maps host port 55432 -> container 5432
export JAMZ_DB_PORT=55432

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Optional: load DB connection defaults
cp .env.example .env
set -a && source .env && set +a

Usage

Import

# Run full import (drops and recreates all tables)
python -m scripts_pg.main --fresh

# Import only one source
python -m scripts_pg.main --fresh --spotify-only
python -m scripts_pg.main --fresh --lastfm-only

# Skip duplicate detection
python -m scripts_pg.main --fresh --skip-dedup

# Use different dedup strategy
python -m scripts_pg.main --fresh --dedup-strategy keep_lastfm

Query

# Fuzzy text search (pg_trgm)
python -m scripts_pg.query search "radiohead"
python -m scripts_pg.query search "radiohead" --type artist
python -m scripts_pg.query search "kid a" --type album --min-similarity 0.3

# Generate vector embeddings (run once, or --refresh to regenerate)
python -m scripts_pg.query embed
python -m scripts_pg.query embed --type artist

# Semantic similarity search (requires embeddings)
python -m scripts_pg.query similar "shoegaze"
python -m scripts_pg.query similar "90s alternative" --type artist

API

python -m uvicorn scripts_pg.api:app --reload --port 9824

Web Dashboard

cd web
npm install
npm run dev

AI / Embeddings Demo

Generate embeddings and run semantic similarity search using pgvector:

uv run python -m scripts_pg.query embed
uv run python -m scripts_pg.query similar "shoegaze"
uv run python -m scripts_pg.query similar "90s alternative" --type artist
uv run python -m scripts_pg.query similar "rainy morning piano" --type track

Architecture & Local Setup

Jamz uses a simple split between a FastAPI backend and a React + Vite frontend:

  • API server (FastAPI): serves JSON endpoints for stats, search, trends, and plays.
  • Web dashboard (Vite + React): loads data from the API and renders charts/tables.

Local ports (kept uncommon to avoid conflicts):

  • Frontend dev server: http://localhost:9823
  • API server: http://localhost:9824

The frontend reads the API base URL from web/.env (VITE_API_BASE_URL). The API enables CORS for http://localhost:9823 so the browser can call it.

Troubleshooting

  • CORS errors: ensure the API is running on port 9824 and restart it after changes.
  • 500 errors on trends/top lists: check the API console traceback; most issues are SQL errors.
  • Recent plays show but charts do not: charts use /trends/* endpoints; verify them with SQL below.
  • Ports already in use: change the API port in the uvicorn command and update web/.env.

Direct SQL

docker exec -it jamz-db psql -U postgres -d jamz
-- Top artists
SELECT artist_name, total_plays, total_minutes_played
FROM v_artist_stats
ORDER BY total_plays DESC
LIMIT 20;

-- Plays by year
SELECT EXTRACT(YEAR FROM played_at)::int AS year, source, COUNT(*)
FROM plays
WHERE NOT is_duplicate
GROUP BY year, source
ORDER BY year;

-- Most played tracks
SELECT artist_name, track_title, COUNT(*) as plays
FROM v_plays_full
WHERE NOT is_duplicate
GROUP BY artist_name, track_title
ORDER BY plays DESC
LIMIT 20;

-- Plays by hour (trend chart validation)
SELECT EXTRACT(HOUR FROM played_at)::text as time_unit, COUNT(*) as plays
FROM v_plays_full
WHERE NOT is_duplicate
GROUP BY EXTRACT(HOUR FROM played_at)
ORDER BY EXTRACT(HOUR FROM played_at);

Schema

artists          # Dimension: artist entities
albums           # Dimension: album entities
tracks           # Dimension: track entities
artist_aliases   # Mapping: normalized name -> artist_id
plays            # Fact: individual play events
duplicate_pairs  # Tracking: detected duplicates
embeddings       # Vector embeddings for semantic search
musicbrainz_tags # MusicBrainz tag enrichment

v_plays_full     # View: denormalized plays with artist/track/album
v_artist_stats   # View: aggregated artist statistics

Data Directory Structure

data/
├── spotify/
│   └── Streaming_History_Audio_*.json
└── lastfm/
    └── <username>/
        └── YYYY/MM/DD/*.gz

Entity Resolution

Tracks and artists are matched using normalized names:

  • Lowercase, strip accents
  • Remove remaster/deluxe/live suffixes
  • Handle "The X" vs "X" for artists

Duplicate Detection

Plays in the overlap period (2012-2016) are checked for duplicates based on:

  • Same track (post-entity-resolution)
  • Time proximity: <60s (95% confidence), <180s (85%), <300s (70%)

Default strategy keeps Spotify plays (richer metadata) and marks Last.fm as duplicates.

MusicBrainz Enrichment

Jamz can fetch MusicBrainz tags (genres) and store them in the database to enrich embeddings. This lets semantic search use more than just names.

  1. Set a user agent (required by MusicBrainz):
export JAMZ_MB_USER_AGENT="Jamz/0.1 (you@example.com)"
  1. Fetch tags for entities that already have MusicBrainz IDs:
uv run python -m scripts_pg.musicbrainz --type artist --limit 200
uv run python -m scripts_pg.musicbrainz --type album --limit 200
uv run python -m scripts_pg.musicbrainz --type track --limit 200

2b) Backfill MusicBrainz IDs when artists/albums are missing them:

uv run python -m scripts_pg.musicbrainz_backfill --type artist --limit 200
uv run python -m scripts_pg.musicbrainz_backfill --type album --limit 200
  1. See coverage of MusicBrainz IDs and tags:
uv run python -m scripts_pg.musicbrainz --report
uv run python -m scripts_pg.musicbrainz --report --report-csv musicbrainz-coverage.csv
  1. Regenerate embeddings so tags are included in the text input:
uv run python -m scripts_pg.query embed --refresh

Notes:

  • If this is an existing database, run psql -f scripts_pg/schema.sql once to add the new tables. For Docker, use: docker exec -i jamz-db psql -U postgres -d jamz < scripts_pg/schema.sql
  • The MusicBrainz API is rate-limited. The fetcher sleeps between requests.
  • Albums use the MusicBrainz "release-group" endpoint; tracks use "recording".
  • Backfill uses MusicBrainz search and a minimum score threshold (default: 90). Tune with --min-score.

About

Personal music listening history database consolidating Spotify and Last.fm data into AlloyDB Omni (PostgreSQL) for analysis.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors