feat(2.4.0): community ad patterns + 49-tag vocabulary + authoritative sponsor seed #224
Merged
…e sponsor seed

Adds crowdsourced ad pattern sharing for MinusPod. Patterns can be shared via patterns/community/ in the repo, auto-pulled on a configurable cron, and filtered against each podcast by a shared tag vocabulary so a pattern for "Squarespace" never enters the matching loop on a podcast tagged only kids_family. Reviewer-time bound adjustments now feed back into pattern text on a configurable threshold. A GitHub Action validates community PRs against the same gates and a three-tier dedupe before merge.

Why now: today patterns are local-only and grow per-instance; users with low pattern counts get poor coverage. The schema, matcher, and reviewer paths already existed but assumed single-instance ownership. This makes patterns shareable without sacrificing the per-user customizations the reviewer flow already accumulates.

Schema (migration is additive + idempotent):
- ad_patterns: source, community_id, version, submitted_app_version, protected_from_sync
- known_sponsors: tags (JSON array)
- podcasts: tags, user_tags
- episodes: tags
- New indexes idx_patterns_source + idx_patterns_community_id
- Sponsor reseed from src/seed_data/sponsors_final.csv (255 entries, authoritative). Preserves FKs by UPDATE-by-name; soft-deletes orphans (is_active=0)

Backend:
- TextPatternMatcher: tag-eligibility filter applied only to community patterns
- PatternService.rewrite_pattern_from_bounds + import_community_pattern
- community_export.py: quality gates, PII strip, sponsor classification, prefilled GitHub PR URL with 7KB fallback
- community_sync.py: cron-driven manifest fetch + apply (INSERT/UPDATE/DELETE respecting protected_from_sync)
- tools/community_pattern_validator.py: CLI + library for CI validation
- RSS itunes:category parsing wired into refresh path
- New /api/v1/ endpoints for bulk ops, submit-to-community, protect/unprotect, feed tags, sponsor tags, reviewer settings, community sync, sync status

Frontend:
- PatternsPage: source filter, Import/Export header, Submit-to-community + Protect-from-sync row actions, community badge, last-synced indicator
- TagChips, CommunityBadge, PatternImportDialog components
- AdReviewerSection + CommunityPatternsSection settings panels

CI / repo:
- .github/workflows/validate-community-patterns.yml runs the validator and posts a Markdown comment on each community PR
- .github/labeler.yml auto-labels community PRs with `pattern`
- patterns/community/ with README, empty index.json, examples/

Tests: +146 new tests (unit + integration). Whole suite is green (1159 passed, 4 skipped).

/simplify pass applied: PATTERN_SOURCES constants centralized, bulk-op handler extracted, indexes promoted to SCHEMA_SQL, source filter pushed into SQL, N+1 batched in apply_manifest, set_podcast_tags short-circuits when unchanged, time helpers reused.

Closes the work tracked in IMPLEMENTATION_PLAN.md (delivered via pastebin sloth-fox-spider).
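The tag-eligibility rule described above (applied only to community patterns) can be sketched as follows. This is a minimal illustration, not the code in TextPatternMatcher: the function name `is_tag_eligible`, the literal value of `UNIVERSAL_TAG`, the tag `web_services`, and the assumption that the universal tag sits on the pattern side are all hypothetical.

```python
from typing import Iterable

# Assumption: the PR mentions a hardcoded UNIVERSAL_TAG; its value is guessed here.
UNIVERSAL_TAG = "universal"


def is_tag_eligible(source: str,
                    pattern_tags: Iterable[str],
                    podcast_tags: Iterable[str]) -> bool:
    """Return True when a pattern may enter the matching loop for a podcast."""
    # Local and imported patterns bypass the tag check entirely.
    if source != "community":
        return True
    p_tags, f_tags = set(pattern_tags), set(podcast_tags)
    # If either side carries no tags there is nothing to filter on.
    if not p_tags or not f_tags:
        return True
    # A universally tagged pattern is eligible everywhere (assumed semantics).
    if UNIVERSAL_TAG in p_tags:
        return True
    # Otherwise require at least one shared tag.
    return bool(p_tags & f_tags)
```

With this shape, a community pattern tagged only `web_services` is skipped for a podcast tagged only `kids_family`, while the same pattern stored with source `local` is always tried.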
Code review

Found 1 issue:
MinusPod/src/community_sync.py, lines 100 to 108 in 94e0d83. Suggest assigning unconditionally.
- community_sync.apply_manifest: assign `version` from the manifest entry unconditionally instead of `setdefault`. The comment promised override semantics; the code had `setdefault`, which silently kept a stale `version` carried inside the inner `data` dict and broke the version-gate in `pattern_service.import_community_pattern`.
- schema._create_new_tables_only: bring inline CREATE TABLE blocks for `ad_patterns` (source, community_id, version, submitted_app_version, protected_from_sync) and `known_sponsors` (tags) back in sync with SCHEMA_SQL. End state was already correct via the ALTER TABLE migrations, but the "must match SCHEMA_SQL exactly" comment was no longer accurate; future readers would have been misled.
- schema._run_schema_migrations: defer `_reseed_known_sponsors` to AFTER `_migrate_sponsor_fk` + the Zyn cleanups. On a v2.1.x -> 2.4.0 jump the prior ordering tagged a case-variant row before the FK migration deduped case-variants, which could discard the freshly-tagged row. The reseed now operates on the canonical post-FK-migration state.

No new tests required: the existing migration and sync tests already cover idempotency, version-gating, and FK preservation, and all 1159 tests still pass.
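The `setdefault` problem in the first bullet is easy to reproduce in isolation. The sketch below assumes a manifest entry shaped like `{"version": ..., "data": {...}}`; that shape and both function names are illustrative, not copied from community_sync.py.

```python
def stamp_with_setdefault(entry: dict) -> dict:
    """Old behavior: only fills `version` when the inner dict lacks one."""
    data = dict(entry["data"])
    data.setdefault("version", entry["version"])  # a stale inner value survives
    return data


def stamp_with_override(entry: dict) -> dict:
    """Fixed behavior: the manifest entry's version always wins."""
    data = dict(entry["data"])
    data["version"] = entry["version"]
    return data


# Hypothetical manifest entry whose inner data carries an outdated version.
entry = {"version": 3, "data": {"version": 1, "sponsor": "Squarespace"}}
```

Here `stamp_with_setdefault(entry)` keeps version 1, so a version gate that compares against the stored version would treat the update as old, while `stamp_with_override(entry)` yields version 3.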
CodeQL flagged two `py/reflective-xss` high-severity alerts on the new bulk-delete / bulk-disable endpoints because user-supplied `ids` and `expected_count` were reflected back in the JSON response without type-coercion. Both responses are emitted via `jsonify` (so the real-world XSS risk is bounded), but accepting arbitrary types from input is also bad input validation; the call should hard-reject non-integer ids and non-integer expected_counts.

In _resolve_bulk_target:
- `expected_count` is now cast to int up-front; non-int payloads return 400 instead of being f-stringed into the error response.
- User-supplied `ids` are cast element-wise to int; non-int contents return 400 instead of being passed to bulk_delete / bulk_disable (which would then have failed downstream with an opaque SQL error).

No test changes: the new validation tightens accepted input shape without changing legitimate-caller behavior.
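A framework-free sketch of the coercion described above. `BadRequest` and `coerce_bulk_payload` are hypothetical names; in the real handler the failure would map to an HTTP 400 response.

```python
from typing import Any, Optional


class BadRequest(ValueError):
    """Stand-in for whatever produces the 400 response in the real handler."""


def coerce_bulk_payload(payload: dict[str, Any]) -> tuple[list[int], Optional[int]]:
    """Cast `ids` element-wise and `expected_count` to int, or reject the payload."""
    raw_ids = payload.get("ids", [])
    if not isinstance(raw_ids, list):
        raise BadRequest("ids must be a list of integers")
    try:
        ids = [int(value) for value in raw_ids]
        raw_count = payload.get("expected_count")
        expected_count = None if raw_count is None else int(raw_count)
    except (TypeError, ValueError):
        # The message is a fixed string, so no user input is reflected back.
        raise BadRequest("ids and expected_count must be integers") from None
    return ids, expected_count
```

Because the error message is constant, nothing user-controlled reaches the response body, which is the property the CodeQL alert was about.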
…rate workflow

Restructures the patterns/ directory to match the documentation plan: docs at the patterns/ root, pattern JSON files (and the manifest) under patterns/community/. patterns/community/README.md is gone, replaced by the technical reference at patterns/README.md.

New files:
- patterns/CONTRIBUTING.md: submitter-facing PR explainer (stripped fields, PII rules, quality gates, what the validator does)
- patterns/README.md: technical reference (sync mechanics, manifest + pattern file formats, tag vocabulary, reviewer workflow, ops)
- patterns/vocabulary.json: machine-readable copy of the 49-tag vocabulary; generated from src/seed_data/tag_vocabulary.csv plus the hardcoded UNIVERSAL_TAG
- .github/workflows/regenerate-manifest.yml: on push to main that touches patterns/community/**.json, rebuilds index.json and commits it back. Concurrency-gated so back-to-back PR merges don't race.
- src/tools/generate_manifest.py: companion module the workflow runs; also invocable locally via `python -m src.tools.generate_manifest`.

Updates:
- root README.md: replace the existing community section with the user-facing "Community Patterns (Optional)" section (opt-in framing, what you get / control / share, links to deeper docs). TOC entry retargeted.
- .github/workflows/validate-community-patterns.yml: invoke validator via `python -m src.tools.community_pattern_validator` instead of setting PYTHONPATH=src, matching the new generate_manifest call shape.
- src/tools/community_pattern_validator.py: Markdown comment now links back to the Quality checks / Dedupe / sponsor-add sections of the new docs, so submitters self-serve on what failed and why.

Stale references in patterns/README.md were fixed during the /humanizer pass: the sponsor seed source is now `src/seed_data/sponsors_final.csv` (loaded by the 2.4.0 migration), not the deprecated `SEED_SPONSORS` constant; the vocabulary source is `src/seed_data/tag_vocabulary.csv` (read by `src/utils/community_tags.py`), not the never-introduced `VALID_TAGS` Python constant.

All 1138 tests pass (test_api.py:21 errors are a pre-existing local permission issue against /app, unrelated to this PR; CI run on prior commits has these tests passing).
Audited patterns/CONTRIBUTING.md, patterns/README.md, root README's Community Patterns section, and CHANGELOG.md 2.4.0 entry against every rule in the /humanizer skill (Wikipedia "Signs of AI writing").

Edits:
- 13 curly quotes/apostrophes replaced with straight in the two patterns/ docs (rule 15)
- CONTRIBUTING.md Sponsor-validation section: three-bullet bold-label em-dash list converted to a prose paragraph (rules 13, 14)
- patterns/README.md Pattern file Fields list: ten em dashes replaced with ' - ' separators, matching the README Experiments-section style already used elsewhere in the repo (rule 13)
- patterns/README.md Tag categories: ' — ' inside the section preamble replaced with '; '; the Special-tag em dash replaced with a parenthetical (rule 13)
- patterns/README.md Podcast tagging: three-bullet bold-label em-dash list converted to prose (rules 13, 14)
- patterns/README.md API list: two em dashes replaced with ' - '
- root README "What you control" list: four em dashes replaced with ' - ', matching the project's existing labeled-list style

Final em-dash count in all four touched docs: 0. Curly-quote count: 0. No emojis, no AI-vocabulary words (delve/leverage/comprehensive/robust/ etc.), no inflated symbolism, no negative parallelisms. Content unchanged; only stylistic AI-tell cleanup.
The backend already exposed GET / PUT /feeds/{slug}/tags but no UI
surfaced them, so user-added tags couldn't be set without curl. Adds
the missing pieces:
- GET /api/v1/tags/vocabulary - new endpoint returning the canonical
49-tag vocabulary plus per-tag descriptions, grouped into
podcast_genres + sponsor_industries + special_tags. The frontend
picker needs the descriptions for tooltips and the grouping for the
<optgroup>s in the add-tag dropdown.
- frontend/src/components/FeedTagsEditor.tsx - new component. Shows
effective tags grouped by source (RSS / episode / user), each layer
rendered as TagChips. User-added tags have an X button to remove
them. An "+ Add tag" button opens a grouped <select> of the
remaining vocabulary tags; selection auto-saves via
setFeedUserTags.
- frontend/src/api/community.ts - getTagVocabulary() + TagVocabulary
type.
- frontend/src/pages/FeedDetail.tsx - mounts the editor as a card
directly above the Episodes section.
No tests added for the React component (project doesn't have a Jest /
RTL suite); the existing GET/PUT feed-tag API has integration coverage
in test_community_pattern_flow.py.
A side-by-side audit against IMPLEMENTATION_PLAN.md section 10 found two UI surfaces that were partially built:

1. Submit-to-community / Protect-from-sync row actions existed in the mobile card layout but were missing from the desktop table. Added a ninth "Actions" column with the per-row buttons (Submit for local, Protect/Unprotect for community). Rebalanced colgroup widths so the table doesn't blow past 100%.
2. The Export control was a plain <a> that downloaded the whole local pattern DB; the plan called for "Opens a modal with a multi-select pattern list" (section 10, line 291). Added PatternExportDialog with a checkbox per row, Select-all toggle, optional include-disabled / include-corrections flags, and download via the existing /patterns/export endpoint. Extended that endpoint to accept ?ids=1,2,3 for selecting a subset.

The dialog initializes its selection from the currently-filtered patterns prop. To stay clean of the React Compiler's preserve-manual-memoization rule, the outer component remounts the inner implementation on each open so useState's initializer re-syncs; no useEffect prop->state writes.

Both buttons in the desktop table use the existing handlers (handleSubmitToCommunity / handleToggleProtect) shared with the mobile card path, so behavior is identical across breakpoints.
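For the `?ids=1,2,3` subset filter, a parser along these lines would be enough. `parse_ids_param` is a hypothetical helper; how the real /patterns/export handler reads the query string is not shown in this PR text.

```python
from typing import Optional


def parse_ids_param(raw: Optional[str]) -> Optional[list[int]]:
    """Parse '1,2,3' into [1, 2, 3]; None means no filter (export everything)."""
    if raw is None or not raw.strip():
        return None
    try:
        # Empty segments (for example from a trailing comma) are skipped.
        return [int(part) for part in raw.split(",") if part.strip()]
    except ValueError:
        raise ValueError("ids must be comma-separated integers") from None
```

Returning `None` for a missing parameter keeps the old "download everything" behavior intact for callers that never pass `ids`.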
Findings from a /simplify audit of commits 82ff17e..8aa59b7 (the changes made after the initial 2.4.0 simplify pass):

- Tag-vocabulary CSV was being parsed in three places (utils helper, one-shot vocabulary.json generator, and the new /tags/vocabulary endpoint). Extracted `utils.community_tags.vocabulary_payload()` and cached it with `@lru_cache(maxsize=1)`. The endpoint and the patterns/vocabulary.json regenerator now share one source of truth; vocabulary cannot drift between them.
- Endpoint discoverability: `/tags/vocabulary` was on sponsors.py because that's where the other tag-CRUD routes lived, but the endpoint has no sponsor coupling; moved to a new src/api/tags.py and registered it in the Blueprint import line. A frontend dev hunting "where is the tag vocab endpoint" will now find it on the first grep.
- Stringly-typed pattern sources: the TS side had `'local' | 'community' | 'imported'` literals at ~9 sites. Added `PATTERN_SOURCES` const + `PatternSource` type in api/patterns.ts (mirrors the existing PATTERN_SOURCES frozenset in utils/community_tags.py). Existing call sites still typecheck unchanged.
- Vocabulary React Query staleTime: was 1h, but the vocabulary ships with the app image and can't change at runtime. Bumped to Infinity + gcTime: Infinity so we fetch once per page load, never refetch on focus.
- tools/ sys.path bootstrap: created src/tools/__init__.py so the workflow-style `python -m src.tools.X` path explicitly wires src/ onto sys.path through the package init. The per-script sys.path.insert lines remain as a defensive fallback for direct `python script.py` invocation; they short-circuit when __init__.py has already done the work.
- Duplicate WHY-comment in schema.py (one at the position the reseed call _used to_ live in the migration block, one at the call site after Zyn cleanups). Kept the call-site comment, trimmed the upstream one to a single-line pointer.
- Dropped two narrating WHAT-comments (FeedDetail's "Feed-level tag editor" + PatternExportDialog's remount-trick re-explainer). The identifiers and JSX already say it.

All 1138 backend tests still pass; frontend lint + build clean.
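The cached single-source-of-truth idea from the first bullet might look roughly like this. The CSV column names, the sample rows, and the returned shape are assumptions made for the sketch; the real helper reads src/seed_data/tag_vocabulary.csv.

```python
import csv
import io
from functools import lru_cache

# Assumption: a tiny inline stand-in for tag_vocabulary.csv with guessed columns.
_SAMPLE_CSV = """tag,category,description
kids_family,podcast_genres,Shows for children and families
technology,podcast_genres,Tech news and commentary
web_services,sponsor_industries,Hosting and site builders
"""


@lru_cache(maxsize=1)
def vocabulary_payload() -> dict:
    """Parse the vocabulary once and group tags by category."""
    groups: dict = {}
    for row in csv.DictReader(io.StringIO(_SAMPLE_CSV)):
        groups.setdefault(row["category"], []).append(
            {"tag": row["tag"], "description": row["description"]}
        )
    return groups
```

Every caller receives the same cached dict object, so callers should treat it as read-only; mutating it would change what later callers see.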
2.4.0 already shipped (Docker tag pushed, prod deployed). Subsequent commits 82ff17e..097a8e6 added material runtime changes that should not have ridden the same tag:

- FeedTagsEditor + GET /tags/vocabulary endpoint (c0c0680)
- Multi-select PatternExportDialog + /patterns/export ?ids= filter (8aa59b7)
- Per-row Submit / Protect buttons in the desktop pattern table (8aa59b7)
- Vocabulary caching + endpoint relocation (097a8e6)
- TS PATTERN_SOURCES const (097a8e6)
- community_sync version-stamp fix, CodeQL bulk-op coercion, SCHEMA_SQL drift fix, sponsor-reseed reorder (82ff17e, 4969436)

Bumping to 2.4.1 so the registry tag matches the running code. CHANGELOG entry summarizes what's new in 2.4.1 vs 2.4.0 across Added / Changed / Fixed. openapi.yaml info.version also bumped.

Going forward: every rebuild that ships new app code gets a fresh versioned tag.
External audit (pastebin otter-wasp-eel) flagged 5 gaps after I merged the v2.4.1 bump. All 5 fixed:

1. **Reviewer-trim now actually trims** instead of full-replacing the template. `rewrite_pattern_from_bounds` takes original + new bounds, computes the head/tail transcript slices, and splices them out of the existing template only when they appear at its start/end. The prior "Operation 2 full replace" risk (fitting the template to one episode's transcription) is gone. intro_variants / outro_variants get the same prefix/suffix trim treatment so they stay aligned. Returns False when neither head nor tail trim matches the template, as a defensive bail-out for misaligned input.
2. **Labeler workflow added.** `.github/labeler.yml` had the path-glob for `pattern` but no workflow invoked actions/labeler. `.github/workflows/labeler.yml` now wires it under `pull_request_target`.
3. **`vocabulary_version` is now read by the sync job.** sync_now compares `manifest['vocabulary_version']` against the app's value and emits a log warning + `summary['vocabulary_warning']` on mismatch. Informational only; vocabulary still ships with the app image. Bad / non-numeric values get a clean log warning instead of crashing the sync.
4. **`GITHUB_REPO` is now a single constant.** `community_export.py` and `community_sync.py` were hand-rolling separate copies of the upstream identity. New `GITHUB_REPO` + `COMMUNITY_MANIFEST_URL` in `utils/community_tags.py`; both call sites import from there.
5. **`extract_transcript_segment` no longer imported from api/.** `pattern_service.py` now imports `extract_text_in_range` from `utils.text` directly; the previous `from api import ...` was the wrong dependency direction (service layer leaning on api layer).
Then a /simplify pass over the gap fixes surfaced five more cleanups:

- `VOCABULARY_VERSION` + `MANIFEST_VERSION` were defined in `tools/generate_manifest.py` (a build-time CLI) and the runtime `community_sync.py` was importing them across that wrong-direction boundary. Moved both to `utils/community_tags.py` where the vocabulary lives. Both `generate_manifest.py` and `community_sync.py` now import from there. `vocabulary_payload()` also uses the constant instead of a hardcoded literal `1`.
- The broad `except Exception` around the version check is now scoped to `(TypeError, ValueError)` on the int cast.
- Dropped `MANIFEST_URL = COMMUNITY_MANIFEST_URL` (pure indirection alias, no external consumer).
- Dropped local `import json as _json` inside `rewrite_pattern_from_bounds`; module-level `import json` is already in scope.
- Extracted `_splice_prefix` / `_splice_suffix` module helpers in `pattern_service.py`. Template trim and the two variant-array rewrites all call them, so the duplicated whitespace+case-insensitive splice logic became one tested pair.

1139 backend tests pass (one more than before from the new trim test); frontend lint + build clean.
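A sketch of the narrowed version check, assuming the manifest is a plain dict. `check_vocabulary_version` is a hypothetical name, and treating a missing `vocabulary_version` the same as a non-numeric one is an assumption of this sketch.

```python
import logging

log = logging.getLogger(__name__)

# The PR text mentions a former hardcoded literal 1; the value is assumed here.
VOCABULARY_VERSION = 1


def check_vocabulary_version(manifest: dict, summary: dict) -> None:
    """Warn on a vocabulary mismatch without ever failing the sync."""
    raw = manifest.get("vocabulary_version")
    try:
        remote = int(raw)
    except (TypeError, ValueError):
        # Only the int cast is guarded; other errors are allowed to surface.
        log.warning("ignoring non-numeric vocabulary_version: %r", raw)
        return
    if remote != VOCABULARY_VERSION:
        message = f"manifest vocabulary_version {remote} != app {VOCABULARY_VERSION}"
        log.warning(message)
        summary["vocabulary_warning"] = message
```

Scoping the `except` to the two exceptions `int()` raises for bad input keeps genuine bugs elsewhere in the check from being silently swallowed.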
Pulled from grafana/loki on services03 while debugging a 14-minute pass-1 run on cordkillers-only-audio episode 98743f657890.

1. **Fingerprint slow-fallback timeout**: the full-file fpcalc call failed with "Invalid data found when processing input" (audio file has bytes fpcalc rejects but Whisper tolerates). The fallback then inherited the full 600-second timeout for per-window scanning, which uses the same fpcalc binary and produced 0 matches over 10 minutes. New `FALLBACK_SLOW_TIMEOUT = 90` caps the fallback. Stage 1 now skips fast on broken audio instead of stalling.
2. **`processing_timeouts._resolve` broken DB import**: imported `get_database` from the `database` package, but it actually lives in `api`. The ImportError was swallowed by the try/except, so user-configured `processing_soft_timeout_seconds` / `processing_hard_timeout_seconds` settings were silently shadowed by env-var / default fallbacks. Repaired the import path.
3. **`community_sync` 404 noise**: every 15-min background tick was logging a WARNING because the manifest URL points at main but `patterns/community/index.json` only exists on this feature branch. Downgraded the 404 case to INFO with an explanatory message; non-404 fetch failures still WARN.
4. **`set_podcast_tags` skips episode aggregation** when the incoming RSS tags are already a subset of the row's current union and user_tags isn't being touched. The dominant case on the refresh hot path (cordkillers refreshes had 338 episodes -> 338 JSON parses every 15 min for nothing).

All 1139 tests still green. None of these required schema changes.
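The short-circuit in item 4 boils down to a subset test. The sketch below assumes the podcast row stores its current tag union as a JSON array string; `needs_episode_aggregation` and its parameters are hypothetical.

```python
import json
from typing import Iterable, Optional


def needs_episode_aggregation(current_union_json: Optional[str],
                              incoming_rss_tags: Iterable[str],
                              user_tags: Optional[Iterable[str]] = None) -> bool:
    """Return False when a refresh cannot change the podcast's effective tags."""
    # Any explicit user_tags write (even an empty list) forces a full recompute.
    if user_tags is not None:
        return True
    current = set(json.loads(current_union_json or "[]"))
    # If every incoming RSS tag is already present, skip the per-episode parse.
    return not set(incoming_rss_tags) <= current
```

On a feed whose RSS categories rarely change, this turns the common refresh into one JSON parse of the podcast row instead of one per episode.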
…tern validator, 12 seed patterns

- Move per-pattern Submit-to-community into the Export dialog as a destination radio. Multi-select drives N prefilled-PR tabs in one click; community patterns are auto-excluded.
- New DELETE /api/v1/community-patterns/all endpoint plus a Settings button to wipe every source=community pattern (and its audio_fingerprints rows). Requires confirm:true.
- PR validator now rejects multi-sponsor blocks. Mirrors the export-side foreign-sponsor check; both now share find_foreign_sponsors() in community_export.py.
- Seed patterns/community/ with 12 curated patterns from a real instance export (Capital One, Carvana x2, Instacart, Kayak, Mint Mobile, Monday.com, Progressive, SimpliSafe, Squarespace, ThreatLocker, Zyn).
- Popup-blocker fix: open N blank tabs synchronously at click, redirect each once its submit returns.
- Misc: PATTERN_SOURCE_* named consts, downloadBlob reuse, ASCII docstrings, ['communitySync'] cache invalidation on purge, 4 new validator unit tests.
Community pattern validation: Rejected (12)
See Quality checks and Dedupe for what each gate enforces.
CI checks out the PR branch, so files added by the PR are already on disk in patterns/community/ at run time. The validator's _load_existing_patterns() sweep was picking them up, then dedupe() compared each new doc against the identical-on-disk copy and rejected with score=1.00. The 12 seed patterns in 2.4.4 hit this on first push. Build pr_community_ids by reading each --pr-files arg and excluding any existing row whose community_id matches before validate_doc() runs. Regression test added.
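The fix amounts to filtering the on-disk sweep by the ids the PR itself carries. A minimal sketch, assuming each pattern document is a dict with a `community_id` key; `exclude_pr_patterns` is a hypothetical name and the ids are made up.

```python
def exclude_pr_patterns(existing: list[dict], pr_docs: list[dict]) -> list[dict]:
    """Drop on-disk rows that are just the PR's own files seen a second time."""
    pr_community_ids = {
        doc["community_id"] for doc in pr_docs if doc.get("community_id")
    }
    return [
        row for row in existing
        if row.get("community_id") not in pr_community_ids
    ]


# Hypothetical data: the checkout already contains the file the PR adds.
on_disk = [{"community_id": "zyn-1"}, {"community_id": "kayak-1"}]
from_pr = [{"community_id": "zyn-1"}]
```

Here `exclude_pr_patterns(on_disk, from_pr)` leaves only the `kayak-1` row, so the new `zyn-1` document is no longer compared against its own copy.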
Community pattern validation: Passed (12)
Validation passed. Ready for review.
…[object Object] fix
The 2.4.4 submit-to-community flow opened one prefilled PR tab per pattern.
At scale that broke: out of 215 selected, only 8 tabs survived the popup
blocker, 20 forced JSON downloads, 187 returned 400s rendered as
'[object Object]' in the post-submit alert. None reached GitHub.
This release replaces the per-tab fan-out with a single bundle download
and adds two new endpoints to drive it:
- POST /api/v1/patterns/preview-export: dry-runs the quality gates and
returns {ready, rejected[{id, sponsor, reasons}], counts}.
- POST /api/v1/patterns/submit-bundle: returns one downloadable JSON file
containing every pattern that passed (format: minuspod-community-submission).
The contributor commits it into their fork and opens one PR.
The PR-side validator and manifest builder both handle the new bundle
format natively (one validation per entry, one manifest entry per
contained pattern), so maintainers don't have to split files on merge.
Other fixes:
- /community-patterns/sync returns 200 {status: no_manifest_yet} on
upstream 404 instead of 502 (e.g. patterns feature still on a branch).
- apiRequest's extractErrorMessage helper normalizes {error: {message,
reasons}} bodies to a string so failures no longer render as
[object Object] anywhere in the UI.
Simplify pass folded BUNDLE_FORMAT + iter_bundle_patterns into
utils/community_tags.py (was duplicated 3 places + 2 string literals),
added Database.get_ad_patterns_by_ids() batch helper used by build_bundle,
collapsed downloadCommunityBundle to trigger its own browser download, and
dropped the redundant stage state in PatternExportDialog (now derived).
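How a validator or manifest builder might treat bundles and single-pattern files uniformly is sketched below. Only the format string comes from the text above; the `format` and `patterns` key names and the body of `iter_bundle_patterns` are assumptions, not the implementation in utils/community_tags.py.

```python
from typing import Iterator

BUNDLE_FORMAT = "minuspod-community-submission"


def iter_bundle_patterns(doc: dict) -> Iterator[dict]:
    """Yield every pattern in a bundle, or the document itself if it is a single pattern."""
    if doc.get("format") == BUNDLE_FORMAT:
        # Assumed layout: the bundle keeps its patterns in a top-level list.
        yield from doc.get("patterns", [])
    else:
        yield doc
```

Callers can then loop over `iter_bundle_patterns(doc)` and run one validation (or emit one manifest entry) per yielded pattern, regardless of which file shape was committed.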
Community pattern validation: Passed (12)
Validation passed. Ready for review.
…nsive parse

text_pattern_matcher passed json.dumps([intro]) into create_ad_pattern, which json.dumps'd it again. Every auto-created pattern's variants got stored as a JSON-string of a JSON-string. The 2.4.5 community bundle pipeline parsed that as a string and exploded it character-by-character (user's first bundle had intro_variants of length 196 starting with ['[', '"', 'E', 'm', ...]).

Fixes:
- Pass plain lists from text_pattern_matcher (root cause).
- _safe_parse_variants in community_export retries the decode when the first parse returns a string, so bundles built from existing broken DBs still produce clean output.
- One-shot _repair_double_encoded_variants migration in schema.py; re-encodes affected rows on next container start. Stamped via variant_reencode_revision setting, idempotent.

Also: dialog CLI snippet now suggests `gh pr create --fill --label pattern` so the label gets requested directly. The labeler workflow still adds it automatically on path match; this is belt-and-suspenders.

Tests: +2 in test_community_export (double-encoded repair + idempotent on clean rows). 1084 unit tests pass.
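The double-encoding failure and the defensive re-parse can be reproduced in a few lines. `safe_parse_variants` below is an illustrative stand-in for `_safe_parse_variants`; returning an empty list for undecodable input is an assumption of the sketch.

```python
import json
from typing import Any


def safe_parse_variants(raw: Any) -> list:
    """Decode a variants column, retrying once if the first decode yields a string."""
    value = raw
    for _ in range(2):
        if not isinstance(value, str):
            break
        try:
            value = json.loads(value)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            return []
    return value if isinstance(value, list) else []


# What the buggy write path stored: a JSON string of a JSON string.
double_encoded = json.dumps(json.dumps(["Enjoy the show"]))
# A single decode yields a str; iterating it explodes into characters.
exploded = list(json.loads(double_encoded))
```

Here `exploded` begins with `'['` and `'"'`, the same symptom seen in the broken bundle, while `safe_parse_variants(double_encoded)` recovers the original one-element list.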
Community pattern validation: Passed (12)
Validation passed. Ready for review.
Summary
- … patterns/community/ in the repo; instances opt in to a cron-driven auto-pull, and a new "Submit to community" button on each local pattern opens a prefilled GitHub PR.
- … universal, or either side is empty). Local + imported patterns bypass the tag check.
- … src/seed_data/sponsors_final.csv, preserving FKs (UPDATE-by-name, soft-delete orphans).

See CHANGELOG.md for the full per-section breakdown.

Test plan
- pytest tests/: 1159 passed, 4 skipped (146 new tests added)
- npm run lint && npm run build: green
- /simplify reviewer pass: PATTERN_SOURCES constants centralized, bulk-op handler extracted, indexes promoted to SCHEMA_SQL, source filter pushed into SQL, N+1 batched in apply_manifest, set_podcast_tags short-circuits when unchanged, time helpers reused
- … itunes:category parsing populates podcasts.tags, verify Submit-to-community opens the prefilled PR URL, verify protect-from-sync auto-engages when a community pattern is edited
- /code-review pass on this PR

Out of scope (deferred to a future iteration)
Per plan section 16: auto-apply variant merges in CI; per-source sync schedules; auto-cleanup of unused patterns; non-English stopword lists; LLM-suggested tags; submitter identity beyond submitted_app_version; auto-regeneration of index.json on merge.