diff --git a/.claude/scheduled_tasks.lock b/.claude/scheduled_tasks.lock
new file mode 100644
index 0000000..fb7a2f6
--- /dev/null
+++ b/.claude/scheduled_tasks.lock
@@ -0,0 +1 @@
+{"sessionId":"7555a767-0d96-4490-86d6-a13b5c13148b","pid":40413,"procStart":"Sun May 3 16:45:03 2026","acquiredAt":1777917963204}
\ No newline at end of file
diff --git a/CLAUDE.md b/CLAUDE.md
index 9682c3f..c6dd7e7 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -111,7 +111,7 @@ RepoRefreshWorker (hourly) — re-fetches passthrough repos by oldest indexed_a
 - **Auth is a stateless proxy, not a session.** `/v1/auth/device/*` forwards to `github.com/login/*` with the backend's `GITHUB_OAUTH_CLIENT_ID` injected. The backend must **never** log, cache, or persist the access token returned by a successful poll — it passes through the suspending handler and out to the HTTP response, nothing else. No database table, no in-memory map, no breadcrumb. The client is the only place the token lives. Client is backend-first on these two calls and falls back to direct-to-github.com on 5xx / network errors (only — not on valid-but-negative responses like `authorization_pending` or `access_denied`, which are GitHub's real answer and `github.com` direct would say the same thing).
 - **Unified ranking via `SearchScore.compute()`** (`ranking/SearchScore.kt`). Formula: `0.40·log₁₀(stars+1)/6 + 0.30·ctr + 0.20·install_success_rate + 0.10·exp(-days_since_release/90)`. Two callers: `SignalAggregationWorker` (hourly, with real signals) and `GitHubSearchClient` at ingest time (cold-start, signals = 0 — still gives passthrough repos a non-null score so they sort). Weights live in the object only; never inline the formula elsewhere.
 - **Meilisearch partial-update gotcha — PUT, never POST.** `MeilisearchClient.addDocuments()` is POST, which on Meili *replaces* the document with whatever fields you send (everything else becomes null). `MeilisearchClient.updateScores()` is PUT, which merges. Pushing just `{id, search_score}` with POST will wipe every other field on 3000+ docs. If you add a new "partial update" path, verify the HTTP verb before deploying.
-- **Dynamic category/topic ordering.** `RepoRepository.findByCategory()` / `findByTopicBucket()` sort by `searchScore DESC NULLS LAST, rank ASC`. The Python fetcher's static `rank` is only a tie-breaker now; behavioral signals dominate.
+- **Dynamic category/topic ordering.** `RepoRepository.findByCategory()` picks a category-specific primary sort column (`trending_score` for trending, `popularity_score` for most-popular, `latest_release_date` for new-releases), falls back to global `searchScore`, then static `rank` as final tie-breaker. Without category-specific primary, both trending and most-popular collapse onto the same global score — the bug fix in PR #12. `findByTopicBucket()` keeps the simpler `searchScore DESC NULLS LAST, rank ASC` order because topics are flat lists, not flavour-segmented like the categories.
 - **Exposed `Repos` table uses `array("topics", TextColumnType())`** for the Postgres `TEXT[]` column. The Python fetcher writes these via psycopg2's automatic list-to-array conversion.
 - **Cache headers are set per endpoint**, not globally. Announcements: 600s/3600s. Categories/topics: 60s/600s. Repo detail: 30s/300s. Search: 15s/30s. Readme proxy: 3600s/21600s. User proxy: 86400s/604800s. Badges (fresh): 3600s/3600s with `stale-while-revalidate=86400`; (degraded) 300s/300s. Edge respects `s-maxage`; the larger `s-maxage` lets Gcore's shield/tiered cache topology absorb origin load while browsers stay fresher via the smaller `max-age`. `/internal/metrics` is uncached.
 - **HEAD routes to GET** via the `AutoHeadResponse` plugin (`Plugins.kt`). Without it, Ktor 3 returns 404 for HEAD on every GET handler — confusing for `curl -I`, monitoring, and CDN origin probes.
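Reviewer note: the ranking formula quoted in the CLAUDE.md context above can be sketched in Kotlin as follows. This is only an illustration of the arithmetic. The object name `SearchScoreSketch`, the parameter names, and the absence of any clamping are assumptions; the real `SearchScore.compute()` in `ranking/SearchScore.kt` is not part of this diff and may differ.

```kotlin
import kotlin.math.exp
import kotlin.math.log10

// Hypothetical stand-in for ranking/SearchScore.kt (not shown in this diff).
// Only the weights and the formula come from CLAUDE.md; the rest is assumed.
object SearchScoreSketch {
    private const val W_STARS = 0.40
    private const val W_CTR = 0.30
    private const val W_INSTALL = 0.20
    private const val W_FRESHNESS = 0.10

    fun compute(
        stars: Int,
        ctr: Double,
        installSuccessRate: Double,
        daysSinceRelease: Double,
    ): Double {
        // log10(stars + 1) / 6 maps 0 stars to 0.0 and roughly one million stars to 1.0.
        val starTerm = log10(stars + 1.0) / 6.0
        // Exponential decay with a 90-day scale: 1.0 on release day, about 0.37 after 90 days.
        val freshnessTerm = exp(-daysSinceRelease / 90.0)
        return W_STARS * starTerm +
            W_CTR * ctr +
            W_INSTALL * installSuccessRate +
            W_FRESHNESS * freshnessTerm
    }
}

fun main() {
    // Cold-start case described in CLAUDE.md: behavioral signals are 0,
    // so only the stars and freshness terms contribute.
    val coldStart = SearchScoreSketch.compute(
        stars = 999, ctr = 0.0, installSuccessRate = 0.0, daysSinceRelease = 0.0,
    )
    println("cold-start score: $coldStart")
}
```

With 999 stars and a release today, the star term is `log10(1000)/6 = 0.5`, so the cold-start score is approximately `0.40·0.5 + 0.10·1 = 0.30`, which is why freshly ingested passthrough repos still sort sensibly.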
diff --git a/src/main/kotlin/zed/rainxch/githubstore/db/RepoRepository.kt b/src/main/kotlin/zed/rainxch/githubstore/db/RepoRepository.kt
index 393a1fd..808217b 100644
--- a/src/main/kotlin/zed/rainxch/githubstore/db/RepoRepository.kt
+++ b/src/main/kotlin/zed/rainxch/githubstore/db/RepoRepository.kt
@@ -19,18 +19,32 @@ class RepoRepository {
     }
 
     suspend fun findByCategory(category: String, platform: String, limit: Int = 50): List = newSuspendedTransaction(Dispatchers.IO) {
-        // Primary: dynamic behavioral search_score (updated hourly by
-        // SignalAggregationWorker from clicks / installs / stars / freshness).
-        // Tie-breaker: the static rank the Python fetcher writes once a day,
-        // which preserves the category's semantic flavor (trending stays
-        // velocity-flavored, new-releases stays recency-flavored, etc.) when
-        // two repos have similar behavioral scores.
+        // Primary sort is category-specific: trending velocity for the
+        // trending list, absolute popularity for the popular list, release
+        // recency for new-releases. Without category-specific primary, both
+        // trending and most-popular collapse onto the same global
+        // search_score and return ~99% identical top-N results -- the bug
+        // this query previously had.
+        //
+        // Each category falls back to the global behavioral search_score
+        // when its category-specific column is NULL, then to the static
+        // rank the Python fetcher writes once a day. The fetcher populates
+        // the category-specific scores for repos in that category, so the
+        // fallback is mostly a no-op except for newly-ingested rows that
+        // haven't been reranked yet.
+        val primary: org.jetbrains.exposed.sql.Expression<*> = when (category) {
+            "trending" -> Repos.trendingScore
+            "most-popular" -> Repos.popularityScore
+            "new-releases" -> Repos.latestReleaseDate
+            else -> Repos.searchScore
+        }
         Repos.innerJoin(RepoCategories, { id }, { repoId })
             .selectAll()
             .where {
                 (RepoCategories.category eq category) and (RepoCategories.platform eq platform)
             }
             .orderBy(
+                primary to SortOrder.DESC_NULLS_LAST,
                 Repos.searchScore to SortOrder.DESC_NULLS_LAST,
                 RepoCategories.rank to SortOrder.ASC,
             )
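Reviewer note: to make the effect of the new three-level `orderBy` easier to reason about, here is a minimal in-memory sketch of the same ordering using plain Kotlin comparators instead of Exposed. `RepoRow`, `primaryKey`, `descNullsLast`, `orderFor`, and the sample data are all hypothetical names invented for illustration. The `new-releases` branch is omitted because its column is a date rather than a score.

```kotlin
// Hypothetical row type; the real query reads these columns from Repos / RepoCategories.
data class RepoRow(
    val name: String,
    val trendingScore: Double?,
    val popularityScore: Double?,
    val searchScore: Double?,
    val rank: Int,
)

// Mirrors the `when (category)` in the diff: choose the category-specific primary key.
fun primaryKey(category: String): (RepoRow) -> Double? = when (category) {
    "trending" -> { row -> row.trendingScore }
    "most-popular" -> { row -> row.popularityScore }
    else -> { row -> row.searchScore }
}

// In-memory equivalent of SortOrder.DESC_NULLS_LAST for a nullable Double key.
fun descNullsLast(key: (RepoRow) -> Double?): Comparator<RepoRow> =
    compareBy(nullsLast(reverseOrder<Double>()), key)

// primary DESC NULLS LAST, then searchScore DESC NULLS LAST, then rank ASC.
fun orderFor(category: String): Comparator<RepoRow> =
    descNullsLast(primaryKey(category))
        .then(descNullsLast { it.searchScore })
        .thenBy { it.rank }

val sampleRows = listOf(
    RepoRow("a", trendingScore = 0.9, popularityScore = 0.2, searchScore = 0.5, rank = 3),
    RepoRow("b", trendingScore = 0.1, popularityScore = 0.8, searchScore = 0.7, rank = 1),
    // Rows without category-specific scores, e.g. newly ingested and not yet reranked.
    RepoRow("c", trendingScore = null, popularityScore = null, searchScore = 0.6, rank = 2),
    RepoRow("d", trendingScore = null, popularityScore = null, searchScore = 0.9, rank = 4),
)

fun main() {
    for (category in listOf("trending", "most-popular", "other")) {
        val names = sampleRows.sortedWith(orderFor(category)).map { it.name }
        println("$category -> $names")
    }
    // trending     -> [a, b, d, c]
    // most-popular -> [b, a, d, c]
    // other        -> [d, b, c, a]
}
```

One behavior worth noting: as in the SQL, rows whose primary column is NULL are not interleaved with scored rows by `searchScore`. They sort after every row that has a primary value and are ordered by `searchScore` only among themselves, so a high global score cannot lift an unscored row above a scored one.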