Japanese version: README.ja.md
An offline-first analytics Metrics API built with FastAPI, DuckDB, and Parquet.
This repository is a small but production-minded backend portfolio project. It exposes a read-only HTTP API over a deterministic synthetic SaaS dataset, returns predefined analytics metrics through GET /metrics/{name}, resolves user entities through GET /users/{user_id}, and now also exposes lightweight job-run resources for operational inspection.
The project is intentionally scoped as an MVP: small enough to review quickly, but structured to demonstrate backend and analytics-engineering fundamentals that matter in real work.
What this repository is designed to show:
- Backend fundamentals: FastAPI routing, validation, status codes, error handling, typed request/response boundaries, and resource-oriented HTTP design.
- Data / analytics engineering fundamentals: deterministic sample data generation, DuckDB-based query execution, Parquet-backed local data storage, stable metric definitions, and offline regression tests.
- Engineering hygiene: reproducible local setup with `uv`, offline-first CI, golden-output testing, and explicit project contracts.
This repository is part of my engineering portfolio for backend / data-oriented roles.
The goal is not to present a large feature set. The goal is to present a repo that a hiring manager or engineer can review quickly and use to confirm the following:
- I can design a small API with clear contracts.
- I can separate application, query, and data-generation concerns.
- I can make implementation choices that favor reproducibility and testability.
- I can write code and documentation that are minimal, explicit, and easy to reason about.
In other words, this project is meant to function as a compact proof of practical engineering judgment rather than as a flashy demo.
This API is intentionally designed around resource-oriented HTTP / REST-style principles.
- Resources are represented by stable paths:
  - `GET /health`
  - `GET /metrics`
  - `GET /metrics/{name}`
  - `GET /jobs/runs`
  - `GET /jobs/{job_name}/summary`
  - `GET /users/{user_id}`
- Resource identity lives in the path (`{name}`, `{user_id}`), while filtering and windowing live in query parameters (`start`, `end`, `group_by`, `limit`).
- The MVP is read-only, so `GET` is the only method exposed.
- The API uses conventional HTTP status codes such as `200`, `404`, and `422`.
- Successful responses consistently use a `data` + `meta` response envelope.
This is not a “full REST maturity model”. It is a deliberate attempt to show that even a small portfolio API can respect resource boundaries, predictable URL design, and HTTP semantics.
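The path/query split and the response envelope described above can be sketched in a few lines of standard-library Python. This is an illustration only, not the code in `src/app/main.py`: the helper names `metric_url` and `envelope` are hypothetical, and the `meta` block is assumed to carry only the `dataset` field shown in the example responses.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://127.0.0.1:8000"  # local dev address used elsewhere in this README


def metric_url(name: str, **params: object) -> str:
    """Resource identity goes in the path; windowing and filters go in the query string."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    path = f"/metrics/{quote(name, safe='')}"
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def envelope(data: object, dataset: str = "synthetic_saas_v0") -> dict:
    """Wrap a payload in a data + meta envelope (meta contents are an assumption)."""
    return {"data": data, "meta": {"dataset": dataset}}


url = metric_url("dau", start="2026-01-01", end="2026-01-07", group_by="day", limit=365)
print(url)
print(envelope({"rows": []}))
```

Keeping the metric name in the path means the same URL always identifies the same resource, while the query string only narrows the view of it.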
The application reads local Parquet datasets, queries them through DuckDB, and exposes a small read-only analytics API.
It currently provides:
- predefined analytics metrics through `GET /metrics/{name}`
- user entity lookup through `GET /users/{user_id}`
- lightweight job-run resources through `GET /jobs/runs` and `GET /jobs/{job_name}/summary`
- Dataset:
  - deterministic synthetic SaaS-style events (`signup`, `login`, `checkout`, `cancel`)
  - deterministic synthetic job runs for a small fixed job catalog
- Storage:
  - `data/clean/events.parquet`
  - `data/clean/users.parquet`
  - `data/clean/job_runs.parquet`
- Query engine: DuckDB
- API framework: FastAPI
- Main endpoints:
  - `GET /health`
  - `GET /metrics`
  - `GET /metrics/{name}`
  - `GET /jobs/runs`
  - `GET /jobs/{job_name}/summary`
  - `GET /users/{user_id}`
- `dau`: Daily Active Users
- `new_users`: count of users whose first observed event falls in the day
- `conversion_rate`: among users with `signup` in the window, fraction who also have `checkout` in the window
These metrics are intentionally small in scope, but they map to common product and business questions:
- `dau` is a baseline engagement KPI: “How many users are actively using the product?”
- `new_users` is an acquisition KPI: “Are we bringing new users into the product?”
- `conversion_rate` is a funnel-efficiency KPI: “How effectively do signups turn into a downstream value event?”
Together, they provide a minimal analytics view of acquisition, engagement, and conversion.
For a fuller explanation of metric semantics, business meaning, and current limitations, see METRICS.md.
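To make the three definitions concrete, here is a minimal pure-Python sketch over a handful of in-memory events. It is not the DuckDB SQL in `src/app/warehouse.py`: the event tuples are hypothetical, the window is assumed to be inclusive on both ends, and an empty signup cohort is assumed to yield `None`. METRICS.md remains the authority on the real semantics.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical events: (user_id, event_type, event_time).
EVENTS = [
    (1, "signup", datetime(2026, 1, 1, 9, 0)),
    (1, "checkout", datetime(2026, 1, 2, 10, 0)),
    (2, "signup", datetime(2026, 1, 1, 12, 0)),
    (2, "login", datetime(2026, 1, 2, 8, 0)),
    (3, "signup", datetime(2026, 1, 2, 15, 0)),
]


def dau(events, day: date) -> int:
    """Distinct users with any event on the given day."""
    return len({uid for uid, _, ts in events if ts.date() == day})


def new_users(events, day: date) -> int:
    """Users whose first observed event falls on the given day."""
    first_seen = {}
    for uid, _, ts in events:
        if uid not in first_seen or ts < first_seen[uid]:
            first_seen[uid] = ts
    return sum(1 for ts in first_seen.values() if ts.date() == day)


def conversion_rate(events, start: date, end: date) -> Optional[float]:
    """Among users with a signup in the window, the fraction who also have a checkout in it."""
    in_window = [(uid, kind) for uid, kind, ts in events if start <= ts.date() <= end]
    signed_up = {uid for uid, kind in in_window if kind == "signup"}
    converted = {uid for uid, kind in in_window if kind == "checkout"} & signed_up
    return len(converted) / len(signed_up) if signed_up else None


print(dau(EVENTS, date(2026, 1, 2)))
print(new_users(EVENTS, date(2026, 1, 1)))
print(conversion_rate(EVENTS, date(2026, 1, 1), date(2026, 1, 2)))
```

Note that `conversion_rate` depends on the window: user 1 counts as converted only when both the signup and the checkout fall inside it.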
The v0.2.0 line extends the repository with a small read-only job-run layer backed by data/clean/job_runs.parquet.
These endpoints are intentionally lightweight:
- `GET /jobs/runs`
  - list job runs within a requested date window
  - optional filtering by `job_name` and `status`
  - derived fields such as `duration_sec` and `schedule_delay_sec`
- `GET /jobs/{job_name}/summary`
  - return one-job aggregate statistics within a requested date window
  - include counts, rates, averages, and `latest_*` fields based on the latest scheduled run in the filtered window
This is not a scheduler, queue worker, or orchestration system. It is a compact operational read layer designed to show resource-oriented API design, Parquet-backed query modeling, and SQL-to-API traceability.
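The derived fields and the summary shape can be illustrated with a small pure-Python sketch. Everything specific here is an assumption made for illustration: the column names (`scheduled_at`, `started_at`, `finished_at`), the status values, the summary keys, and the definitions "duration = finish minus start" and "schedule delay = start minus scheduled time". The real queries live in `src/app/warehouse.py`.

```python
from datetime import datetime

# Hypothetical job-run rows; field names and status values are assumptions.
RUNS = [
    {"job_name": "daily_ingest", "status": "success",
     "scheduled_at": datetime(2026, 1, 1, 2, 0, 0),
     "started_at": datetime(2026, 1, 1, 2, 0, 30),
     "finished_at": datetime(2026, 1, 1, 2, 5, 30)},
    {"job_name": "daily_ingest", "status": "failed",
     "scheduled_at": datetime(2026, 1, 2, 2, 0, 0),
     "started_at": datetime(2026, 1, 2, 2, 1, 0),
     "finished_at": datetime(2026, 1, 2, 2, 2, 0)},
]


def with_derived(run: dict) -> dict:
    """Add duration_sec and schedule_delay_sec under the assumed definitions."""
    return {
        **run,
        "duration_sec": (run["finished_at"] - run["started_at"]).total_seconds(),
        "schedule_delay_sec": (run["started_at"] - run["scheduled_at"]).total_seconds(),
    }


def summarize(runs: list, job_name: str):
    """Aggregate one job's runs; latest_* fields come from the latest scheduled run."""
    rows = [with_derived(r) for r in runs if r["job_name"] == job_name]
    if not rows:
        return None
    latest = max(rows, key=lambda r: r["scheduled_at"])
    successes = sum(1 for r in rows if r["status"] == "success")
    return {
        "job_name": job_name,
        "run_count": len(rows),
        "success_rate": successes / len(rows),
        "avg_duration_sec": sum(r["duration_sec"] for r in rows) / len(rows),
        "latest_status": latest["status"],
        "latest_scheduled_at": latest["scheduled_at"].isoformat(),
    }


print(summarize(RUNS, "daily_ingest"))
```

Anchoring `latest_*` on the scheduled time rather than the finish time keeps the summary stable even when a late run finishes after a newer one.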
Synthetic generator -> Parquet dataset -> DuckDB queries -> FastAPI endpoints -> JSON responses
- `src/app/main.py`: app factory, routes, response shaping
- `src/app/warehouse.py`: DuckDB query layer
- `src/app/metrics_catalog.py`: metric definitions / allow-lists
- `src/app/jobs_catalog.py`: fixed synthetic job definitions used for sample generation
- `src/app/models.py`: runtime config
- `src/app/synth.py`: deterministic synthetic dataset generation
- `scripts/generate_sample.py`: sample dataset generation CLI
- `tools/write_golden_params.py`: golden parameter file generator
- `tools/regenerate_golden.py`: golden output regeneration
- `tests/`: offline API tests and golden comparisons
    analytics-metrics-api/
      pyproject.toml
      uv.lock
      README.md
      README.ja.md
      METRICS.md
      METRICS.ja.md
      docs/
        development-highlights.ja.md
      src/
        app/
          __init__.py
          main.py
          warehouse.py
          metrics_catalog.py
          models.py
          synth.py
          jobs_catalog.py
          static/
            index.html
            styles.css
            app.js
      scripts/
        generate_sample.py
      tools/
        write_golden_params.py
        regenerate_golden.py
      tests/
        conftest.py
        golden/
          params.json
          dau_by_day_rows.json
          user_42.json
        test_health.py
        test_entities.py
        test_metrics_list.py
        test_metrics_known_value.py
        test_root_page.py
      .github/
        workflows/
          ci.yml
      sql/
        debug/
          conversion_rate_window.sql
          new_users_window.sql
          dau_window_by_plan.sql
          dau_window_by_day.sql
          dau_window_by_country.sql
          users_parquet_override_debug.sql
        job/
          job_setup.sql
          job_runs_window.sql
          job_summary_by_name.sql
          job_runs_overview_by_job.sql
          cli/
            run_job_summary_by_name_line.duckdb
            run_job_summary_by_name_csv.duckdb
            run_job_runs_window.duckdb
            run_job_runs_overview_by_job.duckdb
          out/
            .gitkeep
sql/debug/ contains manual validation queries for inspecting metric logic directly in DuckDB against the local Parquet dataset. These files are development aids only and do not replace the application queries implemented in src/app/warehouse.py.
Install dependencies:

    uv sync --locked

Generate the sample dataset:

    uv run python scripts/generate_sample.py \
      --seed 18790314 \
      --start 2026-01-01 \
      --days 7 \
      --n_users 50 \
      --events_per_user 3 \
      --known_user_id 42

This writes:
data/clean/events.parquet
data/clean/users.parquet
data/clean/job_runs.parquet
Sample generation writes both data/clean/events.parquet and data/clean/users.parquet.
GET /users/{user_id} prefers data/clean/users.parquet when present. If users.parquet is absent, or the requested user is not found there, the API falls back to values derived from the earliest event row in data/clean/events.parquet.
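The lookup order above can be sketched as a small pure-Python function. This is a simplified model rather than the DuckDB-backed implementation: the in-memory rows are hypothetical, event rows are assumed to carry `country` and `plan`, and the fallback is assumed to use the earliest event's timestamp as `signup_time`.

```python
from typing import Optional

# Hypothetical stand-ins for users.parquet and events.parquet.
USERS = {
    42: {"user_id": 42, "signup_time": "2026-01-01T00:37:00Z", "country": "US", "plan": "pro"},
}
EVENTS = [
    {"user_id": 7, "event_time": "2026-01-03T10:00:00Z", "country": "JP", "plan": "free"},
    {"user_id": 7, "event_time": "2026-01-02T08:00:00Z", "country": "JP", "plan": "free"},
]


def resolve_user(user_id: int, users: Optional[dict], events: list) -> Optional[dict]:
    """Prefer the users table; otherwise derive the entity from the earliest event row."""
    if users is not None and user_id in users:
        return users[user_id]
    rows = [e for e in events if e["user_id"] == user_id]
    if not rows:
        return None  # the API would answer 404 here
    # Same-format ISO 8601 UTC strings sort chronologically.
    first = min(rows, key=lambda e: e["event_time"])
    return {
        "user_id": user_id,
        "signup_time": first["event_time"],  # assumption: earliest event time
        "country": first["country"],
        "plan": first["plan"],
    }


print(resolve_user(42, USERS, EVENTS))  # served from the users table
print(resolve_user(7, USERS, EVENTS))   # falls back to the earliest event
print(resolve_user(7, None, EVENTS))    # users table absent: same fallback
```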
Start the API:

    uv run uvicorn app.main:app --reload

Health:

    curl "http://127.0.0.1:8000/health"

Metric catalog:

    curl "http://127.0.0.1:8000/metrics"

DAU by day:

    curl "http://127.0.0.1:8000/metrics/dau?start=2026-01-01&end=2026-01-07&group_by=day&limit=365"

Job runs:

    curl "http://127.0.0.1:8000/jobs/runs?start=2026-01-01&end=2026-01-07&limit=100"
    curl "http://127.0.0.1:8000/jobs/daily_ingest/summary?start=2026-01-01&end=2026-01-07"

User entity:

    curl "http://127.0.0.1:8000/users/42"

Example response from GET /health:

    {
      "status": "ok",
      "version": "0.1.0",
      "dataset": "synthetic_saas_v0",
      "warehouse": {
        "duckdb": "ready",
        "events_rows": 150
      }
    }

Example response from GET /metrics:

    {
      "data": {
        "metrics": [
          {
            "name": "dau",
            "title": "Daily Active Users",
            "description": "Unique users with any event per day.",
            "supported_group_by": ["day", "country", "plan"],
            "required_columns": ["event_time", "user_id"]
          }
        ]
      },
      "meta": {
        "dataset": "synthetic_saas_v0"
      }
    }

Example response from GET /users/42:

    {
      "data": {
        "user_id": 42,
        "signup_time": "2026-01-01T00:37:00Z",
        "country": "US",
        "plan": "pro"
      },
      "meta": {
        "dataset": "synthetic_saas_v0"
      }
    }

A minimal browser-based demo UI is available at:
http://127.0.0.1:8000/
This page is intentionally thin. It exists as a small demo surface for the existing API and does not replace the backend/data-focused design of the project.
The browser demo currently exposes small interactive forms for:
- metric execution through `GET /metrics/{name}`
- user lookup through `GET /users/{user_id}`
- job-run listing through `GET /jobs/runs`
- one-job summary lookup through `GET /jobs/{job_name}/summary`
These forms are a convenience for inspecting the existing API; no business logic moves out of the backend.
You can still use the repository primarily through:
- `curl`
- FastAPI docs at `/docs`
- offline tests with `pytest`
This repository is designed to be offline-first.
Run the tests:

    uv run pytest

Current test coverage:
- `GET /health` returns `200`
- `GET /users/{user_id}` returns `404` for missing users
- `GET /users/{user_id}` returns stable known output for a fixed user
- `GET /metrics` returns a stable catalog structure
- `GET /metrics/dau` returns known expected rows for a fixed window
- `GET /jobs/runs` returns `200` with stable response structure for a deterministic sample dataset
- `GET /jobs/runs` applies `job_name` and `status` filters correctly
- `GET /jobs/{job_name}/summary` returns a stable aggregate response shape for a deterministic sample dataset
- job endpoints return `503` when `job_runs.parquet` is unavailable
- `GET /` returns `200` and serves the demo page with linked static assets
- The dataset used in tests is generated deterministically from fixed golden parameters.
- Golden files are committed as JSON rather than Parquet snapshots.
- Metric tests compare stable subsets such as `response["data"]["rows"]`.
- Entity tests compare `response["data"]`.
- `pytest-socket` runs with sockets disabled by default, and the `client` fixture enables socket access only where `TestClient` requires it.
- Job-endpoint tests focus on response structure, filter behavior, and graceful handling when the job_runs dataset is unavailable, rather than on brittle full-response snapshots.
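The "compare a stable subset" idea from the notes above can be sketched without any test framework. The golden content, the row shape, and the extra `meta` field below are hypothetical; the real golden files live under `tests/golden/` and the real assertions live in the test modules.

```python
import json

# Hypothetical golden content, inlined here instead of being read from tests/golden/.
GOLDEN_ROWS_JSON = '[{"day": "2026-01-01", "value": 12}, {"day": "2026-01-02", "value": 15}]'


def stable_subset(response_json: dict, path: tuple) -> object:
    """Walk into the response and return only the part worth pinning."""
    value = response_json
    for key in path:
        value = value[key]
    return value


def matches_golden(response_json: dict, golden_json: str, path: tuple = ("data", "rows")) -> bool:
    """Compare the selected subset against the committed golden JSON."""
    return stable_subset(response_json, path) == json.loads(golden_json)


# Hypothetical response: the meta block can change without breaking the check.
response = {
    "data": {"rows": [{"day": "2026-01-01", "value": 12}, {"day": "2026-01-02", "value": 15}]},
    "meta": {"dataset": "synthetic_saas_v0", "row_count": 2},
}
print(matches_golden(response, GOLDEN_ROWS_JSON))
```

Pinning only `data.rows` lets response metadata evolve while the metric values stay under regression control.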
Regenerate golden outputs:

    uv run python tools/regenerate_golden.py

This repository includes a GitHub Actions workflow for offline-first validation.
CI currently runs the following checks:
- `uv run ruff check .`
- `uv run pytest`
- `uv run pyrefly check`
The purpose of CI in this repository is to keep the project reproducible and easy to review:
- style and static issues are checked automatically
- tests run without external network dependencies
- type consistency is checked in addition to linting and tests
The goal is not to build a heavy delivery pipeline, but to show reproducible engineering discipline in a small portfolio project.
This repository also has a small public demo deployment for browser-based inspection.
- the API and the thin browser demo are served by the same FastAPI app
- the public demo is deployed as a lightweight web service rather than as a separate static site
- deployment is intentionally minimal: the goal is to make the existing read-only API easy to inspect, not to build a heavy delivery platform
- in the current setup, GitHub-backed updates to the deployed branch are reflected automatically in the public demo
This deployment layer is kept intentionally small. The main engineering signal of the repository remains the backend, query design, reproducibility, and offline-first validation.
This project avoids cloud dependencies, external APIs, and mandatory containers in the MVP. That keeps the signal focused on application structure, query logic, and reproducibility.
The dataset is generated from explicit parameters and a fixed seed. That makes the behavior inspectable and repeatable.
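A minimal sketch of seed-driven generation, assuming a local `random.Random` instance and uniform sampling. It only mirrors the CLI parameters shown earlier (`seed`, `start`, `days`, `n_users`, `events_per_user`); the function name and sampling scheme are hypothetical, and the actual logic in `src/app/synth.py` may differ.

```python
import random
from datetime import datetime, timedelta

EVENT_TYPES = ("signup", "login", "checkout", "cancel")


def generate_events(seed: int, start: datetime, days: int, n_users: int, events_per_user: int) -> list:
    """Return (user_id, event_type, event_time) tuples that depend only on the arguments."""
    rng = random.Random(seed)  # local RNG: no dependence on global random state
    events = []
    for user_id in range(1, n_users + 1):
        for _ in range(events_per_user):
            offset = timedelta(minutes=rng.randrange(days * 24 * 60))
            events.append((user_id, rng.choice(EVENT_TYPES), start + offset))
    # Sort for a stable row order regardless of generation order.
    return sorted(events, key=lambda e: (e[2], e[0]))


first = generate_events(18790314, datetime(2026, 1, 1), 7, 50, 3)
second = generate_events(18790314, datetime(2026, 1, 1), 7, 50, 3)
print(len(first), first == second)
```

Because the generator takes every input explicitly, two runs with the same arguments produce identical rows, which is what makes committed golden outputs meaningful.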
For a portfolio MVP, DuckDB and Parquet are enough to show local analytical querying, schema awareness, and efficient read patterns without hiding the logic behind infrastructure.
The routes are small, but they are shaped as resources with predictable semantics:
- catalog resource: `/metrics`
- metric resource: `/metrics/{name}`
- user resource: `/users/{user_id}`
The tests intentionally compare the stable parts of responses so implementation details can evolve without making the suite noisy.
This repository is intentionally narrow in scope.
Not included in the current v0.1.x series:
- authentication / authorization
- caching
- background jobs / orchestration
- multi-tenant design
- write endpoints (`POST`, `PUT`, `PATCH`, `DELETE`)
These are valid next steps, but not necessary to demonstrate the core engineering signal this portfolio repo is meant to show.
A recruiter or hiring manager should be able to scan this repository and answer the following in under a minute:
- Does this person understand how to design a small HTTP API?
- Does the code separate API, query, and data-generation responsibilities?
- Does the repository show reproducibility and test discipline?
- Is the scope appropriate for an MVP portfolio project?
Possible follow-up PRs:
- add richer metric contracts and additional KPIs
- add dbt-based transformations
- add a minimal TypeScript client or UI
- expand metric documentation and data contract checks
MIT License. See LICENSE file.