hipotures/twister

Modular monolith MVP for collecting tweets from the X/Twitter "For You" feed, deduplicating them, and reviewing them one tweet at a time over HTTP.

Stack

  • Python 3.12
  • FastAPI
  • Direct CDP attach to existing browser (default, no ChromeDriver needed)
  • Selenium (optional mode)
  • SQLite
  • Qdrant

Docs

  • docs/configuration.md - environment variables, defaults, and where each setting is used.
  • docs/api-contract.md - endpoint contract, validation rules, error semantics.
  • docs/operations.md - scheduler behavior, logging, troubleshooting, security notes.

Quick start

  1. Install dependencies:

```shell
uv venv .venv
source .venv/bin/activate
uv sync --extra dev
```

  2. Configure the environment:

```shell
cp .env.example .env
```

  3. Start your browser in remote-debug mode (CDP), log in to X/Twitter, and keep it running. Example (Linux):

```shell
brave-browser --remote-debugging-port=9222 --user-data-dir=/tmp/twister-cdp
```

  4. Run the API:

```shell
uv run uvicorn twister.main:app --reload
```

Important: app startup automatically starts the background scheduler jobs:

  • a collect cycle every POLL_INTERVAL_MIN
  • a retention cycle every 24h

  5. Open the review UI:
  • http://127.0.0.1:8000/review
  • Optional standalone deep link (works for any status, including rejected):
    • http://127.0.0.1:8000/review?id=<local_id>&standalone=1&debug=1
  • Overview table:
    • http://127.0.0.1:8000/overview

  6. Manual pull (without the web UI). Run collect + process directly via the API:

```shell
curl -sS -X POST http://127.0.0.1:8000/api/jobs/collect \
  -H 'Content-Type: application/json' \
  -d '{}'
```

With explicit limits:

```shell
curl -sS -X POST http://127.0.0.1:8000/api/jobs/collect \
  -H 'Content-Type: application/json' \
  -d '{"limit":20,"scroll_limit":50}'
```

Payload fields:

  • limit: max candidates to collect in one run (1..500)
  • scroll_limit: max feed scroll loops (1..500)
  • if omitted, defaults come from env: COLLECT_TARGET and SCROLL_LIMIT

The response includes:

  • run_id
  • total_candidates
  • accepted
  • rejected
  • duplicate_hard
  • duplicate_semantic
  • inserted

  7. Run tests:

```shell
uv run pytest
```

API Reference

Base URL (local): http://127.0.0.1:8000

Auth: none (local/private deployment expected).

Health

  • GET /api/health
  • Response: {status, db, qdrant, browser_cdp}
  • status is ok or degraded
  • status is degraded when the DB is failing, or when CDP is unreachable while BROWSER_MODE=cdp
  • qdrant=degraded does not by itself change the overall status

Example:

```shell
curl -sS http://127.0.0.1:8000/api/health
```
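The degradation rules above can be sketched as a small decision function (a minimal sketch with an illustrative function name; the real logic lives inside the service):

```python
def overall_status(db: str, qdrant: str, browser_cdp: str,
                   browser_mode: str = "cdp") -> str:
    """Illustrative sketch of the health rules described above.

    Overall status degrades when the DB is failing, or when CDP is
    unreachable while BROWSER_MODE=cdp. A degraded Qdrant alone is
    reported per-component but does not flip the overall status.
    """
    if db != "ok":
        return "degraded"
    if browser_mode == "cdp" and browser_cdp != "ok":
        return "degraded"
    # qdrant is intentionally ignored for the overall verdict.
    return "ok"
```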

UI pages

  • GET /review
    • HTML review page (one tweet per view)
    • Supports query params:
      • id (local tweet row id)
      • standalone=1 (hides next/prev navigation)
      • debug=1 (shows technical debug panel)
  • GET /pull
    • HTML manual pull page
  • GET /overview
    • HTML table overview (filter/sort/paginate)

Jobs and maintenance

  • POST /api/jobs/collect
    • Body:
      • limit optional, 1..500
      • scroll_limit optional, 1..500
    • Response:
      • run_id
      • total_candidates
      • accepted
      • rejected
      • duplicate_hard
      • duplicate_semantic
      • inserted

Example:

```shell
curl -sS -X POST http://127.0.0.1:8000/api/jobs/collect \
  -H 'Content-Type: application/json' \
  -d '{"limit":20,"scroll_limit":50}'
```

  • GET /api/jobs/rejected
    • Query:
      • status required: rejected_filter | rejected_hard | rejected_semantic
      • run_id optional, >=1
      • limit optional, default 100, range 1..500
    • Response: list of rejected tweets with why_shown

Example:

```shell
curl -sS "http://127.0.0.1:8000/api/jobs/rejected?status=rejected_semantic&limit=50"
```

  • POST /api/maintenance/retention/run
    • Response: {deleted}

Example:

```shell
curl -sS -X POST http://127.0.0.1:8000/api/maintenance/retention/run
```

  • POST /api/maintenance/reindex
    • Body:
      • recreate_collection bool, default true
      • batch_size int, default 200, range 1..5000
    • Response:
      • recreated_collection
      • reindexed

Example:

```shell
curl -sS -X POST http://127.0.0.1:8000/api/maintenance/reindex \
  -H 'Content-Type: application/json' \
  -d '{"recreate_collection":true,"batch_size":200}'
```

  • POST /api/maintenance/rescore
    • Body:
      • batch_size int, default 200, range 1..5000
      • include_rejected_score bool, default true
    • Response:
      • processed
      • transitioned_to_accepted
      • transitioned_to_rejected_filter
      • unchanged
      • skipped_non_score_rejected
      • qdrant_upserts
      • qdrant_deletes

Example:

```shell
curl -sS -X POST http://127.0.0.1:8000/api/maintenance/rescore \
  -H 'Content-Type: application/json' \
  -d '{"batch_size":200,"include_rejected_score":true}'
```

Review queue

  • GET /api/review/current

    • Query (optional):
      • id (local row id, works for any dedup_status)
      • tweet_id (X/Twitter tweet id as string)
    • If id is present, returns that exact row (or 404 if not found).
    • If tweet_id exists in local accepted queue, returns that exact tweet.
    • Otherwise falls back to default queue logic (first unread, then latest).
    • 404 when no tweets are available
  • GET /api/overview/tweets

    • Query:
      • page optional, default 1
      • page_size optional, default 25, range 1..100
      • sort_by optional:
        • id | tweet_id | dedup_status | unread | score | source_type | author | tweet_time | ingested_at | text_len | links_count | media_count | reason | matched_on | matched_tweet_fk | semantic_similarity | text
      • sort_dir optional: asc | desc
      • dedup_status optional CSV:
        • accepted,kept,rejected_filter,rejected_hard,rejected_semantic
      • source_type optional CSV:
        • original,quote,reply,repost,unknown
      • unread optional: 1|0|true|false|yes|no|any
      • has_links optional: 1|0|true|false|yes|no|any
      • has_media optional: 1|0|true|false|yes|no|any
      • score_min optional float
      • score_max optional float
      • q optional text search (text, author, tweet_id)
    • Response:
      • items[] overview rows with counts, reason fields, snippet, and raw why_shown
      • page, page_size, total_items, total_pages, sort_by, sort_dir
  • GET /api/review/next?tweet_id=<id>

    • Query:
      • tweet_id required, >=1 (local row id in SQLite, not source X id)
    • Returns next tweet in queue
    • 404 when no next tweet exists
  • GET /api/review/prev?tweet_id=<id>

    • Query:
      • tweet_id required, >=1 (local row id in SQLite, not source X id)
    • Returns previous tweet in queue
    • 404 when no previous tweet exists
  • POST /api/review/rate

    • Body:
      • tweet_id required
      • rating required, -1..1
    • Response: {ok:true}
    • 404 when tweet does not exist
  • POST /api/review/mark-read

    • Body:
      • tweet_id required, >=1
    • Response: {ok:true}
    • 404 when tweet does not exist
  • GET /api/review/links/{tweet_id}

    • tweet_id path param is local row id in SQLite
    • Returns list of link URLs
  • GET /api/review/media/{tweet_id}

    • tweet_id path param is local row id in SQLite
    • Returns list of media URLs
  • GET /api/review/similar

    • Query:
      • tweet_id required, >=1 (local row id in SQLite, not source X id)
      • limit optional, default 5, range 1..20
      • min_similarity optional 0..1; when omitted uses REVIEW_SIM_THRESHOLD
    • Returns top similar tweets with status accepted or kept:
      • id
      • similarity
      • tweet_id
      • permalink
      • author
      • tweet_time
      • text
    • 404 when base tweet does not exist

Example:

```shell
curl -sS "http://127.0.0.1:8000/api/review/similar?tweet_id=123&limit=5&min_similarity=0.75"
```

Tags and rating-weight learning

Tagging and score learning are fully local (SQLite), with no extra fetch after collect.

Flow:

  1. During processing, SelectService matches tweet text against keyword dictionaries (TAG_KEYWORDS).
  2. Matched tags are saved to why_shown_json.matched_tags.
  3. For accepted tweets, tags are persisted in SQLite:
     • tag_dictionary (unique tag names)
     • tweet_tags (tweet-to-tag mapping)
  4. Initial ranking score:
     • if tags are matched: base_score = number_of_matched_tags
     • if no tags are matched: base_score = 0.1
     • weight_bonus = sum(current_tag_weight for matched tags)
     • score_before_length_penalty = base_score + weight_bonus
     • short_text_penalty = 2.0 * ((max(0, GOOD_TEXT_MIN_CHARS - effective_text_length) / GOOD_TEXT_MIN_CHARS) ^ 2)
     • final_score = score_before_length_penalty - short_text_penalty
  5. Length penalty details:
     • GOOD_TEXT_MIN_CHARS (env, default 100) defines the "good" minimum length.
     • effective_text_length is measured after removing URLs and collapsing whitespace.
     • The penalty is quadratic, so very short tweets are punished more strongly than a linear penalty would.
  6. Score gate:
     • MIN_SHOW_SCORE (env, default 0.0) controls whether a tweet is shown:
       • final_score >= MIN_SHOW_SCORE -> tweet stays accepted
       • final_score < MIN_SHOW_SCORE -> tweet is stored as rejected_filter with reason rejected_score
  7. On a review rating (-1, 0, +1), the app:
     • appends an event to review_events
     • updates the tweet row (tweets.score, tweets.unread)
     • updates weights in tag_weights for all tags assigned to that tweet
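The scoring formula above can be sketched in plain Python (illustrative helper names and a simple URL regex; the real implementation may differ in detail):

```python
import re


def effective_text_length(text: str) -> int:
    # Length after removing URLs and collapsing whitespace,
    # as described in the length-penalty rules above.
    no_urls = re.sub(r"https?://\S+", "", text)
    return len(" ".join(no_urls.split()))


def final_score(text: str, matched_tags: list[str],
                tag_weights: dict[str, float],
                good_text_min_chars: int = 100) -> float:
    # base_score: tag count if any tags matched, otherwise 0.1
    base_score = float(len(matched_tags)) if matched_tags else 0.1
    # weight_bonus: sum of current weights for the matched tags
    weight_bonus = sum(tag_weights.get(t, 0.0) for t in matched_tags)
    # quadratic short-text penalty
    shortfall = max(0, good_text_min_chars - effective_text_length(text))
    short_text_penalty = 2.0 * (shortfall / good_text_min_chars) ** 2
    return base_score + weight_bonus - short_text_penalty
```

A tweet whose final_score falls below MIN_SHOW_SCORE would then be stored as rejected_filter instead of staying accepted.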

Weight update rule:

  • new_weight = (1 - alpha) * current_weight + alpha * rating
  • the current alpha is 0.2
  • if a tag has no previous weight, its first value is set to the rating

Practical effect:

  • positive ratings increase weights of related tags
  • negative ratings decrease weights of related tags
  • next collect cycles use updated weights in ranking (via weight_bonus)

Important notes

  • This service does not use Twitter/X API.
  • CDP mode stays supported even if Selenium mode is enabled later.
  • In BROWSER_MODE=cdp, collector uses Playwright connect_over_cdp.
  • Media URLs are stored as base assets and rendered with TWITTER_MEDIA_FORMAT / TWITTER_MEDIA_NAME from .env.
  • Collector depends on browser session and page structure (For You).
  • Running the API starts scheduler jobs automatically (collect + retention).
  • If Qdrant is unavailable/disabled, semantic dedup and similarity become no-op (app still runs).
  • EMBEDDING_OFFLINE=true requires model cache to exist locally; for first download, set EMBEDDING_OFFLINE=false temporarily.
  • APP_HOST / APP_PORT are currently config placeholders and are not wired to uvicorn startup.
  • Semantic embeddings backend:
    • EMBEDDING_BACKEND=tweetnlp
    • EMBEDDING_MODEL_NAME=cambridgeltl/tweet-roberta-base-embeddings-v1
    • EMBEDDING_OFFLINE=true (no outbound HF requests after model is cached)
    • EMBEDDING_DIM=0 (auto-detect from model; recommended)

Compare embedding quality locally

Run an A/B benchmark on your existing SQLite data (previously collected rejected_hard and rejected_semantic rows are used as positive pairs):

```shell
uv run python -m twister.tools.benchmark_embeddings \
  --sqlite-path data/twister.db \
  --model "hash:256" \
  --model "tweetnlp:cambridgeltl/tweet-roberta-base-embeddings-v1" \
  --model "sbert:sentence-transformers/all-MiniLM-L6-v2" \
  --pos-limit 500 \
  --neg-limit 1000 \
  --dedup-threshold 0.92 \
  --json-out data/embedding_benchmark.json
```

Output metrics per model:

  • auc
  • best_threshold, best_f1
  • precision/recall/F1 at the supplied --dedup-threshold
  • runtime
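As an illustration of how thresholded precision/recall/F1 are computed from scored pairs (a generic sketch, not the benchmark tool's actual code):

```python
def prf_at_threshold(pairs: list[tuple[float, bool]],
                     threshold: float) -> tuple[float, float, float]:
    """pairs: (cosine_similarity, is_true_duplicate).

    Pairs scoring at or above the threshold are predicted duplicates;
    the rest are predicted non-duplicates.
    """
    tp = sum(1 for s, pos in pairs if s >= threshold and pos)
    fp = sum(1 for s, pos in pairs if s >= threshold and not pos)
    fn = sum(1 for s, pos in pairs if s < threshold and pos)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Sweeping the threshold over all observed similarities and keeping the best F1 yields the reported best_threshold / best_f1 pair.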

If you want to benchmark sbert:* models, install:

uv add sentence-transformers
