Modular monolith MVP for collecting tweets from X/Twitter For You, deduplicating them, and reviewing one tweet at a time over HTTP.
- Python 3.12
- FastAPI
- Direct CDP attach to existing browser (default, no ChromeDriver needed)
- Selenium (optional mode)
- SQLite
- Qdrant
- `docs/configuration.md`: environment variables, defaults, and where each setting is used.
- `docs/api-contract.md`: endpoint contract, validation rules, error semantics.
- `docs/operations.md`: scheduler behavior, logging, troubleshooting, security notes.
- Install dependencies:

  ```bash
  uv venv .venv
  source .venv/bin/activate
  uv sync --extra dev
  ```

- Configure environment:

  ```bash
  cp .env.example .env
  ```

- Start your browser in remote-debug mode (CDP), log in to X/Twitter, and keep it running. Example (Linux):

  ```bash
  brave-browser --remote-debugging-port=9222 --user-data-dir=/tmp/twister-cdp
  ```

- Run the API:

  ```bash
  uv run uvicorn twister.main:app --reload
  ```

  Important: app startup automatically starts background scheduler jobs:
  - collect cycle every `POLL_INTERVAL_MIN`
  - retention cycle every 24h
- Open the review UI: `http://127.0.0.1:8000/review`
- Optional standalone deep link (works for any status, including rejected): `http://127.0.0.1:8000/review?id=<local_id>&standalone=1&debug=1`
- Overview table: `http://127.0.0.1:8000/overview`
- Manual pull (without the web UI): run collect + process directly via the API:

  ```bash
  curl -sS -X POST http://127.0.0.1:8000/api/jobs/collect \
    -H 'Content-Type: application/json' \
    -d '{}'
  ```

  With explicit limits:

  ```bash
  curl -sS -X POST http://127.0.0.1:8000/api/jobs/collect \
    -H 'Content-Type: application/json' \
    -d '{"limit":20,"scroll_limit":50}'
  ```

  Payload fields:
  - `limit`: max candidates to collect in one run (1..500)
  - `scroll_limit`: max feed scroll loops (1..500)
  - if omitted, defaults come from env: `COLLECT_TARGET` and `SCROLL_LIMIT`

  Response includes: `run_id`, `total_candidates`, `accepted`, `rejected`, `duplicate_hard`, `duplicate_semantic`, `inserted`
- Run tests:

  ```bash
  uv run pytest
  ```

Base URL (local): `http://127.0.0.1:8000`
Auth: none (local/private deployment expected).
`GET /api/health`
- Response: `{status, db, qdrant, browser_cdp}`
  - `status` is `ok` or `degraded`
  - `status` is `degraded` when the DB is failing or CDP is unreachable in `BROWSER_MODE=cdp`
  - `qdrant=degraded` does not by itself change the overall `status`
- Example:

  ```bash
  curl -sS http://127.0.0.1:8000/api/health
  ```

`GET /review`
- HTML review page (one tweet per view)
- Supports query params:
  - `id` (local tweet row id)
  - `standalone=1` (hides next/prev navigation)
  - `debug=1` (shows technical debug panel)
`GET /pull`
- HTML manual pull page

`GET /overview`
- HTML table overview (filter/sort/paginate)
`POST /api/jobs/collect`
- Body:
  - `limit` optional, 1..500
  - `scroll_limit` optional, 1..500
- Response: `run_id`, `total_candidates`, `accepted`, `rejected`, `duplicate_hard`, `duplicate_semantic`, `inserted`
- Example:

  ```bash
  curl -sS -X POST http://127.0.0.1:8000/api/jobs/collect \
    -H 'Content-Type: application/json' \
    -d '{"limit":20,"scroll_limit":50}'
  ```

`GET /api/jobs/rejected`
- Query:
  - `status` required: `rejected_filter | rejected_hard | rejected_semantic`
  - `run_id` optional, >=1
  - `limit` optional, default 100, range 1..500
- Response: list of rejected tweets with `why_shown`
- Example:

  ```bash
  curl -sS "http://127.0.0.1:8000/api/jobs/rejected?status=rejected_semantic&limit=50"
  ```

`POST /api/maintenance/retention/run`
- Response: `{deleted}`
- Example:

  ```bash
  curl -sS -X POST http://127.0.0.1:8000/api/maintenance/retention/run
  ```

`POST /api/maintenance/reindex`
- Body:
  - `recreate_collection` bool, default `true`
  - `batch_size` int, default 200, range 1..5000
- Response: `recreated_collection`, `reindexed`
- Example:

  ```bash
  curl -sS -X POST http://127.0.0.1:8000/api/maintenance/reindex \
    -H 'Content-Type: application/json' \
    -d '{"recreate_collection":true,"batch_size":200}'
  ```

`POST /api/maintenance/rescore`
- Body:
  - `batch_size` int, default 200, range 1..5000
  - `include_rejected_score` bool, default `true`
- Response: `processed`, `transitioned_to_accepted`, `transitioned_to_rejected_filter`, `unchanged`, `skipped_non_score_rejected`, `qdrant_upserts`, `qdrant_deletes`
- Example:

  ```bash
  curl -sS -X POST http://127.0.0.1:8000/api/maintenance/rescore \
    -H 'Content-Type: application/json' \
    -d '{"batch_size":200,"include_rejected_score":true}'
  ```
`GET /api/review/current`
- Query (optional):
  - `id` (local row id, works for any `dedup_status`)
  - `tweet_id` (X/Twitter tweet id as string)
- If `id` is present, returns that exact row (or 404 if not found).
- If `tweet_id` exists in the local accepted queue, returns that exact tweet.
- Otherwise falls back to default queue logic (first unread, then latest).
- 404 when no tweets are available

`GET /api/overview/tweets`
- Query:
  - `page` optional, default 1
  - `page_size` optional, default 25, range 1..100
  - `sort_by` optional: `id | tweet_id | dedup_status | unread | score | source_type | author | tweet_time | ingested_at | text_len | links_count | media_count | reason | matched_on | matched_tweet_fk | semantic_similarity | text`
  - `sort_dir` optional: `asc | desc`
  - `dedup_status` optional CSV: `accepted,kept,rejected_filter,rejected_hard,rejected_semantic`
  - `source_type` optional CSV: `original,quote,reply,repost,unknown`
  - `unread` optional: `1|0|true|false|yes|no|any`
  - `has_links` optional: `1|0|true|false|yes|no|any`
  - `has_media` optional: `1|0|true|false|yes|no|any`
  - `score_min` optional float
  - `score_max` optional float
  - `q` optional text search (`text`, `author`, `tweet_id`)
- Response:
  - `items[]`: overview rows with counts, reason fields, snippet, and raw `why_shown`
  - `page`, `page_size`, `total_items`, `total_pages`, `sort_by`, `sort_dir`
`GET /api/review/next?tweet_id=<id>`
- Query:
  - `tweet_id` required, >=1 (local row id in SQLite, not the source X id)
- Returns the next tweet in the queue
- 404 when no next tweet exists

`GET /api/review/prev?tweet_id=<id>`
- Query:
  - `tweet_id` required, >=1 (local row id in SQLite, not the source X id)
- Returns the previous tweet in the queue
- 404 when no previous tweet exists

`POST /api/review/rate`
- Body:
  - `tweet_id` required
  - `rating` required, -1..1
- Response: `{ok:true}`
- 404 when the tweet does not exist

`POST /api/review/mark-read`
- Body:
  - `tweet_id` required, >=1
- Response: `{ok:true}`
- 404 when the tweet does not exist

`GET /api/review/links/{tweet_id}`
- `tweet_id` path param is the local row id in SQLite
- Returns a list of link URLs

`GET /api/review/media/{tweet_id}`
- `tweet_id` path param is the local row id in SQLite
- Returns a list of media URLs
`GET /api/review/similar`
- Query:
  - `tweet_id` required, >=1 (local row id in SQLite, not the source X id)
  - `limit` optional, default 5, range 1..20
  - `min_similarity` optional, 0..1; when omitted uses `REVIEW_SIM_THRESHOLD`
- Returns top similar tweets with status `accepted` or `kept`: `id`, `similarity`, `tweet_id`, `permalink`, `author`, `tweet_time`, `text`
- 404 when the base tweet does not exist
- Example:

  ```bash
  curl -sS "http://127.0.0.1:8000/api/review/similar?tweet_id=123&limit=5&min_similarity=0.75"
  ```

Tagging and score learning are fully local (SQLite), with no extra fetch after collect.
Flow:
- During processing, `SelectService` matches tweet text against keyword dictionaries (`TAG_KEYWORDS`).
- Matched tags are saved to `why_shown_json.matched_tags`.
- For accepted tweets, tags are persisted in SQLite:
  - `tag_dictionary` (unique tag names)
  - `tweet_tags` (tweet-to-tag mapping)
- Initial ranking score:
  - if tags are matched: `base_score = number_of_matched_tags`
  - if no tags are matched: `base_score = 0.1`
  - `weight_bonus = sum(current_tag_weight for matched tags)`
  - `score_before_length_penalty = base_score + weight_bonus`
  - `short_text_penalty = 2.0 * ((max(0, GOOD_TEXT_MIN_CHARS - effective_text_length) / GOOD_TEXT_MIN_CHARS) ^ 2)`
  - `final_score = score_before_length_penalty - short_text_penalty`
- Length penalty details:
  - `GOOD_TEXT_MIN_CHARS` (env, default 100) defines the "good" minimum length.
  - `effective_text_length` is measured after removing URLs and collapsing whitespace.
  - The penalty is quadratic, so very short tweets are penalized more strongly than a linear rule would.
- Score gate:
  - `MIN_SHOW_SCORE` (env, default 0.0) controls whether a tweet is shown:
    - `final_score >= MIN_SHOW_SCORE` -> tweet stays `accepted`
    - `final_score < MIN_SHOW_SCORE` -> tweet is stored as `rejected_filter` with reason `rejected_score`
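The scoring steps above can be sketched in plain Python. Function and constant names here are illustrative, not the actual `SelectService` internals, and the URL-stripping regex is an assumption about how `effective_text_length` is computed:

```python
import re

GOOD_TEXT_MIN_CHARS = 100  # env default
MIN_SHOW_SCORE = 0.0       # env default


def effective_text_length(text: str) -> int:
    """Length after removing URLs and collapsing whitespace (assumed normalization)."""
    no_urls = re.sub(r"https?://\S+", "", text)
    return len(" ".join(no_urls.split()))


def final_score(matched_tags: list[str], tag_weights: dict[str, float], text: str) -> float:
    # base_score: number of matched tags, or 0.1 when nothing matched
    base_score = len(matched_tags) if matched_tags else 0.1
    # weight_bonus: sum of current weights for the matched tags
    weight_bonus = sum(tag_weights.get(tag, 0.0) for tag in matched_tags)
    # quadratic short-text penalty, capped at 2.0 for empty text
    length_gap = max(0, GOOD_TEXT_MIN_CHARS - effective_text_length(text))
    short_text_penalty = 2.0 * (length_gap / GOOD_TEXT_MIN_CHARS) ** 2
    return base_score + weight_bonus - short_text_penalty


def gate(score: float) -> str:
    """Score gate: accepted vs rejected_filter (reason rejected_score)."""
    return "accepted" if score >= MIN_SHOW_SCORE else "rejected_filter"
```

For example, a 100+ character tweet matching two tags, one of which carries weight 0.5, scores 2.5 and passes the gate; an empty tweet with no tags scores 0.1 - 2.0 = -1.9 and is rejected.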
- On review rating (-1, 0, +1), the app:
  - appends an event to `review_events`
  - updates the tweet row (`tweets.score`, `tweets.unread`)
  - updates weights in `tag_weights` for all tags assigned to that tweet

Weight update rule:
- `new_weight = (1 - alpha) * current_weight + alpha * rating`
- current `alpha` is 0.2
- if a tag has no previous weight, its first value is set to `rating`
Practical effect:
- positive ratings increase the weights of related tags
- negative ratings decrease the weights of related tags
- subsequent collect cycles use the updated weights in ranking (via `weight_bonus`)
- This service does not use the Twitter/X API.
- CDP mode stays supported even if Selenium mode is enabled later.
- In `BROWSER_MODE=cdp`, the collector uses Playwright `connect_over_cdp`.
- Media URLs are stored as base assets and rendered with `TWITTER_MEDIA_FORMAT`/`TWITTER_MEDIA_NAME` from `.env`.
- The collector depends on the browser session and page structure (`For You`).
- Running the API starts scheduler jobs automatically (collect + retention).
- If Qdrant is unavailable/disabled, semantic dedup and similarity become no-ops (the app still runs).
- `EMBEDDING_OFFLINE=true` requires the model cache to exist locally; for the first download, set `EMBEDDING_OFFLINE=false` temporarily.
- `APP_HOST`/`APP_PORT` are currently config placeholders and are not wired to `uvicorn` startup.
- Semantic embeddings backend:
  - `EMBEDDING_BACKEND=tweetnlp`
  - `EMBEDDING_MODEL_NAME=cambridgeltl/tweet-roberta-base-embeddings-v1`
  - `EMBEDDING_OFFLINE=true` (no outbound HF requests after the model is cached)
  - `EMBEDDING_DIM=0` (auto-detect from the model; recommended)
Run the A/B benchmark on your existing SQLite data (uses previously collected `rejected_hard` and `rejected_semantic` rows as positive pairs):

```bash
uv run python -m twister.tools.benchmark_embeddings \
  --sqlite-path data/twister.db \
  --model "hash:256" \
  --model "tweetnlp:cambridgeltl/tweet-roberta-base-embeddings-v1" \
  --model "sbert:sentence-transformers/all-MiniLM-L6-v2" \
  --pos-limit 500 \
  --neg-limit 1000 \
  --dedup-threshold 0.92 \
  --json-out data/embedding_benchmark.json
```

Output metrics per model:
- `auc`
- `best_threshold`, `best_f1`
- precision/recall/F1 at the current `dedup-threshold`
- runtime
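For intuition, precision/recall/F1 at a fixed threshold can be computed as in this sketch, where a pair is predicted to be a duplicate when its similarity meets the threshold (the benchmark tool's actual implementation may differ):

```python
def metrics_at_threshold(
    pos_sims: list[float], neg_sims: list[float], threshold: float
) -> dict[str, float]:
    """pos_sims: similarities of true duplicate pairs; neg_sims: of distinct pairs."""
    tp = sum(s >= threshold for s in pos_sims)  # duplicates correctly flagged
    fn = len(pos_sims) - tp                     # duplicates missed
    fp = sum(s >= threshold for s in neg_sims)  # distinct pairs wrongly flagged
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}
```

Raising `--dedup-threshold` trades recall (missed duplicates) for precision (fewer false merges); `best_threshold` is the point that maximizes F1 on your data.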
If you want to benchmark `sbert:*` models, install:

```bash
uv add sentence-transformers
```