hipotures/twister

Modular monolith MVP for collecting tweets from the X/Twitter "For You" feed, deduplicating them, and reviewing them one tweet at a time over HTTP.

Stack

  • Python 3.12
  • FastAPI
  • Direct CDP attach to existing browser (default, no ChromeDriver needed)
  • Selenium (optional mode)
  • SQLite
  • Qdrant

Docs

  • docs/configuration.md - environment variables, defaults, and where each setting is used.
  • docs/api-contract.md - endpoint contract, validation rules, error semantics.
  • docs/operations.md - scheduler behavior, logging, troubleshooting, security notes.

Quick start

  1. Install dependencies:

```shell
uv venv .venv
source .venv/bin/activate
uv sync --extra dev
```

  2. Configure the environment:

```shell
cp .env.example .env
```

  3. Start your browser in remote-debug mode (CDP), log in to X/Twitter, and keep it running. Example (Linux):

```shell
brave-browser --remote-debugging-port=9222 --user-data-dir=/tmp/twister-cdp
```

  4. Run the API:

```shell
uv run uvicorn twister.main:app --reload
```

Important: app startup automatically starts the background scheduler jobs:

  • a collect cycle every POLL_INTERVAL_MIN
  • a retention cycle every 24h

  5. Open the review UI:
  • http://127.0.0.1:8000/review
  • Optional standalone deep link (works for any status, including rejected):
    • http://127.0.0.1:8000/review?id=<local_id>&standalone=1&debug=1
  • Overview table:
    • http://127.0.0.1:8000/overview

  6. Manual pull (without the web UI). Run collect + process directly via the API:

```shell
curl -sS -X POST http://127.0.0.1:8000/api/jobs/collect \
  -H 'Content-Type: application/json' \
  -d '{}'
```

With explicit limits:

```shell
curl -sS -X POST http://127.0.0.1:8000/api/jobs/collect \
  -H 'Content-Type: application/json' \
  -d '{"limit":20,"scroll_limit":50}'
```

Payload fields:

  • limit: max candidates to collect in one run (1..500)
  • scroll_limit: max feed scroll loops (1..500)
  • if omitted, defaults come from env: COLLECT_TARGET and SCROLL_LIMIT

The response includes:

  • run_id
  • total_candidates
  • accepted
  • rejected
  • duplicate_hard
  • duplicate_semantic
  • inserted

  7. Run tests:

```shell
uv run pytest
```

API Reference

Base URL (local): http://127.0.0.1:8000

Auth: none (local/private deployment expected).

Health

  • GET /api/health
  • Response: {status, db, qdrant, browser_cdp}
  • status is ok or degraded
  • status is degraded when the DB is failing, or when CDP is unreachable while BROWSER_MODE=cdp
  • qdrant=degraded does not by itself change the overall status

Example:

```shell
curl -sS http://127.0.0.1:8000/api/health
```
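The degradation rules above can be sketched as a small decision function (a minimal sketch with an illustrative function name; the real logic lives inside the service):

```python
def overall_status(db: str, qdrant: str, browser_cdp: str,
                   browser_mode: str = "cdp") -> str:
    """Illustrative sketch of the health rules described above.

    Overall status degrades when the DB is failing, or when CDP is
    unreachable while BROWSER_MODE=cdp. A degraded Qdrant alone is
    reported per-component but does not flip the overall status.
    """
    if db != "ok":
        return "degraded"
    if browser_mode == "cdp" and browser_cdp != "ok":
        return "degraded"
    # qdrant is intentionally ignored for the overall verdict.
    return "ok"
```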

UI pages

  • GET /review
    • HTML review page (one tweet per view)
    • Supports query params:
      • id (local tweet row id)
      • standalone=1 (hides next/prev navigation)
      • debug=1 (shows technical debug panel)
  • GET /pull
    • HTML manual pull page
  • GET /overview
    • HTML table overview (filter/sort/paginate)

Jobs and maintenance

  • POST /api/jobs/collect
    • Body:
      • limit optional, 1..500
      • scroll_limit optional, 1..500
    • Response:
      • run_id
      • total_candidates
      • accepted
      • rejected
      • duplicate_hard
      • duplicate_semantic
      • inserted

Example:

```shell
curl -sS -X POST http://127.0.0.1:8000/api/jobs/collect \
  -H 'Content-Type: application/json' \
  -d '{"limit":20,"scroll_limit":50}'
```

  • GET /api/jobs/rejected
    • Query:
      • status required: rejected_filter | rejected_hard | rejected_semantic
      • run_id optional, >=1
      • limit optional, default 100, range 1..500
    • Response: list of rejected tweets with why_shown

Example:

```shell
curl -sS "http://127.0.0.1:8000/api/jobs/rejected?status=rejected_semantic&limit=50"
```

  • POST /api/maintenance/retention/run
    • Response: {deleted}

Example:

```shell
curl -sS -X POST http://127.0.0.1:8000/api/maintenance/retention/run
```

  • POST /api/maintenance/reindex
    • Body:
      • recreate_collection bool, default true
      • batch_size int, default 200, range 1..5000
    • Response:
      • recreated_collection
      • reindexed

Example:

```shell
curl -sS -X POST http://127.0.0.1:8000/api/maintenance/reindex \
  -H 'Content-Type: application/json' \
  -d '{"recreate_collection":true,"batch_size":200}'
```

  • POST /api/maintenance/rescore
    • Body:
      • batch_size int, default 200, range 1..5000
      • include_rejected_score bool, default true
    • Response:
      • processed
      • transitioned_to_accepted
      • transitioned_to_rejected_filter
      • unchanged
      • skipped_non_score_rejected
      • qdrant_upserts
      • qdrant_deletes

Example:

```shell
curl -sS -X POST http://127.0.0.1:8000/api/maintenance/rescore \
  -H 'Content-Type: application/json' \
  -d '{"batch_size":200,"include_rejected_score":true}'
```

Review queue

  • GET /api/review/current

    • Query (optional):
      • id (local row id, works for any dedup_status)
      • tweet_id (X/Twitter tweet id as string)
    • If id is present, returns that exact row (or 404 if not found).
    • If tweet_id exists in local accepted queue, returns that exact tweet.
    • Otherwise falls back to default queue logic (first unread, then latest).
    • 404 when no tweets are available
  • GET /api/overview/tweets

    • Query:
      • page optional, default 1
      • page_size optional, default 25, range 1..100
      • sort_by optional:
        • id | tweet_id | dedup_status | unread | score | source_type | author | tweet_time | ingested_at | text_len | links_count | media_count | reason | matched_on | matched_tweet_fk | semantic_similarity | text
      • sort_dir optional: asc | desc
      • dedup_status optional CSV:
        • accepted,kept,rejected_filter,rejected_hard,rejected_semantic
      • source_type optional CSV:
        • original,quote,reply,repost,unknown
      • unread optional: 1|0|true|false|yes|no|any
      • has_links optional: 1|0|true|false|yes|no|any
      • has_media optional: 1|0|true|false|yes|no|any
      • score_min optional float
      • score_max optional float
      • q optional text search (text, author, tweet_id)
    • Response:
      • items[] overview rows with counts, reason fields, snippet, and raw why_shown
      • page, page_size, total_items, total_pages, sort_by, sort_dir
  • GET /api/review/next?tweet_id=<id>

    • Query:
      • tweet_id required, >=1 (local row id in SQLite, not source X id)
    • Returns next tweet in queue
    • 404 when no next tweet exists
  • GET /api/review/prev?tweet_id=<id>

    • Query:
      • tweet_id required, >=1 (local row id in SQLite, not source X id)
    • Returns previous tweet in queue
    • 404 when no previous tweet exists
  • POST /api/review/rate

    • Body:
      • tweet_id required
      • rating required, -1..1
    • Response: {ok:true}
    • 404 when tweet does not exist
  • POST /api/review/mark-read

    • Body:
      • tweet_id required, >=1
    • Response: {ok:true}
    • 404 when tweet does not exist
  • GET /api/review/links/{tweet_id}

    • tweet_id path param is local row id in SQLite
    • Returns list of link URLs
  • GET /api/review/media/{tweet_id}

    • tweet_id path param is local row id in SQLite
    • Returns list of media URLs
  • GET /api/review/similar

    • Query:
      • tweet_id required, >=1 (local row id in SQLite, not source X id)
      • limit optional, default 5, range 1..20
      • min_similarity optional 0..1; when omitted uses REVIEW_SIM_THRESHOLD
    • Returns top similar tweets with status accepted or kept:
      • id
      • similarity
      • tweet_id
      • permalink
      • author
      • tweet_time
      • text
    • 404 when base tweet does not exist

Example:

```shell
curl -sS "http://127.0.0.1:8000/api/review/similar?tweet_id=123&limit=5&min_similarity=0.75"
```

Tags and rating-weight learning

Tagging and score learning are fully local (SQLite), with no extra fetch after collect.

Flow:

  1. During processing, SelectService matches tweet text against keyword dictionaries (TAG_KEYWORDS).
  2. Matched tags are saved to why_shown_json.matched_tags.
  3. For accepted tweets, tags are persisted in SQLite:
     • tag_dictionary (unique tag names)
     • tweet_tags (tweet-to-tag mapping)
  4. Initial ranking score:
     • if tags are matched: base_score = number_of_matched_tags
     • if no tags are matched: base_score = 0.1
     • weight_bonus = sum(current_tag_weight for matched tags)
     • score_before_length_penalty = base_score + weight_bonus
     • short_text_penalty = 2.0 * ((max(0, GOOD_TEXT_MIN_CHARS - effective_text_length) / GOOD_TEXT_MIN_CHARS) ^ 2)
     • final_score = score_before_length_penalty - short_text_penalty
  5. Length penalty details:
     • GOOD_TEXT_MIN_CHARS (env, default 100) defines the "good" minimum length.
     • effective_text_length is measured after removing URLs and collapsing whitespace.
     • The penalty is quadratic, so very short tweets are punished more strongly than a linear penalty would.
  6. Score gate:
     • MIN_SHOW_SCORE (env, default 0.0) controls whether a tweet is shown:
       • final_score >= MIN_SHOW_SCORE -> tweet stays accepted
       • final_score < MIN_SHOW_SCORE -> tweet is stored as rejected_filter with reason rejected_score
  7. On a review rating (-1, 0, +1), the app:
     • appends an event to review_events
     • updates the tweet row (tweets.score, tweets.unread)
     • updates weights in tag_weights for all tags assigned to that tweet
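The scoring formula above can be sketched in plain Python (illustrative helper names and a simple URL regex; the real implementation may differ in detail):

```python
import re


def effective_text_length(text: str) -> int:
    # Length after removing URLs and collapsing whitespace,
    # as described in the length-penalty rules above.
    no_urls = re.sub(r"https?://\S+", "", text)
    return len(" ".join(no_urls.split()))


def final_score(text: str, matched_tags: list[str],
                tag_weights: dict[str, float],
                good_text_min_chars: int = 100) -> float:
    # base_score: tag count if any tags matched, otherwise 0.1
    base_score = float(len(matched_tags)) if matched_tags else 0.1
    # weight_bonus: sum of current weights for the matched tags
    weight_bonus = sum(tag_weights.get(t, 0.0) for t in matched_tags)
    # quadratic short-text penalty
    shortfall = max(0, good_text_min_chars - effective_text_length(text))
    short_text_penalty = 2.0 * (shortfall / good_text_min_chars) ** 2
    return base_score + weight_bonus - short_text_penalty
```

A tweet whose final_score falls below MIN_SHOW_SCORE would then be stored as rejected_filter instead of staying accepted.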

Weight update rule:

  • new_weight = (1 - alpha) * current_weight + alpha * rating
  • the current alpha is 0.2
  • if a tag has no previous weight, its first value is set to the rating

Practical effect:

  • positive ratings increase weights of related tags
  • negative ratings decrease weights of related tags
  • next collect cycles use updated weights in ranking (via weight_bonus)

Important notes

  • This service does not use Twitter/X API.
  • CDP mode stays supported even if Selenium mode is enabled later.
  • In BROWSER_MODE=cdp, collector uses Playwright connect_over_cdp.
  • Media URLs are stored as base assets and rendered with TWITTER_MEDIA_FORMAT / TWITTER_MEDIA_NAME from .env.
  • Collector depends on browser session and page structure (For You).
  • Running the API starts scheduler jobs automatically (collect + retention).
  • If Qdrant is unavailable/disabled, semantic dedup and similarity become no-op (app still runs).
  • EMBEDDING_OFFLINE=true requires model cache to exist locally; for first download, set EMBEDDING_OFFLINE=false temporarily.
  • APP_HOST / APP_PORT are currently config placeholders and are not wired to uvicorn startup.
  • Semantic embeddings backend:
    • EMBEDDING_BACKEND=tweetnlp
    • EMBEDDING_MODEL_NAME=cambridgeltl/tweet-roberta-base-embeddings-v1
    • EMBEDDING_OFFLINE=true (no outbound HF requests after model is cached)
    • EMBEDDING_DIM=0 (auto-detect from model; recommended)

Compare embedding quality locally

Run an A/B benchmark on your existing SQLite data (previously collected rejected_hard and rejected_semantic rows are used as positive pairs):

```shell
uv run python -m twister.tools.benchmark_embeddings \
  --sqlite-path data/twister.db \
  --model "hash:256" \
  --model "tweetnlp:cambridgeltl/tweet-roberta-base-embeddings-v1" \
  --model "sbert:sentence-transformers/all-MiniLM-L6-v2" \
  --pos-limit 500 \
  --neg-limit 1000 \
  --dedup-threshold 0.92 \
  --json-out data/embedding_benchmark.json
```

Output metrics per model:

  • auc
  • best_threshold, best_f1
  • precision/recall/F1 at the supplied --dedup-threshold
  • runtime
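As an illustration of how thresholded precision/recall/F1 are computed from scored pairs (a generic sketch, not the benchmark tool's actual code):

```python
def prf_at_threshold(pairs: list[tuple[float, bool]],
                     threshold: float) -> tuple[float, float, float]:
    """pairs: (cosine_similarity, is_true_duplicate).

    Pairs scoring at or above the threshold are predicted duplicates;
    the rest are predicted non-duplicates.
    """
    tp = sum(1 for s, pos in pairs if s >= threshold and pos)
    fp = sum(1 for s, pos in pairs if s >= threshold and not pos)
    fn = sum(1 for s, pos in pairs if s < threshold and pos)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Sweeping the threshold over all observed similarities and keeping the best F1 yields the reported best_threshold / best_f1 pair.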

If you want to benchmark sbert:* models, install:

uv add sentence-transformers
