⚡ Bolt: Optimize telemetry pipeline and fix state cache regression #326

Open
heidi-dang wants to merge 1 commit into feat/bootstrap-scaffold from bolt-telemetry-optimization-v2-17046132322638879759

@heidi-dang
Owner

💡 What: Optimized the telemetry pipeline by implementing a thread-safe cache for pricing configuration and improving event flushing I/O. Also fixed a critical NameError in the state retrieval logic.

🎯 Why: High-frequency operations like token tracking and event flushing were performing redundant disk I/O and JSON parsing, impacting overall pipeline efficiency. Additionally, a regression in the state cache logic was causing application crashes.

📊 Impact:

  • Resolves a critical crash in get_state.
  • ~2.5x speedup for load_pricing_config lookups (measured 2.4s vs 6.0s for 100k iterations).
  • Reduced overhead in the telemetry event bus.
  • Compliance with modern Python (3.12+) datetime standards.
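
For the datetime item above, a minimal sketch of the replacement, assuming timestamps are serialized as ISO 8601 strings. The helper name `utc_timestamp` is hypothetical and not taken from this PR.

```python
from datetime import datetime, timezone


def utc_timestamp() -> str:
    """Return the current UTC time as an ISO 8601 string (hypothetical helper)."""
    # datetime.utcnow() returns a naive datetime and is deprecated as of
    # Python 3.12; datetime.now(timezone.utc) returns a timezone-aware one.
    return datetime.now(timezone.utc).isoformat()


stamp = utc_timestamp()
parsed = datetime.fromisoformat(stamp)
```

Note that the aware form serializes with a `+00:00` suffix, so any consumer that compares or parses the old naive strings may need a matching update.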

🔬 Measurement:

  • Verified with pytest tests/test_telemetry_cache.py.
  • Benchmarked cache impact with a dedicated script.
  • Linted with ruff check.
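
The benchmark script itself is not included in this description. As a rough sketch of how such a comparison could be set up with `timeit`, assuming a small JSON pricing file and two hypothetical stand-in loaders (`load_uncached`, `load_cached`) rather than the real `load_pricing_config`:

```python
import json
import os
import tempfile
import timeit

_memo = {}  # hypothetical single-threaded memo, for timing purposes only


def load_uncached(path):
    """Read and parse the file on every call."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_cached(path):
    """Parse the file once, then serve the stored result."""
    if path not in _memo:
        _memo[path] = load_uncached(path)
    return _memo[path]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pricing.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"model-a": {"input": 1.0, "output": 2.0}}, f)

    n = 2_000  # far fewer than the 100k iterations quoted above
    uncached_s = timeit.timeit(lambda: load_uncached(path), number=n)
    cached_s = timeit.timeit(lambda: load_cached(path), number=n)
    uncached_result = load_uncached(path)
    cached_result = load_cached(path)

print(f"uncached: {uncached_s:.4f}s  cached: {cached_s:.4f}s")
```

Absolute timings depend on the machine and file system, so only the ratio between the two runs is meaningful.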

PR created automatically by Jules for task 17046132322638879759 started by @heidi-dang

This commit implements several performance improvements and a critical bug fix in the telemetry module:

1. 🐞 **Fix critical NameError**: Resolved a regression in `get_state` where an undefined `target_run_id` caused crashes on cache misses.
2. ⚡ **Pricing Cache**: Implemented a thread-safe module-level cache for `load_pricing_config` with a 5.0s TTL. This eliminates redundant disk I/O and JSON parsing during high-frequency token tracking, yielding a ~2.5x speedup in benchmarks.
3. ⚡ **Flush Optimization**: Optimized `flush_events` by using `f.writelines()` with a generator expression, reducing Python-to-C overhead during disk writes.
4. 🛠️ **Modernized Timestamps**: Replaced deprecated `datetime.utcnow()` with `datetime.now(timezone.utc)` for Python 3.12+ compatibility.
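
The pricing cache in item 2 could look roughly like the sketch below. Only the 5.0s TTL and the `_pricing_lock` name come from this thread; the `pricing_file` parameter, the other module-level names, and the empty-dict fallback are assumptions made for illustration.

```python
import json
import threading
import time
from pathlib import Path
from typing import Any, Dict, Optional

PRICING_TTL_SECONDS = 5.0  # TTL stated in the commit message

_pricing_lock = threading.Lock()
_pricing_cache: Optional[Dict[str, Any]] = None  # hypothetical name
_pricing_expires_at = 0.0  # hypothetical name


def load_pricing_config(pricing_file: Path) -> Dict[str, Any]:
    """Return the parsed config, re-reading the file at most once per TTL."""
    global _pricing_cache, _pricing_expires_at
    with _pricing_lock:
        now = time.monotonic()  # monotonic clock is immune to wall-clock jumps
        if _pricing_cache is not None and now < _pricing_expires_at:
            return _pricing_cache
        try:
            data = json.loads(pricing_file.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            data = {}  # assumption: fall back to an empty config
        _pricing_cache = data
        _pricing_expires_at = now + PRICING_TTL_SECONDS
        return data
```

In this sketch every caller within the TTL window receives the same dict object, so callers should treat it as read-only.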

Verified with unit tests (`pytest`) and micro-benchmarks. No breaking changes introduced.
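
For reference, the `writelines()` pattern from item 3 can be sketched as follows, assuming events are dicts written as JSON Lines. The signature shown here is illustrative and may not match the real `flush_events`.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable


def flush_events(events: Iterable[Dict[str, Any]], path: str) -> None:
    """Append events to a JSON Lines file with a single writelines() call."""
    with open(path, "a", encoding="utf-8") as f:
        # writelines() adds no separators, so each item carries its own newline.
        f.writelines(json.dumps(event) + "\n" for event in events)


path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
flush_events([{"type": "start"}, {"type": "stop"}], path)
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

The generator avoids building an intermediate list, and the loop over events runs inside `writelines()` instead of as one `f.write()` call per event.
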
@google-labs-jules
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces several performance optimizations for the telemetry engine, including a thread-safe module-level cache for pricing configurations and the use of writelines for more efficient event logging. It also updates timestamp handling to use timezone-aware datetime objects. A potential race condition in get_run_id() was identified, which should be addressed to ensure consistency across high-frequency calls.

Comment thread heidi_engine/telemetry.py

# Check for pricing config file
pricing_file = (
    Path(PRICING_CONFIG_PATH) if PRICING_CONFIG_PATH else get_run_dir() / "pricing.json"

high

The call to get_run_dir() eventually invokes get_run_id(), which has a race condition when initializing the global RUN_ID (lines 440-446). While load_pricing_config is protected by _pricing_lock, other high-frequency functions like emit_event call get_run_id() without this lock. This could lead to multiple threads generating different run IDs if they hit the initialization path simultaneously. Given the focus on thread-safety in this PR, get_run_id() should be updated to use a lock for its initialization logic.
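
One way to address this is double-checked initialization under a dedicated lock. The sketch below is an assumption about the shape of the fix, not the code in `heidi_engine/telemetry.py`: `_run_id_lock` is a hypothetical name, and the UUID-based ID stands in for whatever format the real function generates.

```python
import threading
import uuid
from typing import Optional

RUN_ID: Optional[str] = None  # global named in the review comment
_run_id_lock = threading.Lock()  # hypothetical name


def get_run_id() -> str:
    """Return the process-wide run ID, creating it exactly once."""
    global RUN_ID
    # Fast path: once set, RUN_ID never changes, so reads need no lock.
    run_id = RUN_ID
    if run_id is not None:
        return run_id
    with _run_id_lock:
        # Re-check under the lock: another thread may have initialized
        # RUN_ID while this one was waiting.
        if RUN_ID is None:
            RUN_ID = uuid.uuid4().hex  # assumption: real ID format may differ
        return RUN_ID
```

The lock is only taken on the initialization path, so high-frequency callers such as `emit_event` would pay for a single global read once the ID exists.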
