fix(object_store): prioritize client-ordered buckets and correct decay #76

Merged
shikhar merged 4 commits into main from fix/issue-43-error-rate-decay
Feb 19, 2026

@shikhar
Member

@shikhar shikhar commented Feb 19, 2026

Summary

  • fix error-rate observation to use only time-based decay (remove per-observation decay)
  • clamp bucket error-rate state to 1.0 so the metric remains a true probability
  • rebalance bucket scoring so reliability penalties dominate latency for failing/circuit-open buckets
  • increase position penalty to strongly prefer client-provided bucket order (region/AZ locality)
  • preserve client order when scores tie by using index as a deterministic tie-break
  • update and expand stats.rs tests for the new scoring and decay semantics
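
The first two bullets can be illustrated with a minimal sketch. It is not the code from stats.rs: the `ErrorRate` type, the exponential half-life form of the decay, and the values of `ALPHA` and `HALF_LIFE` are assumptions made for illustration. Only the ideas of time-based decay, adding `ALPHA` on failure, and clamping at 1.0 come from this PR.

```rust
use std::time::{Duration, Instant};

// Assumed values for illustration; the PR does not state them.
const ALPHA: f64 = 0.2;
const ERROR_RATE_MAX: f64 = 1.0;
const HALF_LIFE: Duration = Duration::from_secs(30);

/// Hypothetical error-rate tracker for a single bucket.
#[derive(Debug, Clone, Copy)]
pub struct ErrorRate {
    value: f64,
    last_update: Instant,
}

impl ErrorRate {
    pub fn new(now: Instant) -> Self {
        Self { value: 0.0, last_update: now }
    }

    /// Error rate as of `now`, decayed only by elapsed time
    /// (exponential half-life decay is an assumption here).
    pub fn error_rate(&self, now: Instant) -> f64 {
        let elapsed = now.saturating_duration_since(self.last_update).as_secs_f64();
        self.value * 0.5f64.powf(elapsed / HALF_LIFE.as_secs_f64())
    }

    /// Records one observation. A success applies no extra decay;
    /// a failure adds ALPHA and clamps the result to ERROR_RATE_MAX.
    pub fn observe(&mut self, now: Instant, success: bool) {
        let mut rate = self.error_rate(now);
        if !success {
            rate = (rate + ALPHA).min(ERROR_RATE_MAX);
        }
        self.value = rate;
        self.last_update = now;
    }
}
```

With this shape, a burst of successes at the same instant leaves the rate unchanged, so how quickly a bucket recovers depends on elapsed time rather than on request volume.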

Testing

  • cargo +nightly fmt
  • cargo clippy --all-features --all-targets -- -D warnings --allow deprecated
  • cargo nextest run

Closes #43
Closes #68

@shikhar shikhar marked this pull request as ready for review February 19, 2026 04:50
@greptile-apps

greptile-apps bot commented Feb 19, 2026

Greptile Summary

This PR fixes the bucket scoring algorithm to better prioritize client-ordered buckets and healthy buckets over failing ones. The key changes are:

  • Error rate decay: Removed per-observation decay (* (1.0 - ALPHA)) on success, keeping only time-based decay via stats.error_rate(now). This ensures error rate only decays naturally over time, not artificially on each observation.
  • Error rate capping: Added .min(ERROR_RATE_MAX) to clamp error rate at 1.0, ensuring it remains a valid probability.
  • Increased position penalty: Changed from 200 to 2000 points per position, strongly preferring client bucket ordering (e.g., region/AZ locality).
  • Increased error multiplier: Changed from 100x to 100,000x, ensuring reliability dominates latency in scoring decisions.
  • Tie-breaking: Added index as secondary sort key (self.score(now, bucket, *i), *i) to deterministically preserve client order when scores tie.
  • Test updates: Updated all tests to reflect new scoring constants and added 3 new tests for tie-breaking, error rate capping, and success observation behavior.
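
The tie-breaking rule can be sketched as follows. This is a simplified stand-in, not the PR's code: the real key is described above as `(self.score(now, bucket, *i), *i)`, whereas this hypothetical `attempt_order` takes precomputed scores and assumes an integer score type.

```rust
/// Returns bucket indices ordered by ascending (score, index).
/// `scores[i]` is the already-computed score of the bucket at client position `i`.
pub fn attempt_order(scores: &[u64]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..scores.len()).collect();
    // The index is the secondary key, so equal scores keep the client's order
    // regardless of whether the sort used is stable.
    order.sort_by_key(|&i| (scores[i], i));
    order
}
```

For example, scores `[500, 100, 500, 100]` yield the order `[1, 3, 0, 2]`: the two cheapest buckets come first, and within each tie the client-provided position decides.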

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk.
  • The changes are well-tested with comprehensive unit tests covering all modified behavior. The mathematical logic for error rate decay is now more correct (time-based only, properly capped). The scoring changes are intentional design improvements with clear rationale. All tests pass and the changes align with the stated goals in the PR description.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/object_store/stats.rs | Refactors bucket scoring to prioritize client order and reliability. Fixes error rate decay to use only time-based decay and clamps error rate to 1.0. Increases position penalty and error multiplier to strongly prefer client-ordered buckets and healthy buckets over failing ones. Adds comprehensive tests. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[observe bucket, outcome] --> B{outcome?}
    B -->|Success| C[Reset consecutive_failures]
    B -->|Failure| D[Increment consecutive_failures]
    B --> E[Apply time-based decay to error_rate]
    E --> F{outcome?}
    F -->|Success| G[Update latency histogram]
    F -->|Failure| H[error_rate += ALPHA]
    H --> I[Clamp error_rate to max 1.0]
    I --> J[Update last_failure_time]
    G --> K[Update last_update]
    J --> K
    C --> G
    D --> H
    
    L[score bucket] --> M[Calculate base = idx * 2000]
    M --> N{bucket known?}
    N -->|No| O[Return base + 5000]
    N -->|Yes| P[Get latency snapshot]
    P --> Q[Calculate lat = mean_micros / 100]
    Q --> R{circuit_open?}
    R -->|Yes| S[err = 1_000_000]
    R -->|No| T[err = error_rate * 100_000]
    S --> U[Return base + err + lat]
    T --> U
    
    V[attempt_order] --> W[Sort by score, then index]
    W --> X[Preserve client order on ties]
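
The scoring branch of the flowchart can be written out as a small Rust sketch. The constants are taken from the chart; the `BucketSnapshot` type, the constant names, the integer score type, and the function signature are assumptions for illustration and may not match stats.rs.

```rust
// Values as shown in the flowchart; the names are hypothetical.
const POSITION_PENALTY: u64 = 2_000;
const UNKNOWN_BUCKET_PENALTY: u64 = 5_000;
const CIRCUIT_OPEN_PENALTY: u64 = 1_000_000;
const ERROR_RATE_MULTIPLIER: f64 = 100_000.0;

/// Hypothetical view of one bucket's stats at scoring time.
pub struct BucketSnapshot {
    pub mean_latency_micros: u64,
    pub error_rate: f64,
    pub circuit_open: bool,
}

/// Lower scores are tried first. `idx` is the client-provided position.
pub fn score(idx: usize, stats: Option<&BucketSnapshot>) -> u64 {
    let base = idx as u64 * POSITION_PENALTY;
    match stats {
        // Bucket with no recorded stats: position plus a fixed penalty.
        None => base + UNKNOWN_BUCKET_PENALTY,
        Some(s) => {
            let lat = s.mean_latency_micros / 100;
            let err = if s.circuit_open {
                CIRCUIT_OPEN_PENALTY
            } else {
                (s.error_rate.clamp(0.0, 1.0) * ERROR_RATE_MULTIPLIER) as u64
            };
            base + err + lat
        }
    }
}
```

Under these numbers, a healthy bucket at position 1 with 50 ms mean latency scores 2,000 + 0 + 500 = 2,500, while a bucket at position 0 with a 0.5 error rate and 10 ms latency scores 0 + 50,000 + 100 = 50,100, so the reliability term outweighs both position and latency.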

Last reviewed commit: 500ecc6

@shikhar shikhar merged commit 08f8e1f into main Feb 19, 2026
5 checks passed
@shikhar shikhar deleted the fix/issue-43-error-rate-decay branch February 19, 2026 04:57
@github-actions github-actions bot mentioned this pull request Feb 19, 2026
shikhar pushed a commit that referenced this pull request Feb 19, 2026
## 🤖 New release

* `cachey`: 0.10.0 -> 0.10.1

<details><summary><i><b>Changelog</b></i></summary><p>

<blockquote>

## [0.10.1](0.10.0...0.10.1) - 2026-02-19

### Fixed

- *(throughput)* use fractional lookback divisor ([#78](#78))
- *(object_store)* prioritize client-ordered buckets and correct decay ([#76](#76))
</blockquote>


</p></details>

---
This PR was generated with
[release-plz](https://github.com/release-plz/release-plz/).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>