Add per-field dimension to caches#309

Open
zipdoki wants to merge 14 commits into main from feat/cache-dimension

Conversation

@zipdoki
Contributor

@zipdoki zipdoki commented May 7, 2026

Summary

As discussed in an offline meeting, this PR adds an optional dimension whitelist to cache fields. When set, only edges whose value for that field is in the whitelist are written to the cache; other edges are skipped at write time. Read paths are unchanged — filtered rows are simply absent from the wide row.

Changes

  • IndexField (V3) and Cache.Field (V2 codec) gain dimension with empty-set validation.
  • EdgeMutationBuilder.buildCacheRecords skips the cache record when any field is outside its dimension.
  • AbstractEdgeEncoder.encodeAllCacheEdges applies the same filter on V2 bulk writes.
  • Test PrettyObjectWriter matches production mapper's NON_NULL policy.
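The write-time skip described above can be sketched as follows. This is a minimal standalone model: `DimensionFilterSketch`, its `Field` record, and the `passesAllDimensions` helper are hypothetical stand-ins for the real `Cache.Field` and builder code, not the PR's actual classes.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

class DimensionFilterSketch {
    // Simplified stand-in for a cache field: a name plus an optional whitelist.
    record Field(String name, Set<?> dimension) {}

    // An edge is written to the cache only if every dimensioned field's value
    // is inside that field's whitelist; otherwise the cache record is skipped.
    static boolean passesAllDimensions(List<Field> fields, Map<String, ?> edgeProps) {
        for (Field f : fields) {
            if (f.dimension() == null) continue; // null dimension = no filter
            if (!f.dimension().contains(edgeProps.get(f.name()))) return false;
        }
        return true;
    }
}
```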

How to Test

  • ./gradlew :core:test :codec-java:test :server:test :engine:test
  • EdgeMutationBuilderTest.CacheDimensionFilter — in/out-of-dimension, null dimension, UPDATED transitions, empty-set reject.
  • EdgeCacheQueryE2ETest, ActionbaseQueryE2ETest — direct seek and multi-hop CACHE.

AI Assistance

  • This PR was written largely with AI assistance.
    • Tool / model: Claude Code (Opus 4.7, 1M context)

zipdoki and others added 3 commits May 7, 2026 18:55
Caches gain an optional `dimension` field — a whitelist of allowed values
for that field. When set, only edges whose value for the field is in the
list are written to the cache; other edges are silently skipped at write
time. Read-side code is unchanged: filtered rows are simply absent from
the wide row, so existing range filters and pagination keep working.

`dimension` is held as a `Set` (membership-only, dedup'd by Jackson at
deserialization). Empty sets are rejected at construction to surface DDL
mistakes. V2 `Cache.Field` mirrors the V3 `IndexField` shape, including
defensive immutability so caller mutations can't drift the whitelist.

Test plan:
- core: `EdgeMutationBuilderTest` cases for in/out-of-dimension writes,
  null dimension regression, UPDATED transitions across the boundary,
  and empty-set construction reject.
- server: `EdgeCacheQueryE2ETest` and `ActionbaseQueryE2ETest` exercise
  the write-time filter via direct seek and multi-hop CACHE.
- testFixtures: align `PrettyObjectWriter` with production
  `ActionbaseObjectMapper`'s `NON_NULL` inclusion so schema round-trip
  fixtures don't regress on new nullable fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
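The construction-time behavior described above (empty-set rejection, defensive copying) could look roughly like this. `CacheFieldSketch` is a hypothetical shape for illustration, not the actual `Cache.Field` source.

```java
import java.util.HashSet;
import java.util.Set;

class CacheFieldSketch {
    private final String field;
    private final Set<?> dimension; // null = no whitelist

    CacheFieldSketch(String field, Set<?> dimension) {
        if (dimension != null && dimension.isEmpty()) {
            // An empty whitelist would silently drop every edge;
            // rejecting it at construction surfaces DDL mistakes early.
            throw new IllegalArgumentException("dimension must not be empty");
        }
        this.field = field;
        // Defensive copy: later mutation of the caller's set cannot drift
        // the whitelist. HashSet suffices, since membership is order-independent.
        this.dimension = dimension == null ? null : new HashSet<>(dimension);
    }

    String getField() { return field; }

    Set<?> getDimension() { return dimension; }
}
```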
Defensive copy via `new HashSet<>(dimension)` is enough — the surrounding
`Cache.fields` list isn't unmodifiable either, so wrapping only `dimension`
was inconsistent. `HashSet` over `LinkedHashSet`: a whitelist is order-
independent by definition; preserving JSON input order would contradict
the Set semantics we already chose.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Cache field values are direction-independent, so computing them once per
cache (alongside the dimension match check) instead of once per direction
saves an allocation pair on `BOTH` mutations.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@zipdoki zipdoki self-assigned this May 7, 2026
@zipdoki zipdoki requested a review from em3s as a code owner May 7, 2026 10:24
@zipdoki zipdoki removed the request for review from em3s May 7, 2026 10:24
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels May 7, 2026
jacksonObjectMapper().apply {
setDefaultPrettyPrinter(PrettyPrinter())
configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
setSerializationInclusion(JsonInclude.Include.NON_NULL)
Contributor Author


@em3s
Surfaced by nullable dimension — aligns test mapper with production NON_NULL so nullable fields don't break round-trip fixtures.

zipdoki and others added 2 commits May 7, 2026 19:29
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Two cases — matching and non-matching edges against a `dimension` whitelist
on `permission` — confirming the filter applies through `BulkEdgeEncoder.bulkEncodeAll`
without affecting the hash/indexed/counter rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@em3s
Contributor

em3s commented May 7, 2026

Adding context from our offline discussion — this is groundwork for per-dimension limit distribution on multi-field EdgeCache. Today, with a two-field cache (field1, field2) and limit 100, both bulkload and pruner spend the budget in leading-field order, so we can't allocate it across field1 buckets — the bucket set isn't known up front.

Pinning that set in the label via dimension gives bulkload and pruner a shared source, enabling per-dimension top-N.

val field: String,
val order: Order,
val dimension: Set<Any>? = null,
)
Contributor


dimension is EdgeCache-only and stays null on EdgeIndex — time to split off a CacheField.

@em3s
Contributor

em3s commented May 7, 2026

Left a review comment. Please take a look. @zipdoki

`dimension` is EdgeCache-only and stays `null` on EdgeIndex, so move
it to a dedicated `CacheField` and keep `IndexField` to its index shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@em3s
Contributor

em3s commented May 8, 2026

@zipdoki

As also discussed offline: instead of matches, pass the actual dimensionValue through to the caller — so bulkload (and similar) can compute top-N on rowkey + dimensionValue rather than rowkey alone.

Adds a `dimensionValue` byte/string field to `KeyFieldValue`/`TypedKeyFieldValue` and threads it through the cache-edge write path. The value encodes only the dimensioned fields (in declared order) and is shared across OUT/IN directions.

`Cache.Field.hasDimension()` treats `null` and `[]` as equivalent (no filter), preserving JSON round-trip while unifying behavior. `Cache` precomputes `dimensionedFields` so the hot path avoids re-filtering, and `passesAllDimensions` / `encodeDimensionValue` iterate the precomputed list directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
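The `hasDimension()` convention and the precomputed `dimensionedFields` list described above can be sketched as follows. `DimensionValueSketch` is a simplified stand-in for the real `Cache` and `Cache.Field` classes; only the null/empty unification and the one-time filtering are modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class DimensionValueSketch {
    record Field(String name, Set<?> dimension) {
        // null and [] are treated the same: no filter. JSON round-trip may
        // produce either form, so behavior is unified here.
        boolean hasDimension() { return dimension != null && !dimension.isEmpty(); }
    }

    // Precompute the dimensioned subset once so the hot write path
    // doesn't re-filter the full field list per edge.
    static List<Field> dimensionedFields(List<Field> all) {
        List<Field> out = new ArrayList<>();
        for (Field f : all) if (f.hasDimension()) out.add(f);
        return out;
    }
}
```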
@zipdoki
Contributor Author

zipdoki commented May 8, 2026

@em3s
Thanks for both reviews.

1. Splitting CacheField off IndexField — done in the "Split CacheField off IndexField" commit. dimension now lives only on CacheField.

2. Threading dimensionValue instead of matches — tried, reverted.

Three outcomes have to coexist: skip on mismatch, emit with dimensionValue = null, emit with an encoded value. Merging the predicate and the encoder into one return path looks like:

.flatMap(cache -> {
  List<Cache.Field> dimensionedFields = cache.getDimensionedFields();
  Object[] values = new Object[dimensionedFields.size()];
  for (int i = 0; i < dimensionedFields.size(); i++) {
    Cache.Field f = dimensionedFields.get(i);
    Object v = resolveFieldValue(f.getField(), ts, src, tgt, props);
    if (!f.getDimension().contains(v)) return Stream.empty();
    values[i] = v;
  }
  T dimensionValue = dimensionedFields.isEmpty()
      ? null
      : encodeBufferAsT(buffer -> { /* encode values[i] */ });
  return dirType.getDirs().stream()
      .map(dir -> encodeCacheEdge(...).withDimensionValue(dimensionValue));
})

Costs of merging:

  • A per-cache Object[] allocation just to carry resolved values between the validate phase and the encode phase — extra GC pressure on the hot path that the separated form avoids entirely.
  • Object[] + Stream.empty() sentinel to encode the three outcomes in a single return.
  • Lambda body grows to ~15 lines, vs. two ~5-line helpers.
  • Loses the natural filter (predicate) / flatMap (value + expand) mapping onto stream operators.
  • Harder to unit-test in isolation.

What the current shape pays for that:

.filter(cache -> passesAllDimensions(cache, ts, src, tgt, props))
.flatMap(cache -> {
  T dimensionValue = encodeDimensionValue(cache, ts, src, tgt, props);
  return dirType.getDirs().stream()
      .map(dir -> encodeCacheEdge(...).withDimensionValue(dimensionValue));
})

resolveFieldValue runs twice per dimensioned field, and the two helpers must stay aligned on iteration. Both iterate cache.getDimensionedFields(), so drift risk is low, and resolveFieldValue is a switch + HashMap.get — sub-100ns. Happy to revisit if that cost ever gets heavier.

zipdoki and others added 2 commits May 8, 2026 17:56
Added in 4c9c586 to keep schema fixtures stable against the new
nullable `IndexField.dimension`. After 77bbb5b split `CacheField` off
`IndexField`, `IndexField` has no nullable field again and no current
fixture serializes a `CacheField` with `dimension` set, so the
inclusion override no longer affects any test.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Existing dimension tests only exercise 2-field caches (one dimensioned
field plus `created_at`). Adds a 3-field case
(`permission`, `memo`, `created_at`) verifying the encoder output that
downstream per-dimension top-N depends on:

- four (permission, memo) buckets emit distinct `dimensionValue` tags,
- same-bucket edges with different `created_at` share a `dimensionValue`,
- configured field order is preserved in the byte encoding so
  same-permission siblings share a longer prefix than cross-permission
  pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@em3s
Copy link
Copy Markdown
Contributor

em3s commented May 8, 2026

Once the three-field tests pass and the bulkload/seek paths we're testing internally are green, we can merge and follow up with the codec-java release. @zipdoki

zipdoki and others added 5 commits May 8, 2026 18:27
`testCacheDimensionFourBucketsHaveDistinctOrderedTags` previously checked
distinctness, stability, and field-order prefix grouping. Adds an explicit
byte-wise comparison so ASC encoding is verified to sort the four buckets
in (permission, memo) lexicographic order: me/a < me/b < others/a < others/b.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
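The bucket ordering asserted above can be illustrated with plain UTF-8 bytes. This is a simplification: `tag` and its 0x00 separator are assumptions for illustration, not the real encoder's byte format, but unsigned lexicographic comparison orders the four buckets the same way.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BucketOrderSketch {
    // Encode (permission, memo) in declared field order; a 0x00 separator
    // keeps same-permission siblings sharing a longer common prefix than
    // cross-permission pairs.
    static byte[] tag(String permission, String memo) {
        byte[] p = permission.getBytes(StandardCharsets.UTF_8);
        byte[] m = memo.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[p.length + 1 + m.length];
        System.arraycopy(p, 0, out, 0, p.length);
        out[p.length] = 0x00;
        System.arraycopy(m, 0, out, p.length + 1, m.length);
        return out;
    }

    // Byte-wise unsigned comparison, as an ASC-encoded rowkey would sort.
    static int cmp(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }
}
```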
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
This reverts commit 99dc9ec, reversing
changes made to 5f351b2.
