
[catalog] Sketch out initial RFC scope #115

Draft
hachikuji wants to merge 3 commits into main from catalog-rfc-sketch

Conversation

@hachikuji
Contributor

It would be useful to have a catalog for all OpenData systems. This PR attempts to sketch out the potential scope for an initial catalog. We can fill in design details once there's general consensus.

```
bucket: acme-data
```

## Goals
Contributor

This is great! I wonder, however, if we should add a concept of the 'owner' of a slate. For example, if you have a Prometheus server that's backed by a slate TSDB, simply deleting the S3 buckets will orphan the server. We need to be able to know that there is a service that depends on the slate, where it lives, etc., so that operations maintain the integrity of the system.

Contributor Author

Makes sense. This gets into liveness, I guess. I was hoping we could go with a model where the catalog only communicates with object storage. Perhaps we could add some kind of explicit fencing marker to each OpenData system, which must be added by the writer. Its purpose is to fence writers/readers and signal the catalog that it is safe to delete.
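A minimal, purely illustrative sketch of the marker idea. The `SlateManifest` type and `fenced` field are hypothetical names for this discussion, not actual SlateDB APIs; the point is only that the catalog consults object-storage state rather than a live service:

```python
# Hypothetical sketch: a fencing marker stored alongside a slate's manifest.
from dataclasses import dataclass

@dataclass
class SlateManifest:
    name: str
    fenced: bool = False  # set by the writer to fence readers/writers

    def fence(self):
        """Writer marks the slate as fenced; readers/writers must stop."""
        self.fenced = True

def safe_to_delete(manifest: SlateManifest) -> bool:
    # The catalog only consults object-storage state, never a live service.
    return manifest.fenced

m = SlateManifest(name="metrics")
assert not safe_to_delete(m)
m.fence()
assert safe_to_delete(m)
```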

Contributor

To make sure I understand: the marker would be added by the writer to indicate that the slate is no longer being written to, and that makes it safe to delete? If so, I'd presume the readers would also have to write a marker. If that's how it works, this is more like a lease system than fencing markers.

From a user's perspective, if you have a service, you want to manage the lifecycle of the service as a whole. For example, a useful admin delete operation would deprovision the service and optionally delete the data, not just delete the data.

Contributor Author
@hachikuji hachikuji Jan 21, 2026

The marker could be like a poison pill inserted into the SlateDB manifest. It would kill readers and writers. I think the main point is defining the communication model. How does the catalog interact with provisioning systems? How does the catalog interact with system readers/writers? The ideal, from my perspective, is that all communication with the catalog is done through object storage. It is a return of our "storage as protocol" idea at its heart. For example, the catalog could write provisioning requests as files in object storage. Some kind of k8s service could watch those files and do the actual provisioning work. Deletion workflows could follow a similar pattern.
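A hedged sketch of that "storage as protocol" flow. A plain dict stands in for the S3 bucket, and all key paths and function names are illustrative assumptions, not part of any proposed API:

```python
# Illustrative sketch: the catalog writes provisioning requests as objects;
# a watcher (e.g. a k8s service) lists the prefix and acts on each request.
import json

object_store = {}  # stand-in for an S3 bucket (key -> object body)

def submit_request(store, slate, action):
    # The catalog communicates by writing a request file into object storage.
    key = f"catalog/requests/{slate}.json"
    store[key] = json.dumps({"slate": slate, "action": action})
    return key

def poll_requests(store, prefix="catalog/requests/"):
    # A provisioning watcher lists objects under the prefix and handles each.
    return [json.loads(v) for k, v in store.items() if k.startswith(prefix)]

submit_request(object_store, "db0", "provision")
assert poll_requests(object_store) == [{"slate": "db0", "action": "provision"}]
```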

Contributor Author

Downstream systems could just be catalog readers, I guess. They might follow changes to the catalog as any SlateDB reader does and act when necessary.

Contributor
@apurvam apurvam Jan 21, 2026

I like this line of thinking. The catalog then needs to hold enough metadata to make those flows possible. I think this is captured under 'define process for registering/deleting slates' in your goals. So I can imagine this type of flow in the longer term:

  1. A user issues a deprovisioning request for some OpenData database db0. This could be through a UI, Terraform, whatever.
  2. The request is received by some control plane.
  3. The control plane writes a poison pill into the catalog.
  4. A k8s operator detects the poison pill and instructs the reader/writer services to shut down. This means it needs a mapping from the slate to the reader and writer pods. Alternatively, the reader and writer pods of db0 could be catalog readers themselves: they read the catalog and mark themselves for deprovisioning, with the operator simply executing the action. I like the latter approach because it works for essentially any deployment model and doesn't require the operator to maintain additional metadata.

Does that match what you had in mind?
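To make the second variant concrete, here's a minimal sketch of pods acting as catalog readers and flagging themselves, with the catalog modeled as a dict. The `SlatePod` class and field names are hypothetical, chosen only to illustrate the flow:

```python
# Sketch of the self-deprovisioning variant: reader/writer pods of a slate
# follow the catalog themselves; the operator only executes the shutdown.
catalog = {"db0": {"poison_pill": False}}

class SlatePod:
    def __init__(self, slate):
        self.slate = slate
        self.wants_deprovision = False

    def follow_catalog(self, catalog):
        # Pods act as catalog readers: on seeing the poison pill they flag
        # themselves, so the operator needs no slate-to-pod mapping.
        if catalog[self.slate]["poison_pill"]:
            self.wants_deprovision = True

pods = [SlatePod("db0"), SlatePod("db0")]
catalog["db0"]["poison_pill"] = True  # written by the control plane
for p in pods:
    p.follow_catalog(catalog)
assert all(p.wants_deprovision for p in pods)
```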

Contributor Author

Yeah, I think that's right. I guess the main point is that downstream systems are just Readers in our SlateDB-backed framing. So provisioning systems would follow the catalog as readers. Perhaps they could even modify the catalog themselves by temporarily assuming the Writer role; perhaps we don't need long-lived writers at all. If we could get a model like this to work, it would remove a huge amount of complexity: the catalog wouldn't need to be a persistent service.

Contributor Author

I added some text to the RFC about the communication model. I like it. It leans into the advantages of SlateDB/object storage.

Contributor

Nice. I like the direction. I think SlateDB transactions will be crucial to making this work.


This RFC proposes a catalog system for OpenData that serves as a central management plane for OpenData storage systems. The catalog provides a single point to manage metadata about the systems ("slates") a user has installed, including their names, types, and object storage configuration. The catalog itself is implemented as a slate backed by SlateDB, following the same patterns as other OpenData subsystems.

## Motivation
Contributor
@agavra agavra Jan 22, 2026

I had a review of this lined up, but now I'm wondering whether this RFC is over-committed to a dogfooding philosophy. What if instead we tried to design this as k8s-native?

I posed this question to Claude and here's an alternative we came up with:

The current RFC essentially builds a bespoke control plane on top of SlateDB, but if your primary deployment target is Kubernetes, you'd be reinventing machinery that K8s already provides (watches, reconciliation loops, status subresources, RBAC, etc.).

Instead of the catalog being a SlateDB-backed store that components poll, the Kubernetes API server becomes the catalog. Each slate is represented as a Custom Resource, and an operator reconciles desired state to actual state.

Then, using the CRDs, the workflow could look like:

```
$ kubectl get slates
NAME      TYPE        BUCKET                     PHASE
events    log         s3://acme-data/events      Provisioned
metrics   timeseries  s3://acme-data/metrics     Provisioned

$ kubectl apply -f - <<EOF
apiVersion: opendata.io/v1alpha1
kind: Slate
metadata:
  name: orders
spec:
  type: log
  objectStore:
    bucket: s3://acme-data/orders
EOF
slate.opendata.io/orders created

$ kubectl get slate orders -o jsonpath='{.status.phase}'
Provisioned

$ kubectl delete slate orders
slate.opendata.io/orders deleted
```

Tradeoffs:

| Aspect | SlateDB-backed Catalog | K8s-native CRDs |
| --- | --- | --- |
| Dependency | Only object storage | Requires Kubernetes |
| Discovery | Must know catalog location | Standard K8s API discovery |
| Watches | Polling (or custom mechanism) | Native watch support |
| Auth/RBAC | Custom | K8s RBAC out of the box |
| Tooling | Custom CLI | kubectl, GitOps, Helm, etc. |
| Dogfooding | Uses OpenData's own primitives | External control plane |
| Portability | Runs anywhere with object storage | K8s-only (or needs abstraction) |

The big benefit, then, of using OpenData is that you install just one operator and learn one common language of CRDs, instead of learning a new operator and new CRDs for each of the data systems you deploy in your k8s stack.

Contributor
@agavra agavra Jan 22, 2026

I also think philosophically it's OK to lean into Kubernetes + object storage as the two primitives we rely on. The 'pitch' in my mind is that those two solve the hardest distributed systems problems: the former solves elastic compute and the latter solves elastic storage/consistency. Without both, OpenData's vision can't come to fruition.

Another big win with using kubectl as the primary control plane CLI is that the AI agents are really good with it.

Contributor

We should definitely design to be k8s-native, but I don't think a catalog as discussed here occupies the same place as k8s. Any system needs some storage to figure out what's deployed, where it's deployed, etc. Deployments very often span regions and k8s clusters. The question is where that information is going to live. We need a catalog for that, which drives the k8s actions in a particular region.

Contributor

I guess I conflated the two. I believe we should start by figuring out the deployment models in a single k8s cluster and work up from there. The CLI as proposed here overlaps heavily with the kinds of things k8s should handle for me if it's all within a single k8s cluster.

I'm not convinced that multi-region/multi-k8s is something we should figure out until we have a solid understanding of the single-cluster, multi-AZ design. A single k8s cluster can span multiple AZs, which is likely where 99% of data systems stop.

Contributor Author

I agree with the concern. I don't think we want the catalog directly involved in provisioning. At the same time, I'm not too comfortable being super dogmatic about k8s and sticking it at the heart of the system. I took a shot at reframing the catalog in the latest patch. Rather than tracking a target state, the catalog might simply track the current state. It might be aware of active readers/writers in the system. Kubernetes could consult the catalog prior to deprovisioning a resource rather than having the catalog drive deprovisioning itself for example. Not sure if this is enough value to justify the catalog's existence just yet. I suspect we need to let this stew for a while.

1. A user creates a Kubernetes Custom Resource specifying a new slate.
2. The K8s operator provisions the slate with a catalog reference in its configuration.
3. When the slate starts, it assumes the **Writer** role to register itself in the catalog.
4. CLI tooling or other components can observe the catalog as **Readers** to discover running slates.
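The registration steps above can be sketched minimally in code. The catalog is a dict here; in the proposal it would be a SlateDB-backed slate in object storage, and the function names are illustrative assumptions only:

```python
# Sketch of steps 3-4: a slate registers itself in the catalog as the
# Writer, and CLI tooling discovers slates as Readers.
catalog = {}  # stand-in for the SlateDB-backed catalog slate

def register_slate(catalog, name, slate_type, bucket):
    # Step 3: on startup, the slate assumes the Writer role and records
    # its current state (type and object storage configuration).
    catalog[name] = {"type": slate_type, "bucket": bucket}

def list_slates(catalog):
    # Step 4: CLI tooling observes the catalog as a Reader.
    return sorted(catalog)

register_slate(catalog, "orders", "log", "s3://acme-data/orders")
assert list_slates(catalog) == ["orders"]
```

Note that this models the catalog as tracking *current* state (what is running), not target state, matching the reframing above.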
Contributor

I like the new version a lot. I agree this needs time to bake, and I think the crux of what needs to bake is this 4th item: what do the readers actually do with the data in the catalog? That, more than anything, will inform whether and how the catalog co-exists with orchestration systems like k8s.


3 participants