Commit 0cb51a4

Merge pull request #4177 from somratdutta/add-lakekeeper-catalog
Add Lakekeeper catalog support in docs
2 parents 67a1d86 + 3b48cad commit 0cb51a4

File tree

5 files changed: +373 -0 lines changed


docs/integrations/index.mdx

Lines changed: 1 addition & 0 deletions
@@ -246,6 +246,7 @@ We are actively compiling this list of ClickHouse integrations below, so it's no
 |Redis|<Redissvg alt="Redis logo" style={{width: '3rem', 'height': '3rem'}}/>|Data ingestion|Allows ClickHouse to use [Redis](https://redis.io/) as a dictionary source.|[Documentation](/sql-reference/dictionaries/index.md#redis)|
 |Redpanda|<Image img={redpanda} alt="Redpanda logo" size="logo"/>|Data ingestion|Redpanda is the streaming data platform for developers. It's API-compatible with Apache Kafka, but 10x faster, much easier to use, and more cost-effective.|[Blog](https://redpanda.com/blog/real-time-olap-database-clickhouse-redpanda)|
 |REST Catalog||Data ingestion|Integration with the REST Catalog specification for Iceberg tables, supporting multiple catalog providers including Tabular.io.|[Documentation](/use-cases/data-lake/rest-catalog)|
+|Lakekeeper||Data ingestion|Integration with Lakekeeper, an open-source REST catalog implementation for Apache Iceberg with multi-tenant support.|[Documentation](/use-cases/data-lake/lakekeeper-catalog)|
 |Nessie||Data ingestion|Integration with Nessie, an open-source transactional catalog for data lakes with Git-like data version control.|[Documentation](/use-cases/data-lake/nessie-catalog)|
 |Rust|<Image img={rust} size="logo" alt="Rust logo"/>|Language client|A typed client for ClickHouse|[Documentation](/integrations/language-clients/rust.md)|
 |SQLite|<Sqlitesvg alt="Sqlite logo" style={{width: '3rem', 'height': '3rem'}}/>|Data ingestion|Allows importing and exporting data to SQLite, and supports queries to SQLite tables directly from ClickHouse.|[Documentation](/engines/table-engines/integrations/sqlite)|

docs/use-cases/data_lake/index.md

Lines changed: 1 addition & 0 deletions
@@ -14,4 +14,5 @@ ClickHouse supports integration with multiple catalogs (Unity, Glue, REST, Polar
 | [Querying data in S3 using ClickHouse and the Glue Data Catalog](/use-cases/data-lake/glue-catalog) | Query your data in S3 buckets using ClickHouse and the Glue Data Catalog. |
 | [Querying data in S3 using ClickHouse and the Unity Data Catalog](/use-cases/data-lake/unity-catalog) | Query your data using the Unity Catalog. |
 | [Querying data in S3 using ClickHouse and the REST Catalog](/use-cases/data-lake/rest-catalog) | Query your data using the REST Catalog (Tabular.io). |
+| [Querying data in S3 using ClickHouse and the Lakekeeper Catalog](/use-cases/data-lake/lakekeeper-catalog) | Query your data using the Lakekeeper Catalog. |
 | [Querying data in S3 using ClickHouse and the Nessie Catalog](/use-cases/data-lake/nessie-catalog) | Query your data using the Nessie Catalog with Git-like data version control. |

Lines changed: 366 additions & 0 deletions
@@ -0,0 +1,366 @@
---
slug: /use-cases/data-lake/lakekeeper-catalog
sidebar_label: 'Lakekeeper Catalog'
title: 'Lakekeeper Catalog'
pagination_prev: null
pagination_next: null
description: 'In this guide, we will walk you through the steps to query
 your data using ClickHouse and the Lakekeeper Catalog.'
keywords: ['Lakekeeper', 'REST', 'Tabular', 'Data Lake', 'Iceberg']
show_related_blogs: true
---

import ExperimentalBadge from '@theme/badges/ExperimentalBadge';

<ExperimentalBadge/>

:::note
Integration with the Lakekeeper Catalog works with Iceberg tables only.
This integration supports AWS S3 as well as other cloud storage providers.
:::

ClickHouse supports integration with multiple catalogs (Unity, Glue, REST, Polaris, etc.). This guide will walk you through the steps to query your data using ClickHouse and the [Lakekeeper](https://docs.lakekeeper.io/) catalog.

Lakekeeper is an open-source REST catalog implementation for Apache Iceberg that provides:
- **Rust native** implementation for high performance and reliability
- **REST API** compliance with the Iceberg REST catalog specification
- **Cloud storage** integration with S3-compatible storage

:::note
As this feature is experimental, you will need to enable it using:
`SET allow_experimental_database_iceberg = 1;`
:::

## Local Development Setup {#local-development-setup}

For local development and testing, you can use a containerized Lakekeeper setup. This approach is ideal for learning, prototyping, and development environments.

### Prerequisites {#local-prerequisites}

1. **Docker and Docker Compose**: Ensure Docker is installed and running (a quick check is shown below)
2. **Sample Setup**: You can use the Lakekeeper docker-compose setup
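
If you want to verify the prerequisites before continuing, a minimal check might look like this (hypothetical commands, assuming a recent Docker installation with the Compose plugin):

```bash
# Both commands should print a version string.
docker --version
docker compose version
```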

### Setting up Local Lakekeeper Catalog {#setting-up-local-lakekeeper-catalog}

You can use the official [Lakekeeper docker-compose setup](https://github.com/lakekeeper/lakekeeper/tree/main/examples/minimal), which provides a complete environment with Lakekeeper, a PostgreSQL metadata backend, and MinIO for object storage.

**Step 1:** Create a new folder in which to run the example, then create a file `docker-compose.yml` with the following configuration:

```yaml
version: '3.8'

services:
  lakekeeper:
    image: quay.io/lakekeeper/catalog:latest
    environment:
      - LAKEKEEPER__PG_ENCRYPTION_KEY=This-is-NOT-Secure!
      - LAKEKEEPER__PG_DATABASE_URL_READ=postgresql://postgres:postgres@db:5432/postgres
      - LAKEKEEPER__PG_DATABASE_URL_WRITE=postgresql://postgres:postgres@db:5432/postgres
      - RUST_LOG=info
    command: ["serve"]
    healthcheck:
      test: ["CMD", "/home/nonroot/lakekeeper", "healthcheck"]
      interval: 1s
      timeout: 10s
      retries: 10
      start_period: 30s
    depends_on:
      migrate:
        condition: service_completed_successfully
      db:
        condition: service_healthy
      minio:
        condition: service_healthy
    ports:
      - 8181:8181
    networks:
      - iceberg_net

  migrate:
    image: quay.io/lakekeeper/catalog:latest-main
    environment:
      - LAKEKEEPER__PG_ENCRYPTION_KEY=This-is-NOT-Secure!
      - LAKEKEEPER__PG_DATABASE_URL_READ=postgresql://postgres:postgres@db:5432/postgres
      - LAKEKEEPER__PG_DATABASE_URL_WRITE=postgresql://postgres:postgres@db:5432/postgres
      - RUST_LOG=info
    restart: "no"
    command: ["migrate"]
    depends_on:
      db:
        condition: service_healthy
    networks:
      - iceberg_net

  bootstrap:
    image: curlimages/curl
    depends_on:
      lakekeeper:
        condition: service_healthy
    restart: "no"
    command:
      - -w
      - "%{http_code}"
      - "-X"
      - "POST"
      - "-v"
      - "http://lakekeeper:8181/management/v1/bootstrap"
      - "-H"
      - "Content-Type: application/json"
      - "--data"
      - '{"accept-terms-of-use": true}'
      - "-o"
      - "/dev/null"
    networks:
      - iceberg_net

  initialwarehouse:
    image: curlimages/curl
    depends_on:
      lakekeeper:
        condition: service_healthy
      bootstrap:
        condition: service_completed_successfully
    restart: "no"
    command:
      - -w
      - "%{http_code}"
      - "-X"
      - "POST"
      - "-v"
      - "http://lakekeeper:8181/management/v1/warehouse"
      - "-H"
      - "Content-Type: application/json"
      - "--data"
      - '{"warehouse-name": "demo", "project-id": "00000000-0000-0000-0000-000000000000", "storage-profile": {"type": "s3", "bucket": "warehouse-rest", "key-prefix": "", "assume-role-arn": null, "endpoint": "http://minio:9000", "region": "local-01", "path-style-access": true, "flavor": "minio", "sts-enabled": true}, "storage-credential": {"type": "s3", "credential-type": "access-key", "aws-access-key-id": "minio", "aws-secret-access-key": "ClickHouse_Minio_P@ssw0rd"}}'
      - "-o"
      - "/dev/null"
    networks:
      - iceberg_net

  db:
    image: bitnami/postgresql:16.3.0
    environment:
      - POSTGRESQL_USERNAME=postgres
      - POSTGRESQL_PASSWORD=postgres
      - POSTGRESQL_DATABASE=postgres
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -p 5432 -d postgres"]
      interval: 2s
      timeout: 10s
      retries: 5
      start_period: 10s
    volumes:
      - postgres_data:/bitnami/postgresql
    networks:
      - iceberg_net

  minio:
    image: bitnami/minio:2025.4.22
    environment:
      - MINIO_ROOT_USER=minio
      - MINIO_ROOT_PASSWORD=ClickHouse_Minio_P@ssw0rd
      - MINIO_API_PORT_NUMBER=9000
      - MINIO_CONSOLE_PORT_NUMBER=9001
      - MINIO_SCHEME=http
      - MINIO_DEFAULT_BUCKETS=warehouse-rest
    networks:
      iceberg_net:
        aliases:
          - warehouse-rest.minio
    ports:
      - "9002:9000"
      - "9003:9001"
    healthcheck:
      test: ["CMD", "mc", "ls", "local", "|", "grep", "warehouse-rest"]
      interval: 2s
      timeout: 10s
      retries: 3
      start_period: 15s
    volumes:
      - minio_data:/bitnami/minio/data

  clickhouse:
    image: clickhouse/clickhouse-server:head
    container_name: lakekeeper-clickhouse
    user: '0:0' # Ensures root permissions
    ports:
      - "8123:8123"
      - "9000:9000"
    volumes:
      - clickhouse_data:/var/lib/clickhouse
      - ./clickhouse/data_import:/var/lib/clickhouse/data_import # Mount dataset folder
    networks:
      - iceberg_net
    environment:
      - CLICKHOUSE_DB=default
      - CLICKHOUSE_USER=default
      - CLICKHOUSE_DO_NOT_CHOWN=1
      - CLICKHOUSE_PASSWORD=
    depends_on:
      lakekeeper:
        condition: service_healthy
      minio:
        condition: service_healthy

volumes:
  postgres_data:
  minio_data:
  clickhouse_data:

networks:
  iceberg_net:
    driver: bridge
```
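
A few notes on what this Compose file sets up: the `migrate` service runs Lakekeeper's database migrations against PostgreSQL before the catalog starts, `bootstrap` calls the `http://lakekeeper:8181/management/v1/bootstrap` endpoint once to initialize the server, and `initialwarehouse` creates a warehouse named `demo` backed by the `warehouse-rest` bucket in MinIO. On the host, the MinIO S3 API is published on port 9002 and the MinIO console on port 9003, while ClickHouse is reachable on ports 8123 (HTTP) and 9000 (native).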

**Step 2:** Run the following command to start the services:

```bash
docker compose up -d
```

**Step 3:** Wait for all services to be ready. You can check the logs:

```bash
docker compose logs -f
```

:::note
The Lakekeeper setup requires that sample data be loaded into the Iceberg tables first. Make sure the environment has created and populated the tables before attempting to query them through ClickHouse. The availability of tables depends on the specific docker-compose setup and sample data loading scripts.
:::
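
Before connecting from ClickHouse, you can optionally confirm from the host that the catalog is up, since port 8181 is published. This is a minimal smoke test, assuming Lakekeeper exposes a `/management/v1/info` endpoint as described in its docs (adjust for your version):

```bash
# Should return JSON with server info once Lakekeeper is healthy.
curl -s http://localhost:8181/management/v1/info
```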

### Connecting to Local Lakekeeper Catalog {#connecting-to-local-lakekeeper-catalog}

Connect to your ClickHouse container:

```bash
docker exec -it lakekeeper-clickhouse clickhouse-client
```

Then create the database connection to the Lakekeeper catalog:

```sql
SET allow_experimental_database_iceberg = 1;

CREATE DATABASE demo
ENGINE = DataLakeCatalog('http://lakekeeper:8181/catalog', 'minio', 'ClickHouse_Minio_P@ssw0rd')
SETTINGS catalog_type = 'rest', storage_endpoint = 'http://minio:9002/warehouse-rest', warehouse = 'demo';
```
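
The engine arguments line up with the Compose file above: `http://lakekeeper:8181/catalog` is Lakekeeper's Iceberg REST endpoint, the two credentials are the MinIO root user and password, and `warehouse = 'demo'` matches the warehouse created by the `initialwarehouse` service.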

## Querying Lakekeeper catalog tables using ClickHouse {#querying-lakekeeper-catalog-tables-using-clickhouse}

Now that the connection is in place, you can start querying via the Lakekeeper catalog. For example:

```sql
USE demo;

SHOW TABLES;
```

If your setup includes sample data (such as the taxi dataset), you should see tables like:

```sql title="Response"
┌─name──────────┐
│ default.taxis │
└───────────────┘
```
267+
:::note
If you don't see any tables, this usually means:
1. The environment hasn't created the sample tables yet
2. The Lakekeeper catalog service isn't fully initialized
3. The sample data loading process hasn't completed

If your setup includes a data-loading service (for example, a Spark job), you can check its logs to see the table creation progress:
```bash
docker compose logs spark
```
:::

To query a table (if available):

```sql
SELECT count(*) FROM `default.taxis`;
```

```sql title="Response"
┌─count()─┐
│ 2171187 │
└─────────┘
```

:::note Backticks required
Backticks are required because ClickHouse doesn't support more than one namespace, so the Iceberg namespace and table name are combined into a single quoted identifier.
:::
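
For example, if you qualify the table with the database name instead of issuing `USE demo` first, the whole Iceberg identifier still goes inside a single pair of backticks (a small illustrative variation on the query above):

```sql
-- `default.taxis` is one table name here, not a database/table pair.
SELECT count(*) FROM demo.`default.taxis`;
```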

To inspect the table DDL:

```sql
SHOW CREATE TABLE `default.taxis`;
```

```sql title="Response"
┌─statement──────────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE demo.`default.taxis`                                                                   │
│ (                                                                                                   │
│     `VendorID` Nullable(Int64),                                                                     │
│     `tpep_pickup_datetime` Nullable(DateTime64(6)),                                                 │
│     `tpep_dropoff_datetime` Nullable(DateTime64(6)),                                                │
│     `passenger_count` Nullable(Float64),                                                            │
│     `trip_distance` Nullable(Float64),                                                              │
│     `RatecodeID` Nullable(Float64),                                                                 │
│     `store_and_fwd_flag` Nullable(String),                                                          │
│     `PULocationID` Nullable(Int64),                                                                 │
│     `DOLocationID` Nullable(Int64),                                                                 │
│     `payment_type` Nullable(Int64),                                                                 │
│     `fare_amount` Nullable(Float64),                                                                │
│     `extra` Nullable(Float64),                                                                      │
│     `mta_tax` Nullable(Float64),                                                                    │
│     `tip_amount` Nullable(Float64),                                                                 │
│     `tolls_amount` Nullable(Float64),                                                               │
│     `improvement_surcharge` Nullable(Float64),                                                      │
│     `total_amount` Nullable(Float64),                                                               │
│     `congestion_surcharge` Nullable(Float64),                                                       │
│     `airport_fee` Nullable(Float64)                                                                 │
│ )                                                                                                   │
│ ENGINE = Iceberg('http://minio:9002/warehouse-rest/warehouse/default/taxis/', 'minio', '[HIDDEN]')  │
└─────────────────────────────────────────────────────────────────────────────────────────────────────┘
```

## Loading data from your Data Lake into ClickHouse {#loading-data-from-your-data-lake-into-clickhouse}

If you need to load data from the Lakekeeper catalog into ClickHouse, start by creating a local ClickHouse table:

```sql
CREATE TABLE taxis
(
    `VendorID` Int64,
    `tpep_pickup_datetime` DateTime64(6),
    `tpep_dropoff_datetime` DateTime64(6),
    `passenger_count` Float64,
    `trip_distance` Float64,
    `RatecodeID` Float64,
    `store_and_fwd_flag` String,
    `PULocationID` Int64,
    `DOLocationID` Int64,
    `payment_type` Int64,
    `fare_amount` Float64,
    `extra` Float64,
    `mta_tax` Float64,
    `tip_amount` Float64,
    `tolls_amount` Float64,
    `improvement_surcharge` Float64,
    `total_amount` Float64,
    `congestion_surcharge` Float64,
    `airport_fee` Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(tpep_pickup_datetime)
ORDER BY (VendorID, tpep_pickup_datetime, PULocationID, DOLocationID);
```

Then load the data from your Lakekeeper catalog table via an `INSERT INTO SELECT`:

```sql
INSERT INTO taxis
SELECT * FROM demo.`default.taxis`;
```
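
As a quick sanity check (not part of the original guide), you can confirm that the local copy matches the catalog table:

```sql
-- Row counts should match once the INSERT has completed.
SELECT
    (SELECT count(*) FROM taxis) AS local_rows,
    (SELECT count(*) FROM demo.`default.taxis`) AS catalog_rows;
```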
