Commit c0cf844: Merge pull request #4181 from ClickHouse/4146-add-troubleshooting-section ("4146 add troubleshooting section"). 2 parents: c4a4898 + fc9ac78.

19 files changed: +1008 −39 lines
Lines changed: 42 additions & 0 deletions
---
sidebar_position: 1
slug: /tips-and-tricks/community-wisdom
sidebar_label: 'Community Wisdom'
doc_type: 'overview'
keywords: [
  'database tips',
  'community wisdom',
  'production troubleshooting',
  'performance optimization',
  'database debugging',
  'clickhouse guides',
  'real world examples',
  'database best practices',
  'meetup insights',
  'production lessons',
  'interactive tutorials',
  'database solutions'
]
title: 'ClickHouse community wisdom'
description: 'Learn from the ClickHouse community with real world scenarios and lessons learned'
---

# ClickHouse community wisdom: tips and tricks from meetups {#community-wisdom}

*These interactive guides represent collective wisdom from hundreds of production deployments. Each runnable example helps you understand ClickHouse patterns using real GitHub events data. Practice these concepts to avoid common mistakes and accelerate your success.*

Combine this collected knowledge with our [Best Practices](/best-practices) guide for an optimal ClickHouse experience.

## Problem-specific quick jumps {#problem-specific-quick-jumps}

| Issue | Document | Description |
|-------|----------|-------------|
| **Production issue** | [Debugging insights](./debugging-insights.md) | Community production debugging tips |
| **Slow queries** | [Performance optimization](./performance-optimization.md) | Optimize query performance |
| **Materialized views** | [MV double-edged sword](./materialized-views.md) | Avoid 10x storage explosions |
| **Too many parts** | [Too many parts](./too-many-parts.md) | Address the 'too many parts' error and the performance slowdown it causes |
| **High costs** | [Cost optimization](./cost-optimization.md) | Reduce operational costs |
| **Success stories** | [Success stories](./success-stories.md) | Examples of ClickHouse in successful use cases |

**Last updated:** Based on community meetup insights through 2024-2025
**Contributing:** Found a mistake or have a new lesson? Community contributions are welcome.
Lines changed: 94 additions & 0 deletions
---
sidebar_position: 1
slug: /community-wisdom/cost-optimization
sidebar_label: 'Cost Optimization'
doc_type: 'how-to-guide'
keywords: [
  'cost optimization',
  'storage costs',
  'partition management',
  'data retention',
  'storage analysis',
  'database optimization',
  'clickhouse cost reduction',
  'storage hot spots',
  'ttl performance',
  'disk usage',
  'compression strategies',
  'retention analysis'
]
title: 'Lessons - cost optimization'
description: 'Cost optimization strategies from ClickHouse community meetups with real production examples and verified techniques.'
---

# Cost optimization: strategies from the community {#cost-optimization}

*This guide is part of a collection of findings gained from community meetups. The findings on this page cover community wisdom on optimizing ClickHouse costs: techniques that worked well for each company's specific experience and setup. For more real-world solutions and insights you can [browse by specific problem](./community-wisdom.md).*

*Learn about how [ClickHouse Cloud can help manage operational costs](/cloud/overview)*.

## Compression strategy: LZ4 vs ZSTD in production {#compression-strategy}

When Microsoft Clarity needed to handle hundreds of terabytes of data, they discovered that compression choices have dramatic cost implications. At their scale, every bit of storage savings matters, and they faced a classic trade-off: performance versus storage costs. Microsoft Clarity handles massive volumes: two petabytes of uncompressed data per month across all accounts, around 60,000 queries per hour across eight nodes, and billions of page views from millions of websites. At this scale, compression strategy becomes a critical cost factor.

They initially used ClickHouse's default [LZ4](/sql-reference/statements/create/table#lz4) compression but discovered that significant cost savings were possible with [ZSTD](/sql-reference/statements/create/table#zstd). While LZ4 is faster, ZSTD provides better compression at the cost of slightly slower performance. After testing both approaches, they made a strategic decision to prioritize storage savings. The results were significant: 50% storage savings on large tables, with manageable performance impact on ingestion and queries.

**Key results:**
- 50% storage savings on large tables through ZSTD compression
- 2 petabytes monthly data processing capacity
- Manageable performance impact on ingestion and queries
- Significant cost reduction at hundreds-of-terabytes scale
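
As a sketch of what this choice looks like in DDL, per-column codecs let you mix both algorithms; the table and columns below are illustrative, not Clarity's actual schema:

```sql
-- Hypothetical events table: ZSTD on large payload columns for storage
-- savings, default LZ4 elsewhere for speed.
CREATE TABLE page_events
(
    event_time DateTime,
    session_id UInt64,
    url String CODEC(ZSTD(1)),      -- better compression ratio, slightly slower
    payload String CODEC(ZSTD(3))   -- higher level: smaller output, more CPU
)
ENGINE = MergeTree
ORDER BY (session_id, event_time);
```

Raising the ZSTD level trades more CPU for a smaller footprint, so benchmark both ingestion and query latency before committing to it fleet-wide.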

## Column-based retention strategy {#column-retention}

One of the most powerful cost optimization techniques comes from analyzing which columns are actually being used. Microsoft Clarity implements sophisticated column-based retention strategies using ClickHouse's built-in telemetry capabilities. ClickHouse provides detailed metrics on storage usage by column as well as comprehensive query patterns: which columns are accessed, how frequently, query duration, and overall usage statistics.

This data-driven approach enables strategic decisions about retention policies and column lifecycle management. By analyzing this telemetry data, Microsoft can identify storage hot spots: columns that consume significant space but receive minimal queries. For these low-usage columns, they can implement aggressive retention policies, reducing storage time from 30 months to just one month, or delete the columns entirely if they are not queried at all. This selective retention strategy reduces storage costs without impacting user experience.

**The strategy:**
- Analyze column usage patterns using ClickHouse telemetry
- Identify high-storage, low-query columns
- Implement selective retention policies
- Monitor query patterns for data-driven decisions

**Related docs**
- [Managing Data - Column Level TTL](/observability/managing-data)
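
One way to surface such storage hot spots is the per-column size data in `system.columns`; the table name below is a placeholder:

```sql
-- Per-column compressed vs. uncompressed size for one table,
-- largest consumers first.
SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE database = currentDatabase() AND table = 'your_table'
ORDER BY data_compressed_bytes DESC
LIMIT 20;
```

Cross-referencing the top entries with query patterns from `system.query_log` shows which expensive columns are rarely read.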

## Partition-based data management {#partition-management}

Microsoft Clarity discovered that partitioning strategy impacts both performance and operational simplicity. Their approach: partition by date, order by hour. This strategy delivers multiple benefits beyond cleanup efficiency: it enables trivial data cleanup, simplifies billing calculations for their customer-facing service, and supports GDPR compliance requirements for row-based deletion.

**Key benefits:**
- Trivial data cleanup (dropping a partition instead of row-by-row deletion)
- Simplified billing calculations
- Better query performance through partition elimination
- Easier operational management

**Related docs**
- [Managing Data - Partitions](/observability/managing-data#partitions)
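
A minimal sketch of the "partition by date, order by hour" layout; the table and columns are illustrative:

```sql
CREATE TABLE events_by_day
(
    event_date Date,
    event_time DateTime,
    user_id UInt64,
    payload String
)
ENGINE = MergeTree
PARTITION BY event_date
ORDER BY (event_time, user_id);

-- Retention and cleanup become cheap metadata operations:
ALTER TABLE events_by_day DROP PARTITION '2024-01-15';
```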

## String-to-integer conversion strategy {#string-integer-conversion}

Analytics platforms often face a storage challenge with categorical data that appears repeatedly across millions of rows. Microsoft's engineering team encountered this problem with their search analytics data and developed an effective solution that achieved 60% storage reduction on the affected datasets.

In Microsoft's web analytics system, search results trigger different types of answers - weather cards, sports information, news articles, and factual responses. Each query result was tagged with descriptive strings like "weather_answer," "sports_answer," or "factual_answer." With billions of search queries processed, these string values were stored repeatedly in ClickHouse, consuming massive amounts of storage space and requiring expensive string comparisons during queries.

Microsoft implemented a string-to-integer mapping system using a separate MySQL database. Instead of storing the actual strings in ClickHouse, they store only integer IDs. When users run queries through the UI and request data for `weather_answer`, their query optimizer first consults the MySQL mapping table to get the corresponding integer ID, then rewrites the query to use that integer before sending it to ClickHouse.

This architecture preserves the user experience - people still see meaningful labels like `weather_answer` in their dashboards - while the backend storage and queries operate on much more efficient integers. The mapping system handles all translation transparently, requiring no changes to the user interface or user workflows.

**Key benefits:**
- 60% storage reduction on affected datasets
- Faster query performance on integer comparisons
- Reduced memory usage for joins and aggregations
- Lower network transfer costs for large result sets

:::note
This is an example specific to Microsoft Clarity's data scenario. If all of your data lives in ClickHouse, or you have no constraints against moving data into ClickHouse, try using [dictionaries](/dictionary) instead.
:::
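
For reference, a hedged sketch of the dictionary approach the note suggests; every name below is illustrative:

```sql
-- Mapping table plus a dictionary that caches it in memory.
CREATE TABLE answer_types
(
    id UInt64,
    label String
)
ENGINE = MergeTree
ORDER BY id;

CREATE DICTIONARY answer_type_dict
(
    id UInt64,
    label String
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(TABLE 'answer_types'))
LAYOUT(FLAT())
LIFETIME(MIN 300 MAX 600);

-- Store the integer, resolve the label at query time:
SELECT
    dictGet('answer_type_dict', 'label', answer_type_id) AS answer_type,
    count() AS queries
FROM search_results
GROUP BY answer_type;
```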

## Video sources {#video-sources}

- **[Microsoft Clarity and ClickHouse](https://www.youtube.com/watch?v=rUVZlquVGw0)** - Microsoft Clarity Team
- **[ClickHouse journey in Contentsquare](https://www.youtube.com/watch?v=zvuCBAl2T0Q)** - Doron Hoffman & Guram Sigua (ContentSquare)

*These community cost optimization insights represent strategies from companies processing hundreds of terabytes to petabytes of data, showing real-world approaches to reducing ClickHouse operational costs.*
Lines changed: 175 additions & 0 deletions
---
sidebar_position: 1
slug: /community-wisdom/debugging-insights
sidebar_label: 'Debugging Insights'
doc_type: 'how-to-guide'
keywords: [
  'clickhouse troubleshooting',
  'clickhouse errors',
  'slow queries',
  'memory problems',
  'connection issues',
  'performance optimization',
  'database errors',
  'configuration problems',
  'debug',
  'solutions'
]
title: 'Lessons - debugging insights'
description: 'Find solutions to the most common ClickHouse problems including slow queries, memory errors, connection issues, and configuration problems.'
---

# ClickHouse operations: community debugging insights {#clickhouse-operations-community-debugging-insights}

*This guide is part of a collection of findings gained from community meetups. For more real-world solutions and insights you can [browse by specific problem](./community-wisdom.md).*
*Suffering from high operational costs? Check out the [Cost optimization](./cost-optimization.md) community insights guide.*

## Essential system tables {#essential-system-tables}

These system tables are fundamental for production debugging:
29+
30+
### system.errors {#system-errors}
31+
32+
Shows all active errors in your ClickHouse instance.
33+
34+
```sql
35+
SELECT name, value, changed
36+
FROM system.errors
37+
WHERE value > 0
38+
ORDER BY value DESC;
39+
```
40+
41+
### system.replicas {#system-replicas}
42+
43+
Contains replication lag and status information for monitoring cluster health.
44+
45+
```sql
46+
SELECT database, table, replica_name, absolute_delay, queue_size, inserts_in_queue
47+
FROM system.replicas
48+
WHERE absolute_delay > 60
49+
ORDER BY absolute_delay DESC;
50+
```
51+
52+
### system.replication_queue {#system-replication-queue}
53+
54+
Provides detailed information for diagnosing replication problems.
55+
56+
```sql
57+
SELECT database, table, replica_name, position, type, create_time, last_exception
58+
FROM system.replication_queue
59+
WHERE last_exception != ''
60+
ORDER BY create_time DESC;
61+
```
62+
63+
### system.merges {#system-merges}
64+
65+
Shows current merge operations and can identify stuck processes.
66+
67+
```sql
68+
SELECT database, table, elapsed, progress, is_mutation, total_size_bytes_compressed
69+
FROM system.merges
70+
ORDER BY elapsed DESC;
71+
```
72+
73+
### system.parts {#system-parts}
74+
75+
Essential for monitoring part counts and identifying fragmentation issues.
76+
77+
```sql
78+
SELECT database, table, count() as part_count
79+
FROM system.parts
80+
WHERE active = 1
81+
GROUP BY database, table
82+
ORDER BY count() DESC;
83+
```

## Common production issues {#common-production-issues}

### Disk space problems {#disk-space-problems}

Disk space exhaustion in replicated setups creates cascading problems. When one node runs out of space, the other nodes keep trying to sync with it, causing network traffic spikes and confusing symptoms. One community member spent four hours debugging what turned out to be simply low disk space. Use this [query](/knowledgebase/useful-queries-for-troubleshooting#show-disk-storage-number-of-parts-number-of-rows-in-systemparts-and-marks-across-databases) to monitor disk storage on a particular cluster.

AWS users should be aware that default general purpose EBS volumes have a 16TB limit.

### Too many parts error {#too-many-parts-error}

Small, frequent inserts create performance problems. The community has found that insert rates above 10 inserts per second often trigger "too many parts" errors because ClickHouse cannot merge parts fast enough.

**Solutions:**
- Batch data using 30-second or 200MB thresholds
- Enable `async_insert` for automatic batching
- Use buffer tables for server-side batching
- Configure Kafka for controlled batch sizes

[Official recommendation](/best-practices/selecting-an-insert-strategy#batch-inserts-if-synchronous): minimum 1,000 rows per insert, ideally 10,000 to 100,000.

### Invalid timestamps issues {#data-quality-issues}

Applications that send data with arbitrary timestamps create partition problems. This leads to partitions with data from unrealistic dates (like 1998 or 2050), causing unexpected storage behavior.
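
One defensive option, sketched below with illustrative names and bounds, is to reject obviously invalid timestamps at insert time with a table constraint:

```sql
CREATE TABLE app_events
(
    event_time DateTime,
    message String,
    -- Inserts outside this window fail instead of creating bogus partitions.
    CONSTRAINT sane_timestamp CHECK event_time BETWEEN '2020-01-01 00:00:00' AND '2035-01-01 00:00:00'
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY event_time;
```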
108+
109+
### `ALTER` operation risks {#alter-operation-risks}
110+
111+
Large `ALTER` operations on multi-terabyte tables can consume significant resources and potentially lock databases. One community example involved changing an Integer to a Float on 14TB of data, which locked the entire database and required rebuilding from backups.
112+
113+
**Monitor expensive mutations:**
114+
115+
```sql
116+
SELECT database, table, mutation_id, command, parts_to_do, is_done
117+
FROM system.mutations
118+
WHERE is_done = 0;
119+
```
120+
121+
Test schema changes on smaller datasets first.
122+
123+
## Memory and performance {#memory-and-performance}
124+
125+
### External aggregation {#external-aggregation}
126+
127+
Enable external aggregation for memory-intensive operations. It's slower but prevents out-of-memory crashes by spilling to disk. You can do this by using `max_bytes_before_external_group_by` which will help prevent out of memory crashes on large `GROUP BY` operations. You can learn more about this setting [here](/operations/settings/settings#max_bytes_before_external_group_by).
128+
129+
```sql
130+
SELECT
131+
column1,
132+
column2,
133+
COUNT(*) as count,
134+
SUM(value) as total
135+
FROM large_table
136+
GROUP BY column1, column2
137+
SETTINGS max_bytes_before_external_group_by = 1000000000; -- 1GB threshold
138+
```
139+
140+
### Async insert details {#async-insert-details}
141+
142+
Async insert automatically batches small inserts server-side to improve performance. You can configure whether to wait for data to be written to disk before returning acknowledgment - immediate return is faster but less durable. Modern versions support deduplication to handle duplicate data within batches.
143+
144+
**Related docs**
145+
- [Selecting an insert strategy](/best-practices/selecting-an-insert-strategy#asynchronous-inserts)
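
A minimal sketch of that trade-off; the table is illustrative:

```sql
-- Server-side batching: wait_for_async_insert = 1 returns only after the
-- batch is flushed (safer); 0 acknowledges immediately (faster, less durable).
INSERT INTO events
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES ('2024-01-01 00:00:00', 'click');
```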

### Distributed table configuration {#distributed-table-configuration}

By default, distributed tables use single-threaded inserts. Enable `insert_distributed_sync` for parallel processing and immediate sending of data to the shards.

Monitor temporary data accumulation when using distributed tables.
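
A sketch of enabling synchronous distributed inserts for a single statement; the table name is illustrative:

```sql
-- Data is sent to the shards as part of the INSERT instead of being
-- queued on the initiator node.
INSERT INTO distributed_events
SETTINGS insert_distributed_sync = 1
VALUES ('2024-01-01 00:00:00', 42);
```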

### Performance monitoring thresholds {#performance-monitoring-thresholds}

Community-recommended monitoring thresholds:
- Parts per partition: preferably fewer than 100
- Delayed inserts: should stay at zero
- Insert rate: limit to about 1 insert per second for optimal performance

**Related docs**
- [Custom partitioning key](/engines/table-engines/mergetree-family/custom-partitioning-key)

## Quick reference {#quick-reference}

| Issue | Detection | Solution |
|-------|-----------|----------|
| Disk space | Check `system.parts` total bytes | Monitor usage, plan scaling |
| Too many parts | Count parts per table | Batch inserts, enable `async_insert` |
| Replication lag | Check `system.replicas` delay | Monitor network, restart replicas |
| Bad data | Validate partition dates | Implement timestamp validation |
| Stuck mutations | Check `system.mutations` status | Test on small data first |

### Video sources {#video-sources}
- [10 Lessons from Operating ClickHouse](https://www.youtube.com/watch?v=liTgGiTuhJE)
- [Fast, Concurrent, and Consistent Asynchronous INSERTS in ClickHouse](https://www.youtube.com/watch?v=AsMPEfN5QtM)
Lines changed: 69 additions & 0 deletions
---
sidebar_position: 1
slug: /tips-and-tricks/materialized-views
sidebar_label: 'Materialized Views'
doc_type: 'how-to'
keywords: [
  'clickhouse materialized views',
  'materialized view optimization',
  'materialized view storage issues',
  'materialized view best practices',
  'database aggregation patterns',
  'materialized view anti-patterns',
  'storage explosion problems',
  'materialized view performance',
  'database view optimization',
  'aggregation strategy',
  'materialized view troubleshooting',
  'view storage overhead'
]
title: 'Lessons - materialized views'
description: 'Real world examples of materialized views, problems and solutions'
---

# Materialized views: how they can become a double-edged sword {#materialized-views-the-double-edged-sword}

*This guide is part of a collection of findings gained from community meetups. For more real-world solutions and insights you can [browse by specific problem](./community-wisdom.md).*
*Too many parts bogging your database down? Check out the [Too many parts](./too-many-parts.md) community insights guide.*
*Learn more about [materialized views](/materialized-views).*

## The 10x storage anti-pattern {#storage-antipattern}

**Real production problem:** *"We had a materialized view. The raw log table was around 20 gig but the view from that log table exploded to 190 gig, so almost 10x the size of the raw table. This happened because we were creating one row per attribute and each log can have 10 attributes."*

**Rule:** If your `GROUP BY` creates more rows than it eliminates, you're building an expensive index, not a materialized view.

## Production materialized view health validation {#mv-health-validation}

This query helps you predict whether a materialized view will compress or explode your data before you create it. Run it against your actual table and columns to avoid the "190GB explosion" scenario.

**What it shows:**
- **Low aggregation ratio** (\<10%) = Good MV, significant compression
- **High aggregation ratio** (\>70%) = Bad MV, storage explosion risk
- **Storage multiplier** = How much bigger or smaller your MV will be

```sql
-- Replace with your actual table and columns
SELECT
    count() AS total_rows,
    uniq(your_group_by_columns) AS unique_combinations,
    round(uniq(your_group_by_columns) / count() * 100, 2) AS aggregation_ratio
FROM your_table
WHERE your_filter_conditions;

-- If aggregation_ratio > 70%, reconsider your MV design
-- If aggregation_ratio < 10%, you'll get good compression
```

## When materialized views become a problem {#mv-problems}

**Warning signs to monitor:**
- Insert latency increases (queries that took 10ms now take 100ms+)
- "Too many parts" errors appearing more frequently
- CPU spikes during insert operations
- Insert timeouts that didn't happen before

You can compare insert performance before and after adding MVs using `system.query_log` to track query duration trends.
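
A hedged sketch of that comparison; adjust the time window so it brackets the date the MV was created:

```sql
-- Daily insert latency trend from the query log.
SELECT
    toDate(event_time) AS day,
    count() AS inserts,
    round(avg(query_duration_ms), 1) AS avg_ms,
    max(query_duration_ms) AS max_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind = 'Insert'
  AND event_time >= now() - INTERVAL 14 DAY
GROUP BY day
ORDER BY day;
```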

## Video sources {#video-sources}
- [ClickHouse at CommonRoom - Kirill Sapchuk](https://www.youtube.com/watch?v=liTgGiTuhJE) - Source of the "over enthusiastic about materialized views" and "20GB→190GB explosion" case study
