4146 add troubleshooting section #4181


Merged

57 commits merged on Aug 21, 2025

Commits

32feb0b
Troubleshooting page, temporary placement
dhtclk Jul 26, 2025
60f3fbe
Merge branch 'main' of https://github.com/ClickHouse/clickhouse-docs …
dhtclk Jul 26, 2025
c619204
Merge branch 'main' of https://github.com/ClickHouse/clickhouse-docs …
dhtclk Jul 28, 2025
aa15a8a
Merge branch 'main' of https://github.com/ClickHouse/clickhouse-docs …
dhtclk Aug 1, 2025
0c7d189
Merge branch 'main' of https://github.com/ClickHouse/clickhouse-docs …
dhtclk Aug 4, 2025
7c94a46
Adding Lessons Learned Guide with Interactable Queries
dhtclk Aug 6, 2025
9166323
Split into multiple guides under a new section
dhtclk Aug 6, 2025
2bbe1bc
Merge branch 'main' of https://github.com/ClickHouse/clickhouse-docs …
dhtclk Aug 7, 2025
bf67a27
Keywords, cross-linking, clean-up
dhtclk Aug 7, 2025
e10f908
adding ask ai link to troubleshooting, simple kapa link component.
dhtclk Aug 7, 2025
dc751b2
commenting out C++ link
dhtclk Aug 7, 2025
0ba3eb4
spelling and dictionary update
dhtclk Aug 7, 2025
e9a6adf
Update docs/tips-and-tricks/too-many-parts.md
dhtclk Aug 8, 2025
0f7d07e
Update docs/tips-and-tricks/too-many-parts.md
dhtclk Aug 8, 2025
7159177
Update docs/tips-and-tricks/too-many-parts.md
dhtclk Aug 8, 2025
03abb7d
Update docs/tips-and-tricks/too-many-parts.md
dhtclk Aug 8, 2025
8e8fd7b
Update docs/tips-and-tricks/debugging-toolkit.md
dhtclk Aug 8, 2025
4867bbe
Update docs/tips-and-tricks/cost-optimization.md
dhtclk Aug 8, 2025
ec2036f
Merge branch 'main' of https://github.com/ClickHouse/clickhouse-docs …
dhtclk Aug 11, 2025
de5e966
Rewriting Creative Use Cases
dhtclk Aug 11, 2025
21e518c
fix formatting
dhtclk Aug 11, 2025
33935f4
Rewrite cost-optimization doc
dhtclk Aug 11, 2025
f3806da
Performance Optimization Guide
dhtclk Aug 11, 2025
d8e3ca2
Too Many Parts
dhtclk Aug 11, 2025
395d484
MVs and Debugging Toolkit
dhtclk Aug 11, 2025
aa308be
Fixing nav link
dhtclk Aug 11, 2025
bc0e0eb
adding to dictionary
dhtclk Aug 11, 2025
1dd1100
fixing dictionary
dhtclk Aug 11, 2025
f37c70a
adding header ids
dhtclk Aug 11, 2025
c0cf0fa
removing garbage AI quotes
dhtclk Aug 12, 2025
d546215
removing another garbage quote and fixing capitalization
dhtclk Aug 12, 2025
6e98868
fixing another quote
dhtclk Aug 12, 2025
12fb3ef
rewriting debugging insights
dhtclk Aug 12, 2025
500d515
slight header change
dhtclk Aug 12, 2025
fc5d1a1
adding header ids
dhtclk Aug 12, 2025
267527c
Further pruning innaccuracies and renaming debugging toolkit
dhtclk Aug 12, 2025
32e1d78
Update docs/tips-and-tricks/community-wisdom.md
dhtclk Aug 13, 2025
2ef96e4
Update docs/tips-and-tricks/community-wisdom.md
dhtclk Aug 13, 2025
fa9d8e3
Update docs/tips-and-tricks/community-wisdom.md
dhtclk Aug 13, 2025
d337d25
Update docs/tips-and-tricks/cost-optimization.md
dhtclk Aug 13, 2025
765216f
Update docs/tips-and-tricks/cost-optimization.md
dhtclk Aug 13, 2025
7c913dc
Apply suggestions from code review
dhtclk Aug 13, 2025
3083430
Apply suggestions from code review
dhtclk Aug 13, 2025
cc2bab0
Apply suggestions from code review
dhtclk Aug 13, 2025
98c1bbd
Apply suggestions from code review
dhtclk Aug 13, 2025
d5b9dfb
Fixing sentence casing, adding new line at the end of tsx file per co…
dhtclk Aug 13, 2025
4c0ab4d
Merge branch '4146-add-troubleshooting-section' of https://github.com…
dhtclk Aug 13, 2025
a3430f3
Improvements/rework based on Lio's suggestions, removed unhelpful que…
dhtclk Aug 14, 2025
ad87ffe
Angry spellchecker
dhtclk Aug 14, 2025
88e4956
more spelling errors
dhtclk Aug 14, 2025
8da0fa9
okay, now I'm just annoyed
dhtclk Aug 14, 2025
e699a22
minor tweak to header in too many parts
dhtclk Aug 14, 2025
f1d0caa
Adding to too many parts
dhtclk Aug 14, 2025
1c854a8
fixed troubleshooting color issue, moved troubleshooting. Removing bl…
dhtclk Aug 18, 2025
a331284
adding missing file
dhtclk Aug 19, 2025
145827f
Merge branch 'main' of https://github.com/ClickHouse/clickhouse-docs …
dhtclk Aug 21, 2025
fc9ac78
PR fixes debugging insights and troubleshooting
dhtclk Aug 21, 2025
42 changes: 42 additions & 0 deletions docs/tips-and-tricks/community-wisdom.md
@@ -0,0 +1,42 @@
---
sidebar_position: 1
slug: /tips-and-tricks/community-wisdom
sidebar_label: 'Community Wisdom'
doc_type: 'overview'
keywords: [
'database tips',
'community wisdom',
'production troubleshooting',
'performance optimization',
'database debugging',
'clickhouse guides',
'real world examples',
'database best practices',
'meetup insights',
'production lessons',
'interactive tutorials',
'database solutions'
]
title: 'ClickHouse community wisdom'
description: 'Learn from the ClickHouse community with real world scenarios and lessons learned'
---

# ClickHouse community wisdom: tips and tricks from meetups {#community-wisdom}

*These interactive guides represent collective wisdom from hundreds of production deployments. Each runnable example helps you understand ClickHouse patterns using real GitHub events data. Practice these concepts to avoid common mistakes and accelerate your success.*

Combine this collected knowledge with our [Best Practices](/best-practices) guide for an optimal ClickHouse experience.

## Problem-specific quick jumps {#problem-specific-quick-jumps}

| Issue | Document | Description |
|-------|---------|-------------|
| **Production issue** | [Debugging insights](./debugging-insights.md) | Community production debugging tips |
| **Slow queries** | [Performance optimization](./performance-optimization.md) | Community tips for optimizing query performance |
| **Materialized views** | [MV double-edged sword](./materialized-views.md) | Avoiding a 10x storage explosion from materialized views |
| **Too many parts** | [Too many parts](./too-many-parts.md) | Addressing the 'Too Many Parts' error and performance slowdown |
| **High costs** | [Cost optimization](./cost-optimization.md) | Community strategies for reducing storage and compute costs |
| **Success stories** | [Success stories](./success-stories.md) | Examples of ClickHouse in successful use cases |

**Last Updated:** Based on community meetup insights through 2024-2025
**Contributing:** Found a mistake or have a new lesson? Community contributions welcome
94 changes: 94 additions & 0 deletions docs/tips-and-tricks/cost-optimization.md
@@ -0,0 +1,94 @@
---
sidebar_position: 1
slug: /community-wisdom/cost-optimization
sidebar_label: 'Cost Optimization'
doc_type: 'how-to-guide'
keywords: [
'cost optimization',
'storage costs',
'partition management',
'data retention',
'storage analysis',
'database optimization',
'clickhouse cost reduction',
'storage hot spots',
'ttl performance',
'disk usage',
'compression strategies',
'retention analysis'
]
title: 'Lessons - cost optimization'
description: 'Cost optimization strategies from ClickHouse community meetups with real production examples and verified techniques.'
---

# Cost optimization: strategies from the community {#cost-optimization}
*This guide is part of a collection of findings gained from community meetups. The findings on this page cover cost-optimization techniques that community members applied to their ClickHouse deployments and that worked well for their specific experience and setup. For more real world solutions and insights you can [browse by specific problem](./community-wisdom.md).*

*Learn about how [ClickHouse Cloud can help manage operational costs](/cloud/overview)*.

## Compression strategy: LZ4 vs ZSTD in production {#compression-strategy}

When Microsoft Clarity needed to handle hundreds of terabytes of data, they discovered that compression choices have dramatic cost implications. At their scale, every bit of storage savings matters, and they faced a classic trade-off: performance versus storage costs. Microsoft Clarity handles massive volumes—two petabytes of uncompressed data per month across all accounts, processing around 60,000 queries per hour across eight nodes and serving billions of page views from millions of websites. At this scale, compression strategy becomes a critical cost factor.

They initially used ClickHouse's default [LZ4](/sql-reference/statements/create/table#lz4) compression but discovered significant cost savings were possible with [ZSTD](/sql-reference/statements/create/table#zstd). While LZ4 is faster, ZSTD provides better compression at the cost of slightly slower performance. After testing both approaches, they made a strategic decision to prioritize storage savings. The results were significant: 50% storage savings on large tables with manageable performance impact on ingestion and queries.

**Key results:**
- 50% storage savings on large tables through ZSTD compression
- 2 petabytes monthly data processing capacity
- Manageable performance impact on ingestion and queries
- Significant cost reduction at hundreds of TB scale
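
Compression codecs are set per column (or as a table-level default). The following is a minimal sketch of what switching from the default LZ4 to ZSTD can look like; the `events` table, column names, and ZSTD levels are illustrative assumptions, not Microsoft Clarity's actual schema:

```sql
-- Hypothetical table using ZSTD instead of the default LZ4
CREATE TABLE events
(
    timestamp  DateTime CODEC(Delta, ZSTD(1)),
    session_id UInt64   CODEC(ZSTD(1)),
    page_url   String   CODEC(ZSTD(3)),  -- heavier compression for large text columns
    payload    String   CODEC(ZSTD(3))
)
ENGINE = MergeTree
ORDER BY (session_id, timestamp);

-- Verify the effect: compressed vs. uncompressed bytes per column
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE database = currentDatabase() AND table = 'events';
```

Testing both codecs on a copy of your own data, as Clarity did, is the only reliable way to weigh the ingestion and query slowdown against the storage savings.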

## Column-based retention strategy {#column-retention}

One of the most powerful cost optimization techniques comes from analyzing which columns are actually being used. Microsoft Clarity implements sophisticated column-based retention strategies using ClickHouse's built-in telemetry capabilities. ClickHouse provides detailed metrics on storage usage by column as well as comprehensive query patterns: which columns are accessed, how frequently, query duration, and overall usage statistics.

This data-driven approach enables strategic decisions about retention policies and column lifecycle management. By analyzing this telemetry data, Microsoft can identify storage hot spots - columns that consume significant space but receive minimal queries. For these low-usage columns, they can implement aggressive retention policies, reducing storage time from 30 months to just one month, or delete the columns entirely if they're not queried at all. This selective retention strategy reduces storage costs without impacting user experience.

**The strategy:**
- Analyze column usage patterns using ClickHouse telemetry
- Identify high-storage, low-query columns
- Implement selective retention policies
- Monitor query patterns for data-driven decisions
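
A starting point for this kind of analysis, assuming you can read the `system.columns` table (a sketch, not Clarity's actual tooling), is to rank columns by how much compressed storage they consume and then cross-check the biggest ones against your query patterns:

```sql
-- Storage hot spots: columns that consume the most space
SELECT
    database,
    table,
    name AS column_name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE database NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA')
ORDER BY data_compressed_bytes DESC
LIMIT 20;
```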

**Related docs**
- [Managing Data - Column Level TTL](/observability/managing-data)

## Partition-based data management {#partition-management}

Microsoft Clarity discovered that partitioning strategy impacts both performance and operational simplicity. Their approach: partition by date, order by hour. This strategy delivers multiple benefits beyond just cleanup efficiency—it enables trivial data cleanup, simplifies billing calculations for their customer-facing service, and supports GDPR compliance requirements for row-based deletion.

> **Reviewer comment (Contributor):** Link to resources on how to manage partitions in ClickHouse.

**Key benefits:**
- Trivial data cleanup (drop partition vs row-by-row deletion)
- Simplified billing calculations
- Better query performance through partition elimination
- Easier operational management
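
A minimal sketch of the pattern with a hypothetical `page_views` table (not Clarity's schema): partition by day so that retention and cleanup become cheap partition drops rather than row-by-row deletes.

```sql
CREATE TABLE page_views
(
    event_date Date,
    event_hour UInt8,
    session_id UInt64,
    url        String
)
ENGINE = MergeTree
PARTITION BY event_date            -- partition by date
ORDER BY (event_hour, session_id); -- order by hour within each day

-- Dropping a whole day of data is a cheap metadata operation
ALTER TABLE page_views DROP PARTITION '2024-01-15';
```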

**Related docs**
- [Managing Data - Partitions](/observability/managing-data#partitions)

## String-to-integer conversion strategy {#string-integer-conversion}

Analytics platforms often face a storage challenge with categorical data that appears repeatedly across millions of rows. Microsoft's engineering team encountered this problem with their search analytics data and developed an effective solution that achieved 60% storage reduction on affected datasets.

In Microsoft's web analytics system, search results trigger different types of answers - weather cards, sports information, news articles, and factual responses. Each query result was tagged with descriptive strings like "weather_answer," "sports_answer," or "factual_answer." With billions of search queries processed, these string values were being stored repeatedly in ClickHouse, consuming massive amounts of storage space and requiring expensive string comparisons during queries.

Microsoft implemented a string-to-integer mapping system using a separate MySQL database. Instead of storing the actual strings in ClickHouse, they store only integer IDs. When users run queries through the UI and request data for `weather_answer`, their query optimizer first consults the MySQL mapping table to get the corresponding integer ID, then converts the query to use that integer before sending it to ClickHouse.

> **Reviewer comment (Contributor):** I wonder if the mapping solution could be implemented using a Dictionary here instead of MySQL. I understand we want to share the story as-is from the customer, but maybe we could suggest a "better" solution if one exists in ClickHouse.

This architecture preserves the user experience - people still see meaningful labels like `weather_answer` in their dashboards - while the backend storage and queries operate on much more efficient integers. The mapping system handles all translation transparently, requiring no changes to the user interface or user workflows.

**Key benefits:**
- 60% storage reduction on affected datasets
- Faster query performance on integer comparisons
- Reduced memory usage for joins and aggregations
- Lower network transfer costs for large result sets

:::note
This is an example specific to Microsoft Clarity's data scenario. If you have all your data in ClickHouse or do not have constraints against moving data to ClickHouse, try using [dictionaries](/dictionary) instead.
:::
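
As a sketch of that dictionary-based alternative (hypothetical table, dictionary, and column names): keep the integer-to-label mapping inside ClickHouse and resolve labels at query time instead of round-tripping to an external database.

```sql
-- Small mapping table that replaces the external MySQL lookup
CREATE TABLE answer_type_map
(
    id    UInt64,
    label String
)
ENGINE = MergeTree
ORDER BY id;

-- In-memory dictionary over the mapping table
CREATE DICTIONARY answer_type_dict
(
    id    UInt64,
    label String
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(TABLE 'answer_type_map'))
LIFETIME(MIN 300 MAX 600)
LAYOUT(HASHED());

-- Rows store only the integer ID; labels are resolved when queried
SELECT
    dictGet('answer_type_dict', 'label', toUInt64(answer_type_id)) AS answer_type,
    count() AS queries
FROM search_events
GROUP BY answer_type;
```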

## Video sources {#video-sources}

- **[Microsoft Clarity and ClickHouse](https://www.youtube.com/watch?v=rUVZlquVGw0)** - Microsoft Clarity Team
- **[ClickHouse journey in Contentsquare](https://www.youtube.com/watch?v=zvuCBAl2T0Q)** - Doron Hoffman & Guram Sigua (ContentSquare)

*These community cost optimization insights represent strategies from companies processing hundreds of terabytes to petabytes of data, showing real-world approaches to reducing ClickHouse operational costs.*
175 changes: 175 additions & 0 deletions docs/tips-and-tricks/debugging-insights.md
@@ -0,0 +1,175 @@
---
sidebar_position: 1
slug: /community-wisdom/debugging-insights
sidebar_label: 'Debugging Insights'
doc_type: 'how-to-guide'
keywords: [
'clickhouse troubleshooting',
'clickhouse errors',
'slow queries',
'memory problems',
'connection issues',
'performance optimization',
'database errors',
'configuration problems',
'debug',
'solutions'
]
title: 'Lessons - debugging insights'
description: 'Find solutions to the most common ClickHouse problems including slow queries, memory errors, connection issues, and configuration problems.'
---

# ClickHouse operations: community debugging insights {#clickhouse-operations-community-debugging-insights}
*This guide is part of a collection of findings gained from community meetups. For more real world solutions and insights you can [browse by specific problem](./community-wisdom.md).*
*Suffering from high operational costs? Check out the [Cost Optimization](./cost-optimization.md) community insights guide.*

## Essential system tables {#essential-system-tables}

> **Reviewer comment (Contributor):** Could we add data samples of what those queries return and how to use the result or act on it? Maybe some of them can be illustrated with examples from the Common production issues section.

These system tables are fundamental for production debugging:

### system.errors {#system-errors}

Shows all active errors in your ClickHouse instance.

```sql
SELECT name, value, changed
FROM system.errors
WHERE value > 0
ORDER BY value DESC;
```

### system.replicas {#system-replicas}

Contains replication lag and status information for monitoring cluster health.

```sql
SELECT database, table, replica_name, absolute_delay, queue_size, inserts_in_queue
FROM system.replicas
WHERE absolute_delay > 60
ORDER BY absolute_delay DESC;
```

### system.replication_queue {#system-replication-queue}

Provides detailed information for diagnosing replication problems.

```sql
SELECT database, table, replica_name, position, type, create_time, last_exception
FROM system.replication_queue
WHERE last_exception != ''
ORDER BY create_time DESC;
```

### system.merges {#system-merges}

Shows current merge operations and can identify stuck processes.

```sql
SELECT database, table, elapsed, progress, is_mutation, total_size_bytes_compressed
FROM system.merges
ORDER BY elapsed DESC;
```

### system.parts {#system-parts}

Essential for monitoring part counts and identifying fragmentation issues.

```sql
SELECT database, table, count() as part_count
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY count() DESC;
```

## Common production issues {#common-production-issues}

### Disk space problems {#disk-space-problems}

Disk space exhaustion in replicated setups creates cascading problems. When one node runs out of space, other nodes continue trying to sync with it, causing network traffic spikes and confusing symptoms. One community member spent 4 hours debugging what was simply low disk space. Check out this [query](/knowledgebase/useful-queries-for-troubleshooting#show-disk-storage-number-of-parts-number-of-rows-in-systemparts-and-marks-across-databases) to monitor your disk storage on a particular cluster.

AWS users should be aware that default general purpose EBS volumes have a 16TB limit.
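
As a complement to that query, a quick sketch for checking free space per disk, assuming access to the `system.disks` table:

```sql
SELECT
    name,
    path,
    formatReadableSize(free_space)  AS free,
    formatReadableSize(total_space) AS total,
    round(free_space / total_space * 100, 2) AS free_pct
FROM system.disks;
```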

### Too many parts error {#too-many-parts-error}

Small frequent inserts create performance problems. The community has identified that insert rates above 10 per second often trigger "too many parts" errors because ClickHouse cannot merge parts fast enough.

**Solutions:**
- Batch data using 30-second or 200MB thresholds
- Enable async_insert for automatic batching
- Use buffer tables for server-side batching
- Configure Kafka for controlled batch sizes

[Official recommendation](/best-practices/selecting-an-insert-strategy#batch-inserts-if-synchronous): minimum 1,000 rows per insert, ideally 10,000 to 100,000.

### Invalid timestamp issues {#data-quality-issues}

Applications that send data with arbitrary timestamps create partition problems. This leads to partitions with data from unrealistic dates (like 1998 or 2050), causing unexpected storage behavior.
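
One way to spot this, assuming the affected tables are partitioned by a date or time expression, is to list active partitions and look for values far in the past or future (a sketch; adjust to your partitioning scheme):

```sql
SELECT
    database,
    table,
    partition,
    sum(rows) AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY database, table, partition;
-- Partitions named like '1998-...' or '2050-...' usually point to bad client timestamps
```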

### `ALTER` operation risks {#alter-operation-risks}

Large `ALTER` operations on multi-terabyte tables can consume significant resources and potentially lock databases. One community example involved changing an Integer to a Float on 14TB of data, which locked the entire database and required rebuilding from backups.

**Monitor expensive mutations:**

```sql
SELECT database, table, mutation_id, command, parts_to_do, is_done
FROM system.mutations
WHERE is_done = 0;
```

> **Reviewer comment (Contributor):** I don't think this query prevents locking the database in the case of an expensive mutation. This query simply monitors the progress of ongoing mutations.
>
> **Author reply (Collaborator):** Edited to monitoring mutations.

Test schema changes on smaller datasets first.

## Memory and performance {#memory-and-performance}

### External aggregation {#external-aggregation}

Enable external aggregation for memory-intensive operations by setting `max_bytes_before_external_group_by`. When a large `GROUP BY` exceeds the threshold, ClickHouse spills intermediate data to disk; this is slower, but it prevents out-of-memory crashes. You can learn more about this setting [here](/operations/settings/settings#max_bytes_before_external_group_by).

```sql
SELECT
column1,
column2,
COUNT(*) as count,
SUM(value) as total
FROM large_table
GROUP BY column1, column2
SETTINGS max_bytes_before_external_group_by = 1000000000; -- 1GB threshold
```

### Async insert details {#async-insert-details}

Async insert automatically batches small inserts server-side to improve performance. You can configure whether to wait for data to be written to disk before returning acknowledgment - immediate return is faster but less durable. Modern versions support deduplication to handle duplicate data within batches.
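
A per-insert sketch of the settings involved, using a hypothetical `events` table and illustrative values:

```sql
-- Let the server batch small inserts; return as soon as the data is buffered.
-- Set wait_for_async_insert = 1 to wait until the batch is flushed (slower, more durable).
INSERT INTO events
SETTINGS async_insert = 1, wait_for_async_insert = 0
VALUES (now(), 'page_view', 42);
```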

**Related docs**
- [Selecting an insert strategy](/best-practices/selecting-an-insert-strategy#asynchronous-inserts)

### Distributed table configuration {#distributed-table-configuration}

By default, distributed tables use single-threaded inserts. Enable `insert_distributed_sync` for parallel processing and immediate data sending to shards.

Monitor temporary data accumulation when using distributed tables.

> **Reviewer comment (Contributor):** Can we share a query to monitor the temporary data accumulation?
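
A sketch of both pieces, assuming a Distributed table named `events_distributed`: enable synchronous distribution for an insert, and check `system.distribution_queue` for temporary data that is still waiting to be sent to shards.

```sql
-- Send this insert to all shards synchronously
INSERT INTO events_distributed
SETTINGS insert_distributed_sync = 1
VALUES (now(), 'page_view', 42);

-- Temporary data accumulating in the distributed send queue
SELECT
    database,
    table,
    data_files,
    formatReadableSize(data_compressed_bytes) AS pending_size,
    broken_data_files
FROM system.distribution_queue;
```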


### Performance monitoring thresholds {#performance-monitoring-thresholds}

Community-recommended monitoring thresholds:
- Parts per partition: preferably less than 100 (see the query below)
- Delayed inserts: should stay at zero
- Insert rate: limit to about 1 per second for optimal performance

> **Reviewer comment (Contributor):** Do we have any official best practice for parts per partition?
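
A quick check for the first threshold, counting active parts per partition with `system.parts`:

```sql
SELECT
    database,
    table,
    partition,
    count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 20;
```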

**Related docs**
- [Custom partitioning key](/engines/table-engines/mergetree-family/custom-partitioning-key)

## Quick reference {#quick-reference}

| Issue | Detection | Solution |
|-------|-----------|----------|
| Disk Space | Check `system.parts` total bytes | Monitor usage, plan scaling |
| Too Many Parts | Count parts per table | Batch inserts, enable async_insert |
| Replication Lag | Check `system.replicas` delay | Monitor network, restart replicas |
| Bad Data | Validate partition dates | Implement timestamp validation |
| Stuck Mutations | Check `system.mutations` status | Test on small data first |

### Video sources {#video-sources}
- [10 Lessons from Operating ClickHouse](https://www.youtube.com/watch?v=liTgGiTuhJE)
- [Fast, Concurrent, and Consistent Asynchronous INSERTS in ClickHouse](https://www.youtube.com/watch?v=AsMPEfN5QtM)
69 changes: 69 additions & 0 deletions docs/tips-and-tricks/materialized-views.md
@@ -0,0 +1,69 @@
---
sidebar_position: 1
slug: /tips-and-tricks/materialized-views
sidebar_label: 'Materialized Views'
doc_type: 'how-to'
keywords: [
'clickhouse materialized views',
'materialized view optimization',
'materialized view storage issues',
'materialized view best practices',
'database aggregation patterns',
'materialized view anti-patterns',
'storage explosion problems',
'materialized view performance',
'database view optimization',
'aggregation strategy',
'materialized view troubleshooting',
'view storage overhead'
]
title: 'Lessons - materialized views'
description: 'Real world examples of materialized views, problems and solutions'
---

# Materialized views: how they can become a double edged sword {#materialized-views-the-double-edged-sword}

*This guide is part of a collection of findings gained from community meetups. For more real world solutions and insights you can [browse by specific problem](./community-wisdom.md).*
*Too many parts bogging your database down? Check out the [Too Many Parts](./too-many-parts.md) community insights guide.*
*Learn more about [Materialized Views](/materialized-views).*

## The 10x storage anti-pattern {#storage-antipattern}

**Real production problem:** *"We had a materialized view. The raw log table was around 20 gig but the view from that log table exploded to 190 gig, so almost 10x the size of the raw table. This happened because we were creating one row per attribute and each log can have 10 attributes."*

**Rule:** If your `GROUP BY` creates more rows than it eliminates, you're building an expensive index, not a materialized view.

## Production materialized view health validation {#mv-health-validation}

This query helps you predict whether a materialized view will compress or explode your data before you create it. Run it against your actual table and columns to avoid the "190GB explosion" scenario.

**What it shows:**
- **Low aggregation ratio** (\<10%) = Good MV, significant compression
- **High aggregation ratio** (\>70%) = Bad MV, storage explosion risk
- **Storage multiplier** = How much bigger/smaller your MV will be

```sql
-- Replace with your actual table and columns
SELECT
count() as total_rows,
uniq(your_group_by_columns) as unique_combinations,
round(uniq(your_group_by_columns) / count() * 100, 2) as aggregation_ratio
FROM your_table
WHERE your_filter_conditions;

-- If aggregation_ratio > 70%, reconsider your MV design
-- If aggregation_ratio < 10%, you'll get good compression
```

## When materialized views become a problem {#mv-problems}

**Warning signs to monitor:**
- Insert latency increases (queries that took 10ms now take 100ms+)
- "Too many parts" errors appearing more frequently
- CPU spikes during insert operations
- Insert timeouts that didn't happen before

You can compare insert performance before and after adding MVs using `system.query_log` to track query duration trends.
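
A sketch of that comparison, assuming the query log is enabled; substitute your own `database.table` name:

```sql
-- Average INSERT duration per hour for one table; compare the trend
-- from before and after the materialized view was attached
SELECT
    toStartOfHour(event_time) AS hour,
    count() AS inserts,
    round(avg(query_duration_ms), 1) AS avg_duration_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind = 'Insert'
  AND has(tables, 'your_db.your_table')
GROUP BY hour
ORDER BY hour;
```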

## Video sources {#video-sources}
- [ClickHouse at CommonRoom - Kirill Sapchuk](https://www.youtube.com/watch?v=liTgGiTuhJE) - Source of the "over enthusiastic about materialized views" and "20GB→190GB explosion" case study