---
sidebar_position: 1
slug: /community-wisdom/debugging-insights
sidebar_label: 'Debugging Insights'
doc_type: 'how-to-guide'
keywords: [
  'clickhouse troubleshooting',
  'clickhouse errors',
  'slow queries',
  'memory problems',
  'connection issues',
  'performance optimization',
  'database errors',
  'configuration problems',
  'debug',
  'solutions'
]
title: 'Lessons - debugging insights'
description: 'Find solutions to the most common ClickHouse problems including slow queries, memory errors, connection issues, and configuration problems.'
---

# ClickHouse operations: community debugging insights {#clickhouse-operations-community-debugging-insights}
*This guide is part of a collection of findings gained from community meetups. For more real-world solutions and insights, you can [browse by specific problem](./community-wisdom.md).*
*Suffering from high operational costs? Check out the [Cost Optimization](./cost-optimization.md) community insights guide.*

## Essential system tables {#essential-system-tables}

These system tables are fundamental for production debugging:

### system.errors {#system-errors}

Shows every error that has occurred in your ClickHouse instance, along with how many times each was triggered.

```sql
SELECT name, code, value, last_error_time
FROM system.errors
WHERE value > 0
ORDER BY value DESC;
```

### system.replicas {#system-replicas}

Contains replication lag and status information for monitoring cluster health.

```sql
SELECT database, table, replica_name, absolute_delay, queue_size, inserts_in_queue
FROM system.replicas
WHERE absolute_delay > 60
ORDER BY absolute_delay DESC;
```

### system.replication_queue {#system-replication-queue}

Provides detailed information for diagnosing replication problems.

```sql
SELECT database, table, replica_name, position, type, create_time, last_exception
FROM system.replication_queue
WHERE last_exception != ''
ORDER BY create_time DESC;
```

### system.merges {#system-merges}

Shows current merge operations and can identify stuck processes.

```sql
SELECT database, table, elapsed, progress, is_mutation, total_size_bytes_compressed
FROM system.merges
ORDER BY elapsed DESC;
```

### system.parts {#system-parts}

Essential for monitoring part counts and identifying fragmentation issues.

```sql
SELECT database, table, count() AS part_count
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY part_count DESC;
```

## Common production issues {#common-production-issues}

### Disk space problems {#disk-space-problems}

Disk space exhaustion in replicated setups creates cascading problems. When one node runs out of space, the other nodes continue trying to sync with it, causing network traffic spikes and confusing symptoms. One community member spent 4 hours debugging what turned out to be nothing more than low disk space. Check out this [query](/knowledgebase/useful-queries-for-troubleshooting#show-disk-storage-number-of-parts-number-of-rows-in-systemparts-and-marks-across-databases) to monitor your disk storage on a particular cluster.

AWS users should be aware that default general purpose EBS volumes have a 16TB limit.
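
As a quick first check, `system.disks` reports the free and total space for every disk ClickHouse is configured to use. A minimal sketch (the 80% threshold is illustrative, not an official recommendation):

```sql
SELECT
    name,
    path,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total,
    round(100.0 * (1 - free_space / total_space), 2) AS used_pct
FROM system.disks
WHERE free_space / total_space < 0.2;  -- flag disks that are over ~80% full
```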

### Too many parts error {#too-many-parts-error}

Small, frequent inserts create performance problems. The community has identified that insert rates above 10 per second often trigger "too many parts" errors because ClickHouse cannot merge parts fast enough.

**Solutions:**
- Batch data using 30-second or 200MB thresholds
- Enable async_insert for automatic batching
- Use buffer tables for server-side batching (see the sketch below)
- Configure Kafka for controlled batch sizes

[Official recommendation](/best-practices/selecting-an-insert-strategy#batch-inserts-if-synchronous): minimum 1,000 rows per insert, ideally 10,000 to 100,000.
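
For the buffer-table option, a [Buffer](/engines/table-engines/special/buffer) table absorbs small inserts in memory and flushes them to the destination table once a time, row, or byte threshold is reached. A sketch with hypothetical table names and thresholds:

```sql
-- Hypothetical example: small inserts go to events_buffer, which flushes
-- to events once any max threshold (or all min thresholds) is reached.
CREATE TABLE events_buffer AS events
ENGINE = Buffer(currentDatabase(), 'events', 16,
    10, 100,                -- min/max seconds before flushing
    10000, 1000000,         -- min/max rows before flushing
    10000000, 100000000);   -- min/max bytes before flushing
```

Keep in mind that data sitting in the buffer is lost on an abnormal server restart, so this trades some durability for fewer parts.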

### Invalid timestamp issues {#data-quality-issues}

Applications that send data with arbitrary timestamps create partition problems. This leads to partitions holding data from unrealistic dates (like 1998 or 2050), causing unexpected storage behavior.
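
One way to catch this is to scan part-level time ranges for dates outside a plausible window. A minimal sketch, assuming the table is partitioned by a time column (otherwise `min_time`/`max_time` stay at zero) and using illustrative bounds:

```sql
SELECT database, table, partition, min_time, max_time
FROM system.parts
WHERE active
  AND min_time != toDateTime(0)  -- skip parts without time-based partitioning info
  AND (min_time < toDateTime('2000-01-01 00:00:00')
       OR max_time > now() + INTERVAL 1 DAY)
ORDER BY min_time;
```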

### `ALTER` operation risks {#alter-operation-risks}

Large `ALTER` operations on multi-terabyte tables can consume significant resources and potentially lock databases. One community example involved changing an Integer column to a Float on 14TB of data, which locked the entire database and required rebuilding from backups.

**Monitor expensive mutations:**

```sql
SELECT database, table, mutation_id, command, parts_to_do, is_done
FROM system.mutations
WHERE is_done = 0;
```

Test schema changes on smaller datasets first.
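
If a mutation is clearly stuck and you decide to abandon it rather than wait, it can be cancelled with `KILL MUTATION`. The identifiers below are placeholders; note that parts already rewritten keep the change, so the table can be left partially mutated:

```sql
-- Cancel a specific stuck mutation (replace database, table, and mutation_id)
KILL MUTATION
WHERE database = 'default' AND table = 'large_table' AND mutation_id = 'mutation_123.txt';
```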

## Memory and performance {#memory-and-performance}

### External aggregation {#external-aggregation}

Enable external aggregation for memory-intensive operations. It is slower because it spills to disk, but it prevents out-of-memory crashes on large `GROUP BY` operations. Set `max_bytes_before_external_group_by` to control when spilling kicks in; you can learn more about this setting [here](/operations/settings/settings#max_bytes_before_external_group_by).

```sql
SELECT
    column1,
    column2,
    COUNT(*) AS count,
    SUM(value) AS total
FROM large_table
GROUP BY column1, column2
SETTINGS max_bytes_before_external_group_by = 1000000000; -- 1GB threshold
```

### Async insert details {#async-insert-details}

Async insert automatically batches small inserts server-side to improve performance. You can configure whether to wait for the data to be written to disk before the acknowledgment is returned: returning immediately is faster but less durable. Modern versions support deduplication to handle duplicate data within batches.
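
Both behaviors are controlled per query or per session; a minimal sketch against a hypothetical `events` table:

```sql
-- Enable server-side batching; wait_for_async_insert controls durability:
-- 1 = acknowledge only after the batch is flushed, 0 = acknowledge immediately
INSERT INTO events
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES ('2024-01-01 00:00:00', 'page_view');
```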

**Related docs**
- [Selecting an insert strategy](/best-practices/selecting-an-insert-strategy#asynchronous-inserts)

### Distributed table configuration {#distributed-table-configuration}

By default, distributed tables queue inserts locally and forward them to the shards asynchronously in the background. Enable `insert_distributed_sync` to send data to the shards as part of the `INSERT` itself.

Monitor temporary data accumulation when using distributed tables.
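
Data that has been queued but not yet forwarded to the shards is visible in `system.distribution_queue`; a minimal sketch:

```sql
SELECT
    database,
    table,
    data_files,
    formatReadableSize(data_compressed_bytes) AS pending_data,
    error_count,
    last_exception
FROM system.distribution_queue
ORDER BY data_compressed_bytes DESC;
```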

### Performance monitoring thresholds {#performance-monitoring-thresholds}

Community-recommended monitoring thresholds:
- Parts per partition: preferably less than 100
- Delayed inserts: should stay at zero
- Insert rate: limit to about 1 per second for optimal performance
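
A sketch for checking the first two thresholds (counters in `system.events` are cumulative since server start):

```sql
-- Parts per partition: flag partitions approaching the ~100-part guideline
SELECT database, table, partition, count() AS parts_in_partition
FROM system.parts
WHERE active
GROUP BY database, table, partition
HAVING parts_in_partition > 100
ORDER BY parts_in_partition DESC;

-- Delayed/rejected inserts since server start (non-zero values deserve a look)
SELECT event, value
FROM system.events
WHERE event IN ('DelayedInserts', 'RejectedInserts');
```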

**Related docs**
- [Custom partitioning key](/engines/table-engines/mergetree-family/custom-partitioning-key)

## Quick reference {#quick-reference}

| Issue | Detection | Solution |
|-------|-----------|----------|
| Disk Space | Check `system.parts` total bytes | Monitor usage, plan scaling |
| Too Many Parts | Count parts per table | Batch inserts, enable async_insert |
| Replication Lag | Check `system.replicas` delay | Monitor network, restart replicas |
| Bad Data | Validate partition dates | Implement timestamp validation |
| Stuck Mutations | Check `system.mutations` status | Test on small data first |

### Video sources {#video-sources}
- [10 Lessons from Operating ClickHouse](https://www.youtube.com/watch?v=liTgGiTuhJE)
- [Fast, Concurrent, and Consistent Asynchronous INSERTS in ClickHouse](https://www.youtube.com/watch?v=AsMPEfN5QtM)