CC caused "disk is full" for one of logDirs for broker during rebalancing. #1590

emelyanovtv · 2021-06-16T09:19:15Z

Description:

We got a "disk is full" error during rebalancing. We have 2 log dirs per broker, both the same size. While the rebalance was running, we noticed that only one disk (log dir) on the broker was being filled with new data. The disk capacity (described below) for the log dir /var/dirs/kafka/data/topics was set to 3584000 MB. After the rebalance failed with the out-of-disk-space error and we increased the disk size for this specific log dir, its size became 3644675 MB. Why can this kind of thing happen? Can you help us get a clearer explanation for this error?

Our main assumption is that this happened because we moved almost 11 TB of data among brokers, which took 2 days. Perhaps that is the root cause of this error.
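To put the volume in perspective, here is a rough back-of-the-envelope sketch of the average movement rate. It uses the "almost 11 Tb" figure above and the 179776 sec duration reported under "Current setup"; the conversion 1 TB = 1024 * 1024 MB is an assumption, and the result is a cluster-wide average, not a per-broker or per-disk rate.

```python
# Figures quoted in this issue.
MOVED_TB = 11            # "almost 11 Tb" moved among brokers
DURATION_SEC = 179_776   # run time before the rebalance failed

# Assumption: 1 TB = 1024 * 1024 MB (binary units).
MB_PER_TB = 1024 * 1024


def average_rate_mb_per_sec(moved_tb: float, duration_sec: float) -> float:
    """Average data movement rate in MB/s over the whole run."""
    return moved_tb * MB_PER_TB / duration_sec


rate = average_rate_mb_per_sec(MOVED_TB, DURATION_SEC)
hours = DURATION_SEC / 3600
print(f"{hours:.1f} h total, {rate:.1f} MB/s on average across the cluster")
```

An average like this says nothing about which log dir received the data, which is the actual question here.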

The steps that led to it:

Ran the rebalance.
Hit the limit on one broker: one of its log dirs became full. The size of that log dir on the failed broker (let's say broker-1, /var/dirs/kafka/data/topics) was 3584000 MB.
Increased the size of this specific log dir.
Restarted the broker.
Everything came up successfully; the size of the same log dir (broker-1, /var/dirs/kafka/data/topics) became 3644675 MB.

Question:

Has this happened to anyone before, and how can I avoid it?

Current setup

CC: 2.0.168
kafka: 2.3.0
the execution task was triggered with POST /kafkacruisecontrol/rebalance?json=true&dryrun=false&concurrent_partition_movements_per_broker=4&concurrent_leader_movements=10
total duration before it failed was almost 50 hours (179776 sec)
each broker has 2 log dirs.
settings

cruisecontrol.properties:

capacity.config.file=config/capacity.json

capacity.json (for all brokers we have the same settings as below)

{
              "brokerId": "N",
              "capacity": {
                  "DISK": {
                    "/var/dirs/kafka/data/topics": "3584000",
                    "/var/dirs/kafka/data1/topics": "3584000"
                  },
                  "CPU": "100",
                  "NW_IN": "10000",
                  "NW_OUT": "10000"
              },
              "doc": "Capacity unit used for disk is in MB, cpu is in percentage, network throughput is in KB."
          }
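As a quick sanity check of the entry above, the following sketch parses it and sums the per-log-dir DISK values. The function name `total_disk_mb` is hypothetical and not part of Cruise Control; the JSON is copied from this issue with the placeholder broker id "N" left as-is.

```python
import json

# The capacity.json entry from this issue ("N" is the author's placeholder).
ENTRY = json.loads("""
{
  "brokerId": "N",
  "capacity": {
    "DISK": {
      "/var/dirs/kafka/data/topics": "3584000",
      "/var/dirs/kafka/data1/topics": "3584000"
    },
    "CPU": "100",
    "NW_IN": "10000",
    "NW_OUT": "10000"
  },
  "doc": "Capacity unit used for disk is in MB, cpu is in percentage, network throughput is in KB."
}
""")


def total_disk_mb(broker_entry: dict) -> float:
    """Sum the configured capacity (MB) of every log dir of one broker."""
    disks = broker_entry["capacity"]["DISK"]
    return sum(float(mb) for mb in disks.values())


total = total_disk_mb(ENTRY)
print(f"log dirs: {len(ENTRY['capacity']['DISK'])}, total DISK capacity: {total:.0f} MB")
```

The sum, 7168000 MB, is the value that appears as DISK_CAP(MB) per broker in the load output posted later in this thread.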

If you need more info from me, I'll gladly share it.


emelyanovtv · 2021-06-28T06:50:27Z

Can anybody help with some assumptions or anything else? I'm running out of ideas.

efeg · 2021-07-01T01:08:03Z

@emelyanovtv I am curious what the load endpoint of Cruise Control shows with populate_disk_info=true.
I wonder if the same disk is used by other services, causing the capacity to be used by not only Kafka, but also some other service.
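For reference, a minimal sketch of how the request URL for that load query could be built. The host name is hypothetical and the port 9090 is an assumption about the deployment; the endpoint path prefix matches the rebalance request quoted in the issue, and `populate_disk_info=true` is the parameter named above.

```python
from urllib.parse import urlencode, urlunsplit


def load_url(host: str, port: int = 9090) -> str:
    """Build the Cruise Control load URL with per-log-dir disk info enabled."""
    query = urlencode({"json": "true", "populate_disk_info": "true"})
    return urlunsplit(("http", f"{host}:{port}", "/kafkacruisecontrol/load", query, ""))


# Hypothetical host; replace with the real Cruise Control address.
url = load_url("cruise-control.example.internal")
print(url)
```

The resulting URL can be requested with any HTTP client, for example a plain GET from curl.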

emelyanovtv · 2021-07-01T13:26:09Z

Those disks are dedicated to the Kafka brokers only.
BTW, after we got this error I ran the rebalance manually. I'll post part of the data for the brokers; I checked it and everything looks valid to me.

                       HOST         BROKER          RACK                               LOGDIR        DISK_CAP(MB)            DISK(MB)/_(%)_            CORE_NUM         CPU(%)          NW_IN_CAP(KB/s)       LEADER_NW_IN(KB/s)     FOLLOWER_NW_IN(KB/s)         NW_OUT_CAP(KB/s)        NW_OUT(KB/s)       PNW_OUT(KB/s)    LEADERS/REPLICAS
 kafka-0.broker,           200,rack-b,                                             7168000.000,        5019532.000/70.03,                  1,        71.517,               10000.000,                 223.701,                 463.693,               10000.000,           1221.138,           3717.967,           366/1128
                                                                                  /var/dirs/kafka/data/topics,        3005975.379/83.87,                                                                                                                    320/992
                                                                                 /var/dirs/kafka/data1/topics,        2013565.704/56.18,                                                                                                                     46/136
 kafka-1.broker,           201,rack-c,                                             7168000.000,        4524581.500/63.12,                  1,        77.026,               10000.000,                 225.302,                 519.606,               10000.000,           1261.675,           4125.801,           358/1121
                                                                                  /var/dirs/kafka/data/topics,        2928987.955/81.72,                                                                                                                    326/1040
                                                                                 /var/dirs/kafka/data1/topics,        1595596.534/44.52,                                                                                                                     32/81
 kafka-2.broker,           202,rack-a,                                             7168000.000,        4736796.500/66.08,                  1,        66.809,               10000.000,                 233.720,                 509.744,               10000.000,           1282.256,           4090.291,           419/1097
                                                                                  /var/dirs/kafka/data/topics,        2971493.640/82.91,                                                                                                                    388/986
                                                                                 /var/dirs/kafka/data1/topics,        1765304.770/49.26,                                                                                                                     31/111
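The imbalance between the two log dirs is easier to see when computed directly. This sketch recomputes the per-log-dir utilization from the DISK(MB) values in the output above, assuming the 3584000 MB per-log-dir capacity from capacity.json; the variable and function names are only for illustration.

```python
# Per-log-dir capacity from capacity.json (MB).
LOGDIR_CAPACITY_MB = 3_584_000

# DISK(MB) per log dir, copied from the load output above: broker -> (data, data1).
USAGE_MB = {
    200: (3005975.379, 2013565.704),
    201: (2928987.955, 1595596.534),
    202: (2971493.640, 1765304.770),
}


def utilization_pct(used_mb: float, capacity_mb: float = LOGDIR_CAPACITY_MB) -> float:
    """Disk utilization of one log dir as a percentage of its capacity."""
    return 100.0 * used_mb / capacity_mb


report = {}
for broker, (data_mb, data1_mb) in USAGE_MB.items():
    data_pct = utilization_pct(data_mb)
    data1_pct = utilization_pct(data1_mb)
    # Gap between the two log dirs of the same broker, in percentage points.
    report[broker] = (data_pct, data1_pct, data_pct - data1_pct)
    print(f"broker {broker}: data={data_pct:.2f}% data1={data1_pct:.2f}% "
          f"gap={data_pct - data1_pct:.2f} points")
```

The recomputed percentages agree with the ones Cruise Control reports, and on every broker /var/dirs/kafka/data/topics sits well above /var/dirs/kafka/data1/topics and also holds most of the replicas.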

emelyanovtv · 2021-07-12T12:34:44Z

@efeg any ideas?
