kafka/server: add metrics and config for consumer lag reporting #24977

Open: wants to merge 2 commits into base branch dev


@IoannisRP (Contributor) commented Jan 29, 2025

Implements: https://redpandadata.atlassian.net/browse/CORE-8914

Introduce "enable_consumer_group_lag_metrics" which controls whether the consumer lag metrics are active. This can be changed without needing a restart.

Introduce the metrics scaffolding needed to have metrics that can be enabled/disabled at runtime.
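
As a rough illustration of what "enabled/disabled at runtime" can look like, here is a minimal, self-contained sketch using only the standard library. The `bool_property`, `metric_registry`, and `group_lag_probe` types are hypothetical stand-ins written for this example; they are not the Redpanda config binding or the Seastar metrics API, and the real probe may be structured differently.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a live-updatable boolean config property.
class bool_property {
public:
    explicit bool_property(bool initial) : _value(initial) {}

    bool value() const { return _value; }

    // Register a callback that runs whenever the value changes.
    void watch(std::function<void(bool)> cb) {
        _watchers.push_back(std::move(cb));
    }

    void set(bool v) {
        if (v == _value) {
            return; // no change, so watchers are not notified
        }
        _value = v;
        for (auto& cb : _watchers) {
            cb(_value);
        }
    }

private:
    bool _value;
    std::vector<std::function<void(bool)>> _watchers;
};

// Hypothetical stand-in for a metrics registry: gauge name -> getter.
using metric_registry = std::map<std::string, std::function<std::int64_t()>>;

struct lag_metrics {
    std::int64_t sum;
    std::int64_t max;
};

// Registers the lag gauges only while the property is true.
// The probe must outlive the property's watcher list and must not be copied,
// because the watcher and the gauge getters capture `this`.
class group_lag_probe {
public:
    group_lag_probe(metric_registry& registry, bool_property& enabled)
      : _registry(registry) {
        enabled.watch([this](bool on) {
            if (on) {
                setup();
            } else {
                clear();
            }
        });
        if (enabled.value()) {
            setup();
        }
    }

    group_lag_probe(const group_lag_probe&) = delete;
    group_lag_probe& operator=(const group_lag_probe&) = delete;

    void update(lag_metrics m) { _lag_metrics = m; }

private:
    void setup() {
        assert(!_registered); // double registration would be a bug
        _registry["lag_sum"] = [this] { return _lag_metrics.sum; };
        _registry["lag_max"] = [this] { return _lag_metrics.max; };
        _registered = true;
    }

    void clear() {
        _registry.erase("lag_sum");
        _registry.erase("lag_max");
        _registered = false;
    }

    metric_registry& _registry;
    lag_metrics _lag_metrics{0, 0};
    bool _registered{false};
};
```

Flipping the property with `enabled.set(true)` registers both gauges and `enabled.set(false)` removes them again, with no restart of the owning object.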

| Metric | Type | Description | Labels | Aggregation labels |
|---|---|---|---|---|
| redpanda_kafka_consumer_group_lag_max | gauge | Maximum consumer group lag across all partitions in a group | group, shard | |
| redpanda_kafka_consumer_group_lag_sum | gauge | Sum of consumer group lag for all partitions in a group | group, shard | |
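
To make the two gauges concrete, the sketch below computes a per-group sum and maximum from per-partition offsets. It assumes the usual Kafka definition of lag (partition end offset minus the group's committed offset, clamped at zero). The `partition_offsets` and `lag_summary` types and the `compute_group_lag` function are hypothetical names for illustration; this PR only adds the scaffolding, and the logic that actually populates the metrics lands in a later commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-partition input for one consumer group.
struct partition_offsets {
    std::int64_t end_offset;       // next offset to be written to the partition
    std::int64_t committed_offset; // offset the group has committed
};

struct lag_summary {
    std::int64_t sum;
    std::int64_t max;
};

// Assumption: lag = end_offset - committed_offset, never negative.
lag_summary compute_group_lag(const std::vector<partition_offsets>& partitions) {
    lag_summary out{0, 0};
    for (const auto& p : partitions) {
        assert(p.end_offset >= 0 && p.committed_offset >= 0);
        const std::int64_t lag = std::max<std::int64_t>(
          0, p.end_offset - p.committed_offset);
        out.sum += lag;
        out.max = std::max(out.max, lag);
    }
    return out;
}
```

For a group with partitions at (end 100, committed 90), (50, 50), and (30, 5), the per-partition lags are 10, 0, and 25, so the sum gauge would report 35 and the max gauge 25.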

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

@IoannisRP requested review from BenPope and a team January 29, 2025 15:13
@IoannisRP requested a review from a team as a code owner January 29, 2025 15:13
@IoannisRP changed the title from "Core 8914/consumer lag config" to "kafka/server: add metrics and config for consumer lag reporting" Jan 29, 2025
@vbotbuildovich (Collaborator) commented Jan 29, 2025

CI test results

test results on build#61359

| test_id | test_kind | job_url | test_status | passed |
|---|---|---|---|---|
| rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery | ducktape | https://buildkite.com/redpanda/redpanda/builds/61359#0194b2e4-6b16-498b-a3a6-1735b6488617 | FLAKY | 1/2 |
| rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False.mixed_versions=False.with_tiered_storage=False.with_iceberg=True.with_chunked_compaction=True.cloud_storage_type=CloudStorageType.S3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/61359#0194b2e9-78d9-4746-a266-103519385a06 | FLAKY | 1/2 |

test results on build#61509

| test_id | test_kind | job_url | test_status | passed |
|---|---|---|---|---|
| rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery | ducktape | https://buildkite.com/redpanda/redpanda/builds/61509#0194cc8d-8071-43c3-93d5-4cffec6c21a1 | FLAKY | 1/2 |
| rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic | ducktape | https://buildkite.com/redpanda/redpanda/builds/61509#0194cc8d-8071-4d90-8b79-c330b1e751c3 | FLAKY | 1/3 |

@michael-redpanda (Contributor) left a comment

nice!

src/v/kafka/server/group.h (resolved)
tests/rptest/tests/consumer_group_test.py (outdated, resolved)
src/v/kafka/server/group_probe.h (outdated, resolved)
Comment on lines +170 to +174
"Sum of consumer group lag for all partitions in a group"),
labels),
sm::make_gauge(
"lag_max",
[this] { return _lag_metrics.max; },
sm::description(
"Maximum consumer group lag across all partitions in a group"),
Member

Maybe one for docs, but topic-partitions might be more easily understood.

Contributor Author

Like this?

Suggested change
- "Sum of consumer group lag for all partitions in a group"),
- labels),
- sm::make_gauge(
- "lag_max",
- [this] { return _lag_metrics.max; },
- sm::description(
- "Maximum consumer group lag across all partitions in a group"),
+ "Sum of consumer group lag for all topic-partitions"),
+ labels),
+ sm::make_gauge(
+ "lag_max",
+ [this] { return _lag_metrics.max; },
+ sm::description(
+ "Maximum consumer group lag across topic-partitions"),


I agree.

src/v/kafka/server/group_probe.h (outdated, resolved)
@@ -196,6 +196,7 @@ struct configuration final : public config_store {
property<bool> disable_metrics;
property<bool> disable_public_metrics;
property<bool> aggregate_metrics;
property<bool> enable_consumer_group_lag_metrics;
@BenPope (Member) commented Feb 3, 2025

I wonder if we should have more than a bool here.

I see a use for:

  • partition-level metrics (what we have with redpanda_kafka_consumer_group_committed_offset)
  • consumer_lag (this)
  • both

enable_group_metrics is hard-coded as on (off in some tests), so it could be wrapped up.

Perhaps enable_group_metrics with options partition, group. Not sure how to spell the option both, if, perhaps, we wanted to add topic one day. Maybe the accepted values could be something like partition|lag.
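
One way to picture the `partition|lag` idea is a small bit-flag parser, sketched below. Everything here is hypothetical and only illustrates the suggestion: the flag constants, the `parse_group_metrics` function, and the accepted spellings are assumptions, not an existing Redpanda config type.

```cpp
#include <sstream>
#include <string>

// Hypothetical flag values for the categories discussed above.
constexpr unsigned metric_partition = 1u << 0;
constexpr unsigned metric_lag = 1u << 1;

// Parses a '|'-separated list such as "partition|lag" into a bitmask.
// Returns false (and leaves `out` untouched) on an unknown or empty token.
// An empty string parses as "no group metrics" (mask 0).
bool parse_group_metrics(const std::string& value, unsigned& out) {
    unsigned flags = 0;
    std::istringstream in(value);
    std::string token;
    while (std::getline(in, token, '|')) {
        if (token == "partition") {
            flags |= metric_partition;
        } else if (token == "lag") {
            flags |= metric_lag;
        } else {
            return false;
        }
    }
    out = flags;
    return true;
}
```

With this shape, "both" is simply `partition|lag`, and a future `topic` category would only need one more flag constant and one more branch.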

Contributor

This sounds like something maybe we can tackle in future metrics rework?

@IoannisRP IoannisRP force-pushed the CORE-8914/consumer_lag_config branch from 76291af to 53aa756 Compare February 3, 2025 13:36
config: Add consumer lag metric config

Note that this commit contains only the metric infrastructure, i.e. the
probe and the mechanism to dynamically enable/disable these metrics.
A subsequent commit will implement the logic to populate the consumer
lag metrics data.
@IoannisRP IoannisRP force-pushed the CORE-8914/consumer_lag_config branch from 53aa756 to b880e01 Compare February 3, 2025 14:38
@IoannisRP (Contributor Author) commented

Changes in force-push:

  • rebase to dev

Changes in force-push:

  • merged new config to commit that uses it
  • removed value read on construction for _disable_public_metrics
  • added comment in ducktape test wait_until

, enable_consumer_group_lag_metrics(
*this,
"enable_consumer_group_lag_metrics",
"Enable registering metrics for consumer group lag exposed on "
@asimms41 commented Feb 7, 2025

Suggested change
- "Enable registering metrics for consumer group lag exposed on "
+ "Enable metrics for consumer group lag exposed on "
