CORE-8928: Introduce redpanda.iceberg.target.lag.ms topic property #25056

Open
wants to merge 2 commits into
base: dev
from

Conversation

oleiman
Member

@oleiman oleiman commented Feb 6, 2025

No-op for now, but we'll use this once translation is ported over to the new scheduler.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Improvements

  • Introduce iceberg_target_lag_ms topic property

@oleiman oleiman self-assigned this Feb 6, 2025
@oleiman
Member Author

oleiman commented Feb 6, 2025

/dt

@oleiman oleiman force-pushed the dlib/core-8928/target-lag-topic-prop branch from ddb8dc0 to be5cd6a Compare February 7, 2025 00:11
@oleiman oleiman changed the title from "cluster: iceberg_target_lag_ms topic property" to "CORE-8928: Introduce redpanda.iceberg.target.lag.ms topic property" Feb 7, 2025
@oleiman oleiman force-pushed the dlib/core-8928/target-lag-topic-prop branch from be5cd6a to 6b6ed8e Compare February 7, 2025 00:35
@oleiman
Member Author

oleiman commented Feb 7, 2025

/dt

@oleiman oleiman force-pushed the dlib/core-8928/target-lag-topic-prop branch from 6b6ed8e to 22a9bb3 Compare February 7, 2025 03:32
@oleiman
Member Author

oleiman commented Feb 7, 2025

/dt

@oleiman oleiman force-pushed the dlib/core-8928/target-lag-topic-prop branch from 22a9bb3 to d6f0245 Compare February 7, 2025 06:04
@oleiman
Member Author

oleiman commented Feb 7, 2025

/dt

@vbotbuildovich
Collaborator

vbotbuildovich commented Feb 7, 2025

Retry command for Build#61716

please wait until all jobs are finished before running the slash command



/ci-repeat 1
tests/rptest/tests/cluster_config_test.py::ClusterConfigTest.test_valid_settings

@vbotbuildovich
Collaborator

vbotbuildovich commented Feb 7, 2025

CI test results

test results on build#61716

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.tests.cluster_config_test.ClusterConfigTest.test_valid_settings | ducktape | https://buildkite.com/redpanda/redpanda/builds/61716#0194df68-ff14-44cc-baa5-cf24c04a8e79 | FAIL | 0/20 |
| rptest.tests.cluster_config_test.ClusterConfigTest.test_valid_settings | ducktape | https://buildkite.com/redpanda/redpanda/builds/61716#0194df6c-990f-4d71-9059-8fd636b80940 | FAIL | 0/20 |
| rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery | ducktape | https://buildkite.com/redpanda/redpanda/builds/61716#0194df68-ff15-45cb-9951-725a5547f25a | FLAKY | 1/2 |
| rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade | ducktape | https://buildkite.com/redpanda/redpanda/builds/61716#0194df68-ff16-421e-9761-d628ad70ddff | FLAKY | 1/2 |
| rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_HADOOP | ducktape | https://buildkite.com/redpanda/redpanda/builds/61716#0194df6c-9910-44ba-a87a-5957140c30b1 | FLAKY | 1/2 |
| rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=0.cloud_storage_type=CloudStorageType.S3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/61716#0194df6c-9910-44ba-a87a-5957140c30b1 | FLAKY | 1/2 |
| rptest.transactions.producers_api_test.ProducersAdminAPITest.test_producers_state_api_during_load | ducktape | https://buildkite.com/redpanda/redpanda/builds/61716#0194df68-ff15-45cb-9951-725a5547f25a | FLAKY | 1/2 |

test results on build#61737

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.tests.cluster_config_test.ClusterConfigTest.test_valid_settings | ducktape | https://buildkite.com/redpanda/redpanda/builds/61737#0194e1a2-3591-4647-8f52-f5a2bfb9c69a | FAIL | 0/20 |
| rptest.tests.cluster_config_test.ClusterConfigTest.test_valid_settings | ducktape | https://buildkite.com/redpanda/redpanda/builds/61737#0194e1a7-b163-45d0-8605-095bc1cb6458 | FAIL | 0/20 |
| rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade | ducktape | https://buildkite.com/redpanda/redpanda/builds/61737#0194e1a2-3593-4acc-9b9a-b340d60a3243 | FLAKY | 1/2 |
| rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_JDBC | ducktape | https://buildkite.com/redpanda/redpanda/builds/61737#0194e1a7-b162-4aee-a6bb-27aa58800718 | FLAKY | 1/3 |
| rptest.transactions.consumer_offsets_test.VerifyConsumerOffsets.test_consumer_group_offsets | ducktape | https://buildkite.com/redpanda/redpanda/builds/61737#0194e1a2-3590-44d0-8ca0-3630c73dad33 | FLAKY | 1/3 |

test results on build#61744

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade | ducktape | https://buildkite.com/redpanda/redpanda/builds/61744#0194e298-ec17-47b0-8b47-a842471064c9 | FLAKY | 1/2 |
| rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_HADOOP | ducktape | https://buildkite.com/redpanda/redpanda/builds/61744#0194e2b4-2d3b-478a-bb8d-42f11b95d41f | FLAKY | 1/4 |
| rptest.tests.datalake.datalake_dlq_test.DatalakeDLQTest.test_invalid_record_action.cloud_storage_type=CloudStorageType.S3.query_engine=QueryEngineType.SPARK.use_topic_property=True.action=drop | ducktape | https://buildkite.com/redpanda/redpanda/builds/61744#0194e298-ec17-49f9-b328-0eca8eaeca69 | FLAKY | 1/2 |
| rptest.tests.topic_recovery_test.TopicRecoveryTest.test_prevent_recovery.cloud_storage_type=CloudStorageType.S3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/61744#0194e298-ec15-468e-83e1-abd20177be90 | FLAKY | 1/2 |

test results on build#61777

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade | ducktape | https://buildkite.com/redpanda/redpanda/builds/61777#0194f145-135a-4aab-a43a-5130e560491d | FLAKY | 1/2 |
| rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_JDBC | ducktape | https://buildkite.com/redpanda/redpanda/builds/61777#0194f14a-2ebc-453b-85f7-a7dd87f8a74a | FLAKY | 1/2 |
| rptest.tests.e2e_shadow_indexing_test.ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True.cloud_storage_type=CloudStorageType.ABS | ducktape | https://buildkite.com/redpanda/redpanda/builds/61777#0194f14a-2ebf-4ad7-9645-5259e9cc5d22 | FLAKY | 1/2 |

@oleiman oleiman force-pushed the dlib/core-8928/target-lag-topic-prop branch from d6f0245 to c50d858 Compare February 7, 2025 16:59
@oleiman oleiman requested a review from bharathv February 7, 2025 17:01
@oleiman oleiman marked this pull request as ready for review February 7, 2025 17:01
@oleiman oleiman requested a review from a team as a code owner February 7, 2025 17:01
@oleiman
Member Author

oleiman commented Feb 7, 2025

@bharathv - I have a few todos sprinkled in here related to property bounds. Currently I have it set up to match the commit interval cluster config, but that doesn't feel quite right. wdyt?

"Default value for the redpanda.iceberg.target.lag.ms topic property",
{.needs_restart = needs_restart::no, .visibility = visibility::user},
// TODO(oren): better default?
std::chrono::milliseconds(1min),
Contributor

My gut feeling is the default should be a bit higher, at least 5mins or so, just to give flexibility to the scheduler to schedule in more interesting ways rather than just missing deadlines all the time. This needs an empirical evaluation once all the parts are hooked up.

Also, purely from a user standpoint, my understanding is that very few users really want to land data within a minute. My guess is that the target is typically hours (I could be wrong here), but the default should be on the order of minutes at least.

Member Author

5mins or so...to give flexibility to the scheduler

makes sense. by the same token, should the minimum value be a couple orders of magnitude higher? not sure how much of a footgun it is to allow setting it to 10ms, but it doesn't seem like a realistic choice in any case.

Contributor

yes, I agree.

, iceberg_target_lag_ms(
*this,
"iceberg_target_lag_ms",
"Default value for the redpanda.iceberg.target.lag.ms topic property",
Contributor

maybe add a little more detail on what the property controls?

@@ -150,6 +153,8 @@ bool topic_properties::requires_remote_erase() const {
&& !read_replica.value_or(false) && remote_delete;
}

// TODO(oren): need a check somewhere s.t. if iceberg is enabled we populate
Contributor

I did something like this in the past

Users of this property get ntp_config from the raft->log instance

Member Author

nice. i think in this case doing nothing is probably fine. I assume we'll still have something like this once your port is done?

start_translator(
partition,
topic_cfg->properties.iceberg_mode,
topic_cfg->properties.iceberg_invalid_record_action.value_or(
config::shard_local_cfg().iceberg_invalid_record_action.value()));

then just wire the topic config in there?

Contributor

yes exactly.
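
The fallback pattern discussed in this thread, where a per-topic value overrides a cluster-wide default, could be sketched roughly as follows. This is a minimal, self-contained illustration and not the Redpanda code: `cluster_defaults`, `topic_props`, and `effective_target_lag` are hypothetical names, and the 5min default is simply the value suggested in review.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using namespace std::chrono_literals;

// Hypothetical stand-in for the cluster-level default
// (the role played by the iceberg_target_lag_ms cluster config).
struct cluster_defaults {
    std::chrono::milliseconds iceberg_target_lag{5min};
};

// Hypothetical stand-in for per-topic properties.
// std::nullopt means the topic did not set redpanda.iceberg.target.lag.ms.
struct topic_props {
    std::optional<std::chrono::milliseconds> iceberg_target_lag;
};

// Resolve the effective lag: the topic override wins, otherwise the
// cluster default applies. Mirrors the value_or() call quoted above.
inline std::chrono::milliseconds effective_target_lag(
  const topic_props& topic, const cluster_defaults& cluster) {
    return topic.iceberg_target_lag.value_or(cluster.iceberg_target_lag);
}

// Small usage example exercising both the unset and overridden cases.
inline void fallback_example() {
    cluster_defaults cluster;
    topic_props unset;
    topic_props overridden{30s};
    assert(effective_target_lag(unset, cluster) == 5min);
    assert(effective_target_lag(overridden, cluster) == 30s);
}
```

Resolving the value at the call site keeps the topic property optional, so a later change to the cluster default still reaches every topic that never set its own value.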

src/v/cluster/types.cc Outdated
ConfigProperty(
config_type="LONG",
value="10",
doc_string="Something just tell me by failing",
Contributor

:D

Member Author

lol. and of course it didn't fail, so I forgot about it!

Contributor

:D, adding new properties is surprisingly complicated, so many places to update.

Member Author

oh, i think it's even a bit weirder than that. At least half of the ConfigPropertys here never appear in the describe response as formulated, so they are just dead code afaict.

Looks to be a case of many properties not appearing in describe responses unless explicitly requested? I'll see whether I can fix it up async with this diff.

@vbotbuildovich
Collaborator

vbotbuildovich commented Feb 7, 2025

Retry command for Build#61737

please wait until all jobs are finished before running the slash command


/ci-repeat 1
tests/rptest/tests/cluster_config_test.py::ClusterConfigTest.test_valid_settings

Contributor

@bharathv bharathv left a comment

added @mattschumpert to the PR to sign-off on the config/property names and defaults.

@oleiman oleiman force-pushed the dlib/core-8928/target-lag-topic-prop branch from c50d858 to c1f6297 Compare February 7, 2025 21:31
@oleiman
Member Author

oleiman commented Feb 7, 2025

force push CR comments

Contributor

@bharathv bharathv left a comment

lgtm, bunch of nits.

src/v/config/configuration.cc Outdated
"effort fashion, subject to resource availability.",
{.needs_restart = needs_restart::no, .visibility = visibility::user},
std::chrono::milliseconds(5min),
{.min = 10ms, .max = serde::max_serializable_ms})
Contributor

bump the min default too?

Member Author

yeah I was going to see whether @mattschumpert would chime in, but I'll just stick it at like 10s and we can change if needed.

try {
auto val = boost::lexical_cast<std::chrono::milliseconds::rep>(
it->value.value());
return val >= 10 && val <= serde::max_serializable_ms.count();
Contributor

same, bump the default min?
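
One way to keep the topic-level check and the cluster config bounds from drifting apart is to define the minimum once and reuse it in both places. The sketch below is only an illustration under stated assumptions: `target_lag_min`, `target_lag_max`, and `is_valid_target_lag` are hypothetical names, the 10s minimum is the value floated in this thread, the maximum is a placeholder rather than `serde::max_serializable_ms`, and `std::from_chars` is swapped in for `boost::lexical_cast` so the example has no external dependencies.

```cpp
#include <cassert>
#include <charconv>
#include <chrono>
#include <cstdint>
#include <string_view>
#include <system_error>

using namespace std::chrono_literals;

// Assumed bounds for illustration only. The 10s minimum comes from the
// review discussion; the maximum is a placeholder (one year), NOT the
// real serde::max_serializable_ms.
inline constexpr std::chrono::milliseconds target_lag_min = 10s;
inline constexpr std::chrono::milliseconds target_lag_max
  = std::chrono::hours(24 * 365);

// Returns true when `raw` is a whole decimal number of milliseconds
// that falls inside [target_lag_min, target_lag_max].
inline bool is_valid_target_lag(std::string_view raw) {
    std::int64_t val = 0;
    const char* first = raw.data();
    const char* last = raw.data() + raw.size();
    auto [ptr, ec] = std::from_chars(first, last, val);
    // Reject empty input, non-numeric input, overflow, and trailing text.
    if (ec != std::errc{} || ptr != last) {
        return false;
    }
    return val >= target_lag_min.count() && val <= target_lag_max.count();
}

// Small usage example: 10ms is rejected once the minimum is raised.
inline void validator_example() {
    assert(is_valid_target_lag("300000")); // 5min
    assert(!is_valid_target_lag("10"));    // below the assumed 10s minimum
    assert(!is_valid_target_lag("abc"));   // not a number
}
```

With a shared constant, bumping the minimum is a one-line change instead of an edit to both the config definition and the validator.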

Default value for the corresponding topic property:

'redpanda.iceberg.target.lag.ms'
@oleiman oleiman force-pushed the dlib/core-8928/target-lag-topic-prop branch from c1f6297 to b6b8d40 Compare February 10, 2025 17:52
@oleiman oleiman requested a review from bharathv February 10, 2025 17:52
@oleiman
Member Author

oleiman commented Feb 10, 2025

CI Failure: bazel build raft_reconfiguration_test timeout, unrelated

3 participants