Skip to content

Conversation

@david-leifker
Copy link
Collaborator

@david-leifker david-leifker commented Nov 26, 2025

Improve Kafka Consumer Pool and Python Client Retry Logic

Summary

This PR improves the Kafka consumer pool's efficiency and thread-safety, and enhances the Python client's retry capabilities for handling transient connection failures.

Changes

Kafka Consumer Pool (KafkaConsumerPool.java)

Fixes:

  • Fixed race condition: Protected totalConsumersCreated decrement with poolManagementLock to prevent exceeding maxPoolSize under concurrent access
  • Fixed lock inefficiency: Optimized shutdownPool() to avoid re-acquiring locks unnecessarily
  • Removed redundant operation: Eliminated redundant activeConsumers.clear() call

Optimizations:

  • Optimized remainingTime calculation to compute once per loop iteration
  • Improved lock ordering to prevent potential deadlocks

Python Client Retry Logic (datahub_cloud_events_consumer.py)

Enhancements:

  • Added infinite_retry option to allow indefinite retries on connection failures
  • Increased default retry attempts from 3 to 15
  • Increased maximum exponential backoff from 30s to 60s
  • Maintains exponential backoff (2s to 60s) for both finite and infinite retry modes

Configuration:

  • infinite_retry: false (default): Retries up to 15 times with exponential backoff
  • infinite_retry: true: Retries indefinitely with exponential backoff until connection restored

Documentation

  • Updated datahub-cloud-event-source.md with infinite_retry configuration details
  • Added FAQ entry explaining retry behavior and how to handle transient connection failures

Impact

  • Improved reliability: Better handling of transient Kafka connectivity issues
  • Better resource utilization: More efficient consumer pool management
  • Enhanced resilience: Python client can now handle longer outages with infinite retry option
  • Thread-safety: Fixed race conditions in concurrent consumer pool operations

Includes: #15417

@github-actions github-actions bot added docs Issues and Improvements to docs devops PR or Issue related to DataHub backend & deployment labels Nov 26, 2025
@github-actions github-actions bot requested a deployment to datahub-wheels (Preview) November 26, 2025 21:09 Abandoned
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Nov 26, 2025
@codecov
Copy link

codecov bot commented Nov 26, 2025

Codecov Report

❌ Patch coverage is 91.66667% with 1 line in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...ugin/source/acryl/datahub_cloud_events_consumer.py 87.50% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@alwaysmeticulous
Copy link

alwaysmeticulous bot commented Nov 28, 2025

✅ Meticulous spotted 0 visual differences across 998 screens tested: view results.

Meticulous evaluated ~8 hours of user flows against your PR.

Expected differences? Click here. Last updated for commit c325865. This comment will update as new commits are pushed.

* Check consumers on access from kafka consumer pool
* Allow infinite retries from python actions consumer
* Locking fixes, race condition fixes with consumer pool
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

devops PR or Issue related to DataHub backend & deployment docs Issues and Improvements to docs pending-submitter-merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants