@daha daha commented Nov 28, 2025

What

Fixes a JSON serialization error raised when IngestionStageReport (and other reports) contain dictionaries keyed by tuples or enums.

Why

The DataHub GC source and other sources using stage tracking fail with:

TypeError: keys must be str, int, float, bool or None, not tuple

This occurs because:

  1. IngestionStageReport.ingestion_stage_durations uses tuple keys: (IngestionHighStage, str)
  2. IngestionStageReport.ingestion_high_stage_seconds uses enum keys: IngestionHighStage
  3. Report.to_pure_python_obj() doesn't properly convert these to JSON-compatible strings

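The JSON specification requires object keys to be strings. Python's json module coerces int, float, bool, and None keys to strings on its own, but it rejects tuple keys and plain Enum keys, so these must be converted explicitly before serialization.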
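
The failure can be reproduced outside DataHub with the standard library alone. The sketch below is illustrative only: `HighStage` is a hypothetical stand-in for IngestionHighStage, and the dictionaries merely mimic the key shapes described above.

```python
import json
from enum import Enum


class HighStage(Enum):  # hypothetical stand-in for IngestionHighStage
    PROFILING = "Profiling"


# int and None keys are coerced to strings by json.dumps itself.
coerced = json.dumps({2: "a", None: "c"})

# A tuple key, shaped like the keys of ingestion_stage_durations.
try:
    json.dumps({(HighStage.PROFILING, "table scan"): 1.5})
    tuple_error = None
except TypeError as exc:
    tuple_error = str(exc)

# A plain enum key, shaped like the keys of ingestion_high_stage_seconds.
try:
    json.dumps({HighStage.PROFILING: 1.5})
    enum_error = None
except TypeError as exc:
    enum_error = str(exc)

print(coerced)
print(tuple_error)
print(enum_error)
```

Both `try` blocks end in a TypeError, which is the same class of failure the GC source hits when its report is serialized.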

How

Updated Report.to_pure_python_obj() method in metadata-ingestion/src/datahub/ingestion/api/report.py:

  • Changed dict comprehension to explicit loop for better control
  • Added explicit handling for tuple keys: convert to string representation
  • Added explicit handling for enum keys: use .value if available, fallback to string
  • Maintained backward compatibility for existing key types (str, int, etc.)
  • Recursively processes nested structures
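
The approach in the list above could look roughly like the following. This is a minimal sketch, not the code in report.py: `convert_key`, `to_pure`, and `HighStage` are hypothetical names, and using `str(key)` for tuples is an assumption about what "string representation" means here.

```python
import json
from enum import Enum
from typing import Any


class HighStage(Enum):  # hypothetical stand-in for IngestionHighStage
    PROFILING = "Profiling"


def convert_key(key: Any) -> Any:
    """Turn a dict key into something json.dumps accepts."""
    if isinstance(key, tuple):
        # Assumption: the tuple's plain string representation is good enough.
        return str(key)
    if isinstance(key, Enum):
        value = key.value
        # Use .value when it is a JSON-friendly primitive, else fall back to str.
        if isinstance(value, (str, int, float, bool)) or value is None:
            return value
        return str(key)
    if isinstance(key, (str, int, float, bool)) or key is None:
        return key  # existing key types pass through unchanged
    return str(key)


def to_pure(obj: Any) -> Any:
    """Recursively convert nested containers, fixing keys along the way."""
    if isinstance(obj, dict):
        result = {}
        for key, value in obj.items():  # explicit loop instead of a comprehension
            result[convert_key(key)] = to_pure(value)
        return result
    if isinstance(obj, (list, tuple)):
        return [to_pure(item) for item in obj]
    if isinstance(obj, Enum):
        return obj.value  # assumption: enum values serialize by .value
    return obj


report = {
    (HighStage.PROFILING, "table scan"): 1.5,
    HighStage.PROFILING: {"nested": {(1, 2): "x"}},
    "plain": 1,
    7: "seven",
}
pure = to_pure(report)
encoded = json.dumps(pure)  # no longer raises TypeError
```

Keeping the key conversion in its own function makes the tuple and enum rules easy to test in isolation from the recursion.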

Testing

Unit Tests Added

  • Test tuple key conversion (simple and nested)
  • Test enum key conversion
  • Test TopKDict with tuple keys
  • Test actual IngestionStageReport serialization
  • Test backward compatibility with existing key types

Related Issues

Fixes #15445

Before you submit your PR, please go through the checklist below:

  • The PR conforms to DataHub's Contributing Guideline (particularly PR Title Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Nov 28, 2025
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Nov 28, 2025
@daha daha marked this pull request as draft November 28, 2025 20:03
@daha daha force-pushed the gc-report-fix branch 3 times, most recently from 6517608 to a451808 Compare November 28, 2025 21:05
@daha daha marked this pull request as ready for review November 29, 2025 11:55
- Add support for tuple keys in dict serialization (converts to string representation)
- Add support for enum keys in dict serialization (uses enum.name for consistency)
- Recursively process SupportsAsObj.as_obj() results to handle nested conversions
- Manually iterate dataclass fields instead of using dataclasses.asdict() to handle enum/tuple keys
- Fix datetime fallback to use str(some_val) instead of str(datetime class)
- Add debug logging when datetime relative formatting fails
- Improve docstring to accurately describe recursive conversion behavior
- Add comprehensive test suite covering tuple keys, enum keys, TopKDict, backward compatibility, and edge cases
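
One way to read the dataclass item above: `dataclasses.asdict()` copies dictionary keys as they are, so tuple and enum keys survive into its output and still break `json.dumps`. Walking the fields by hand lets every nested dict pass through the key conversion. The sketch below illustrates that idea under stated assumptions: `StageReport`, `HighStage`, `key_to_str`, and `as_pure` are hypothetical names, not the ones used in the PR.

```python
import dataclasses
import json
from enum import Enum
from typing import Any, Dict


class HighStage(Enum):  # hypothetical stand-in for IngestionHighStage
    INGESTION = "Ingestion"


@dataclasses.dataclass
class StageReport:  # hypothetical, loosely shaped like IngestionStageReport
    durations: Dict[Any, float] = dataclasses.field(default_factory=dict)
    high_stage_seconds: Dict[Any, float] = dataclasses.field(default_factory=dict)


def key_to_str(key: Any) -> Any:
    if isinstance(key, Enum):
        return key.name  # the commit message mentions enum.name
    if isinstance(key, tuple):
        return str(key)  # assumption: plain string representation
    return key


def as_pure(obj: Any) -> Any:
    # Iterate fields manually so nested dict keys go through key_to_str.
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        return {f.name: as_pure(getattr(obj, f.name)) for f in dataclasses.fields(obj)}
    if isinstance(obj, dict):
        return {key_to_str(k): as_pure(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [as_pure(v) for v in obj]
    if isinstance(obj, Enum):
        return obj.name
    return obj


report = StageReport(
    durations={(HighStage.INGESTION, "soft-delete cleanup"): 2.0},
    high_stage_seconds={HighStage.INGESTION: 2.0},
)

# asdict() keeps the original key objects, so the tuple key is still a tuple.
raw = dataclasses.asdict(report)
raw_has_tuple_key = any(isinstance(k, tuple) for k in raw["durations"])

pure = as_pure(report)
encoded = json.dumps(pure)  # serializes without a TypeError
```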

Development

Successfully merging this pull request may close these issues.

reporting crash when running datahub-gc source from CLI