feat: Transform strings with respect to property schema when conforming properties #2997

Open: 15 commits to be merged into base branch main

ReubenFrankel (Contributor) commented Apr 22, 2025

Adds support for transforming string values to their defined property schema types.


Related discussion: #2994
Closes #3014

Summary by Sourcery

Add automatic type conversion for string properties based on their schema during data conforming.

New Features:

  • Transform string values to boolean, integer, number, array, or object types according to the property's JSON schema definition.
  • Handle empty strings appropriately for each target type (e.g., 0 for integer/number, False for boolean, empty list/dict for array/object).
  • Treat 'inf' and 'nan' strings as null when converting to numbers.
  • Parse JSON strings when converting to array or object types.
  • Perform case-insensitive matching for 'true' when converting to boolean; other strings become False.
  • Return None for empty strings if the target schema is nullable and is not itself a string type.
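
The rules listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual `_transform_string_property`: the function name `transform_string`, the order of the checks, and the way the schema's `type` is read are all hypothetical. Strings that cannot be converted (for example invalid JSON) are left to raise, since the expected behavior for that case is not specified here.

```python
from __future__ import annotations

import json
import math
from typing import Any


def transform_string(value: str, schema: dict[str, Any]) -> Any:
    """Hypothetical stand-in illustrating the listed conversion rules."""
    types = schema.get("type", [])
    if isinstance(types, str):
        types = [types]

    # A string target needs no conversion at all.
    if "string" in types:
        return value

    # Empty strings become None when the (non-string) schema is nullable.
    if value == "" and "null" in types:
        return None

    if "boolean" in types:
        # Case-insensitive match on "true"; every other string is False.
        return value.lower() == "true"

    if "integer" in types:
        return int(value) if value else 0

    if "number" in types:
        if not value:
            return 0
        number = float(value)
        # "inf" and "nan" parse as floats but are treated as null.
        return number if math.isfinite(number) else None

    if "array" in types:
        return json.loads(value) if value else []

    if "object" in types:
        return json.loads(value) if value else {}

    # Unknown target type: leave the value untouched (an assumption).
    return value
```

For instance, under these assumptions `transform_string("TRUE", {"type": ["boolean"]})` yields `True`, while `transform_string("", {"type": ["integer", "null"]})` yields `None` because the nullable check runs before the integer default.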

Tests:

  • Add unit tests to cover various string transformation scenarios, including edge cases.

sourcery-ai bot (Contributor) commented Apr 22, 2025

Reviewer's Guide by Sourcery

This pull request adds support for automatically transforming string values to the data type specified by a property's JSON schema during property conformance. This is achieved by introducing a new helper function that handles the conversion logic for various target types (boolean, integer, number, array, object) and integrating it into the existing property conformance function. Extensive test cases are added to cover different string inputs and target types.

No diagrams generated as the changes look simple and do not need a visual representation.

File-Level Changes

Change: Add new helper function to transform string properties to target types.
Details:
  • Add _transform_string_property function.
  • Implement logic to convert strings to boolean, integer, number, array, and object.
  • Handle empty string inputs based on the target type.
Files: singer_sdk/helpers/_typing.py

Change: Integrate string transformation into the primitive property conformance logic.
Details:
  • Modify _conform_primitive_property to check for string input and non-string schema.
  • Call _transform_string_property when a string value needs conversion.
Files: singer_sdk/helpers/_typing.py

Change: Add comprehensive tests for string to type transformations.
Details:
  • Add new test cases to test_conform_primitives for string inputs to various types.
  • Include tests for empty string handling.
  • Add tests for case-insensitive boolean conversion.
  • Add tests for JSON string parsing for array and object types.
Files: tests/core/test_typing.py

codecov bot commented Apr 22, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.70%. Comparing base (d7e844e) to head (97eed5a).

Additional details and impacted files
@@            Coverage Diff            @@
##           main    #2997       +/-   ##
=========================================
+ Coverage      0   91.70%   +91.70%     
=========================================
  Files         0       62       +62     
  Lines         0     5330     +5330     
  Branches      0      690      +690     
=========================================
+ Hits          0     4888     +4888     
- Misses        0      311      +311     
- Partials      0      131      +131     

codspeed-hq bot commented Apr 22, 2025

CodSpeed Performance Report

Merging #2997 will not alter performance

Comparing ReubenFrankel:feat/conform-primitive-interpret-string (97eed5a) with main (d7e844e)

Summary

✅ 8 untouched benchmarks

ReubenFrankel marked this pull request as ready for review on April 25, 2025 12:26

sourcery-ai bot (Contributor) left a comment

Hey @ReubenFrankel - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider refactoring the _transform_string_property function to delegate type-specific transformations to smaller helper functions, improving clarity and maintainability.
  • Define the expected behavior when a non-empty string cannot be successfully transformed into the target schema type (e.g., invalid JSON).
Here's what I looked at during the review
  • 🟡 General issues: 2 issues found
  • 🟢 Security: all looks good
  • 🟡 Testing: 3 issues found
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good

edgarrmondragon (Collaborator) left a comment

Thanks @ReubenFrankel!

My main concern with this would've been performance, but the benchmark seems unaffected? I could be missing something though.

ReubenFrankel (Contributor, Author) commented

My main concern with this would've been performance, but the benchmark seems unaffected? I could be missing something though.

It looks like it was slightly affected: https://codspeed.io/meltano/sdk/branches/ReubenFrankel%3Afeat%2Fconform-primitive-interpret-string?uri=tests%2Fcore%2Ftest_typing.py%3A%3Atest_bench_conform_record_data_types

Possibly due to this new check, which will check the property schema type for every str value:

if isinstance(elem, str) and not is_string_type(property_schema):

return None
return elem
return elem if math.isfinite(elem) else None
if isinstance(elem, str) and not is_string_type(property_schema):

Collaborator

(commenting here to start a thread)

It looks like it was slightly affected: https://codspeed.io/meltano/sdk/branches/ReubenFrankel%3Afeat%2Fconform-primitive-interpret-string?uri=tests%2Fcore%2Ftest_typing.py%3A%3Atest_bench_conform_record_data_types

Possibly due to this new check, which will check the property schema type for every str value:

if isinstance(elem, str) and not is_string_type(property_schema):

Do you think there's a way we could prevent this regression? Perhaps caching a mapping of property > is_.

Contributor Author

Yes, I was thinking about some kind of type caching also. We did something like that here, or were you imagining it to be per-property?

Collaborator

Yeah, something like that, or like the selection mask we use internally here.
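
One way the per-property caching discussed in this thread could look is sketched below. It is only an illustration: `build_string_type_mask` and `conform_record` are hypothetical names, `is_string_type` here is a simplified stand-in for the SDK helper of the same name, and only top-level properties are handled. The idea is to inspect each property schema once per stream schema, so the per-value check becomes a dictionary lookup.

```python
from __future__ import annotations

from typing import Any, Callable


def is_string_type(property_schema: dict[str, Any]) -> bool:
    """Simplified stand-in for the SDK helper of the same name."""
    types = property_schema.get("type", [])
    if isinstance(types, str):
        types = [types]
    return "string" in types


def build_string_type_mask(schema: dict[str, Any]) -> dict[str, bool]:
    """Compute once per schema which top-level properties accept strings."""
    return {
        name: is_string_type(property_schema)
        for name, property_schema in schema.get("properties", {}).items()
    }


def conform_record(
    record: dict[str, Any],
    schema: dict[str, Any],
    mask: dict[str, bool],
    transform: Callable[[str, dict[str, Any]], Any],
) -> dict[str, Any]:
    """Transform string values only for properties known to be non-string."""
    properties = schema.get("properties", {})
    conformed: dict[str, Any] = {}
    for name, value in record.items():
        # The schema inspection per value is replaced by a mask lookup.
        if isinstance(value, str) and name in mask and not mask[name]:
            conformed[name] = transform(value, properties[name])
        else:
            conformed[name] = value
    return conformed


schema = {"properties": {"id": {"type": ["integer"]}, "name": {"type": ["string"]}}}
mask = build_string_type_mask(schema)  # built once, reused for every record
result = conform_record({"id": "7", "name": "a"}, schema, mask, lambda v, s: int(v))
```

Whether this actually removes the benchmark regression would need to be measured; the mask also has to be rebuilt whenever the stream schema changes.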

edgarrmondragon modified the milestones: v0.46, v0.47 on May 6, 2025
ReubenFrankel force-pushed the feat/conform-primitive-interpret-string branch from e86ce26 to e2aac19 on May 10, 2025 00:08
Successfully merging this pull request may close these issues.

feat: Conform primitive properties from strings
2 participants