feat: Validate initial overrides values against column-level validation rules during schema sampling #166

MoritzPotthoffQC · 2025-10-10T19:02:16Z

Motivation

When sampling data frames from schemas, I often run into sampling operations that fail because some of my overrides violate column rules (e.g., strings that do not match a column's regex). In such cases, it is hard to tell these easy-to-fix mistakes apart from situations in which the schema is simply hard to sample and needs more overrides. We cannot validate the overrides against the entire schema, since that would typically fail, but we can at least check them against the column-level rules, which makes such mistakes much easier to spot.

Changes

Refactored parts of the filtering logic to make them accessible to be reused for sampling
Added a step to sampling to check the initial overrides against column rules
Added test
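To illustrate the idea of the new step, here is a minimal sketch in plain Python. It is not the code from this PR and does not use the dataframely API: `ColumnRules`, `check_overrides`, and the example columns are hypothetical names chosen for illustration, and rules are modelled as simple per-value predicates.

```python
import re
from collections.abc import Callable, Mapping, Sequence

# Hypothetical stand-in for column-level rules:
# column name -> {rule name -> predicate applied to a single value}.
ColumnRules = Mapping[str, Mapping[str, Callable[[object], bool]]]


def check_overrides(
    overrides: Mapping[str, Sequence[object]], rules: ColumnRules
) -> dict[str, int]:
    """Count, per "column|rule", how many override values violate the rule."""
    failures: dict[str, int] = {}
    for column, values in overrides.items():
        # Columns without rules are skipped; only column-level rules are checked.
        for rule_name, predicate in rules.get(column, {}).items():
            bad = sum(1 for value in values if not predicate(value))
            if bad:
                failures[f"{column}|{rule_name}"] = bad
    return failures


# Example data (assumed for illustration only).
rules: ColumnRules = {
    "zip_code": {"regex": lambda v: re.fullmatch(r"[0-9]{5}", str(v)) is not None},
    "age": {"min": lambda v: v >= 0},
}
overrides = {"zip_code": ["01234", "1234a"], "age": [30, 41]}

failures = check_overrides(overrides, rules)
if failures:
    print(f"Overrides violate column rules: {failures}")
```

Running such a check before the first sampling iteration lets a non-compliant override surface immediately instead of after the iteration budget is exhausted.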

codecov · 2025-10-10T19:37:31Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (7b3be2d) to head (189ed37).

Additional details and impacted files

@@            Coverage Diff            @@
##              main      #166   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           51        51           
  Lines         2907      2919   +12     
=========================================
+ Hits          2907      2919   +12

dataframely/schema.py

borchero · 2025-10-10T20:53:51Z

Thanks for starting work on this, this has been bothering me forever! :)

One potential thought: we could lower the default sampling iterations (to reach the maximum faster) and raise the validation error from the very last iteration, i.e. run validate if we run out of iterations and raise from there. Wdyt?
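A rough sketch of this alternative, assuming a simple rejection-sampling loop: the loop remembers which rules failed in the most recent iteration and reports them once the iteration budget is used up. `SamplingError`, `sample_value`, and the default of 100 iterations are hypothetical and only serve the illustration; this is not dataframely's implementation.

```python
import random
from collections.abc import Callable


class SamplingError(Exception):
    """Hypothetical error type for this sketch."""


def sample_value(
    generate: Callable[[random.Random], int],
    rules: dict[str, Callable[[int], bool]],
    max_iterations: int = 100,
    seed: int = 0,
) -> int:
    """Draw candidates until one satisfies all rules or the budget runs out."""
    rng = random.Random(seed)
    failed: list[str] = []
    for _ in range(max_iterations):
        candidate = generate(rng)
        # Validate the candidate and remember which rules it violates.
        failed = [name for name, rule in rules.items() if not rule(candidate)]
        if not failed:
            return candidate
    # Out of iterations: raise with the validation result of the last attempt.
    raise SamplingError(
        f"no valid sample after {max_iterations} iterations; "
        f"rules failing in the last iteration: {failed}"
    )


# A satisfiable rule returns a valid value.
value = sample_value(lambda rng: rng.randint(0, 9), {"even": lambda v: v % 2 == 0})

# An unsatisfiable rule is named in the error once the budget is exhausted.
try:
    sample_value(
        lambda rng: rng.randint(0, 9), {"negative": lambda v: v < 0}, max_iterations=5
    )
except SamplingError as exc:
    message = str(exc)
```

With a low iteration limit, the error arrives quickly and names the rules that still fail, which covers non-compliant overrides as well as schemas that are hard to sample.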

MoritzPotthoffQC · 2025-10-13T07:58:48Z

> One potential thought: we could lower the default sampling iterations (to reach the maximum faster) and raise the validation error from the very last iteration, i.e. run validate if we run out of iterations and raise from there. Wdyt?

Interesting! I like the idea of validating when we reach the maximum number of iterations because that could give the user nice hints as to which rules they need to fix (independently of the specific issues in this PR).

I am not sure that lowering the default number of sampling iterations would make the changes in this PR unnecessary, but I also lack insight into how many sampling iterations are usually needed. In some instances, hitting the maximum number of iterations has taken quite some time for me, so to get quick feedback we would have to lower the default by a lot. That would reduce code complexity (because the changes here would not be needed), but if a few thousand iterations usually help, I would prefer to also keep the fix here and get immediate feedback.

borchero · 2025-10-13T08:02:07Z

I'd say that, currently, the max iterations are unreasonably high (10000). I have the feeling that we rarely need more than 100 iterations (and if you do, it's usually so slow that you want to fix it regardless). IMO, this would be sufficiently fast to reduce complexity here 🤔

MoritzPotthoffQC · 2025-10-13T14:29:01Z

> I'd say that, currently, the max iterations are unreasonably high (10000). I have the feeling that we rarely need more than 100 iterations (and if you do, it's usually so slow that you want to fix it regardless). IMO, this would be sufficiently fast to reduce complexity here 🤔

I checked how many iterations my tests typically need. With some additional hints that I had already added to make sampling reasonably fast, I needed up to ~600 iterations in rare cases; without those hints, it was up to 4000. To be fair, it is a fairly complex schema. So I would maybe put the default a bit higher, but fair point. I will try that out in a separate PR.

MoritzPotthoffQC · 2025-10-14T07:25:05Z

Superseded by #167

init

1f03f1e

MoritzPotthoffQC self-assigned this Oct 10, 2025

github-actions bot added the enhancement New feature or request label Oct 10, 2025

MoritzPotthoffQC changed the title ~~feat: Validate initial vales of the overrides against column-level validation rules during schema sampling~~ feat: Validate initial overrides values against column-level validation rules during schema sampling Oct 10, 2025

refactor

7fa26eb

refactor

9e9c437

MoritzPotthoffQC commented Oct 10, 2025

dataframely/schema.py (resolved)

MoritzPotthoffQC added 2 commits October 10, 2025 22:01

docs

d745ab7

More tests

321c64f

MoritzPotthoffQC marked this pull request as ready for review October 10, 2025 20:16

MoritzPotthoffQC requested review from AndreasAlbertQC, borchero and delsner as code owners October 10, 2025 20:16

Merge remote-tracking branch 'origin/main' into schema-sampling-valid…

189ed37

…ate-initial-overrides

MoritzPotthoffQC marked this pull request as draft October 13, 2025 14:30

MoritzPotthoffQC mentioned this pull request Oct 13, 2025

feat: Include validation failure information in exception after sampling exceeded maximum iterations #167

Open

MoritzPotthoffQC closed this Oct 14, 2025

MoritzPotthoffQC deleted the schema-sampling-validate-initial-overrides branch October 14, 2025 07:25
