Skip to content

Conversation

MoritzPotthoffQC
Copy link
Contributor

@MoritzPotthoffQC MoritzPotthoffQC commented Oct 10, 2025

Motivation

When sampling data frames from schemas, I often find myself with sampling operations that do not run through because some overrides I used are not compliant with the column rules (e.g., strings that do not comply with regexes). In such cases, it is hard to distinguish such easy-to-fix mistakes from situations in which the schema is just hard to sample and would require more overrides. At the same time, while we cannot validate the overrides against the entire schema (as it would typically fail), we can at least check the overrides against column-level rules. This makes it much easier to spot such issues.

Changes

  • Refactored parts of the filtering logic to make them accessible to be reused for sampling
  • Added a step to sampling to check the initial overrides against column rules
  • Added test

@MoritzPotthoffQC MoritzPotthoffQC self-assigned this Oct 10, 2025
@github-actions github-actions bot added the enhancement New feature or request label Oct 10, 2025
@MoritzPotthoffQC MoritzPotthoffQC changed the title feat: Validate initial vales of the overrides against column-level validation rules during schema sampling feat: Validate initial overrides values against column-level validation rules during schema sampling Oct 10, 2025
Copy link

codecov bot commented Oct 10, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (7b3be2d) to head (189ed37).

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #166   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           51        51           
  Lines         2907      2919   +12     
=========================================
+ Hits          2907      2919   +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@MoritzPotthoffQC MoritzPotthoffQC marked this pull request as ready for review October 10, 2025 20:16
@borchero
Copy link
Member

Thanks for starting work on this, this has been bothering me forever! :)

One potential thought: we could lower the default sampling iterations (to reach the maximum faster) and raise the validation error from the very last iteration, i.e. run validate if we run out of iterations and raise from there. Wdyt?

@MoritzPotthoffQC
Copy link
Contributor Author

One potential thought: we could lower the default sampling iterations (to reach the maximum faster) and raise the validation error from the very last iteration, i.e. run validate if we run out of iterations and raise from there. Wdyt?

Interesting! I like the idea of validating when we reach the maximum number of iterations because that could give the user nice hints as to which rules they need to fix (independently of the specific issues in this PR).

I am not sure whether lowering the default sampling iterations would mean that the changes in this PR do not also help, but I also lack some insight on how many sampling iterations are usually needed: In some instances, running into the maximum number of iterations has actually taken a bit of time for me, so to get quick feedback, we would have to lower the default by a lot. That would be nice to reduce code complexity (because the changes here would not be needed), but if a few thousand iterations usually help, I would prefer to also have the fix here and get the immediate feedback.

@borchero
Copy link
Member

borchero commented Oct 13, 2025

I'd say that, currently, the max iterations are unreasonably high (10000). I have the feeling that we rarely need more than 100 iterations (and if you do, it's usually so slow that you want to fix it regardless). IMO, this would be sufficiently fast to reduce complexity here 🤔

@MoritzPotthoffQC
Copy link
Contributor Author

MoritzPotthoffQC commented Oct 13, 2025

I'd say that, currently, the max iterations are unreasonably high (10000). I have the feeling that we rarely need more than 100 iterations (and if you do, it's usually so slow that you want to fix it regardless). IMO, this would be sufficiently fast to reduce complexity here 🤔

I checked how many iterations my tests typically need. With some additional hints that I had already added for sample to be reasonable fast, I used up to ~600 iterations in rare cases, without those hints it was up to 4000. It's a fairly complex schema though to be fair. So I would put the default a bit higher maybe, but fair point. I will try that out in a separate PR.

@MoritzPotthoffQC
Copy link
Contributor Author

Superseded by #167

@MoritzPotthoffQC MoritzPotthoffQC deleted the schema-sampling-validate-initial-overrides branch October 14, 2025 07:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants