feat: Validate initial override values against column-level validation rules during schema sampling #166
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff            @@
##               main      #166   +/- ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           51        51
  Lines         2907      2919   +12
=========================================
+ Hits         2907      2919   +12
Thanks for starting work on this; it has been bothering me forever! :) One potential thought: we could lower the default number of sampling iterations (to reach the maximum faster) and raise the validation error from the very last iteration, i.e. run …
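For concreteness, here is a minimal sketch of that suggestion (all names here are hypothetical, not the library's actual API): retry sampling up to a lower maximum and, if the last candidate still violates rules, raise an error that names the failing rules.

```python
from typing import Callable

class SamplingError(Exception):
    """Raised when sampling cannot satisfy the schema's rules (illustrative)."""

def sample_with_hint(
    draw: Callable[[], object],                   # draws one candidate data frame
    failing_rules: Callable[[object], set[str]],  # rule names the candidate violates
    max_iterations: int = 100,                    # lowered default, per the suggestion
) -> object:
    violations: set[str] = set()
    for _ in range(max_iterations):
        candidate = draw()
        violations = failing_rules(candidate)
        if not violations:
            return candidate
    # Raise the validation error from the very last iteration: tell the user
    # which rules were still failing instead of giving up silently.
    raise SamplingError(
        f"sampling did not converge after {max_iterations} iterations; "
        f"rules still failing on the last candidate: {sorted(violations)}"
    )
```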
Interesting! I like the idea of validating when we reach the maximum number of iterations, because that could give the user useful hints about which rules they need to fix (independently of the specific issues in this PR). I am not sure that lowering the default number of sampling iterations would make the changes in this PR unnecessary, and I lack insight into how many sampling iterations are usually needed: in some instances, running into the maximum number of iterations has taken quite a bit of time for me, so to get quick feedback, we would have to lower the default by a lot. That would be nice for reducing code complexity (because the changes here would not be needed), but if a few thousand iterations usually help, I would prefer to also have the fix here and get immediate feedback.
I'd say that the current maximum of 10,000 iterations is unreasonably high. I have the feeling that we rarely need more than 100 iterations (and if you do, sampling is usually so slow that you want to fix it regardless). IMO, this would be sufficiently fast to justify reducing the complexity here 🤔
I checked how many iterations my tests typically need. With some additional hints that I had already added for …
Superseded by #167
Motivation
When sampling data frames from schemas, I often end up with sampling operations that fail because some of the overrides I used do not comply with the column rules (e.g., strings that do not match a column's regex). In such cases, it is hard to distinguish these easy-to-fix mistakes from situations in which the schema is genuinely hard to sample and would require more overrides. While we cannot validate the overrides against the entire schema (as that would typically fail), we can at least check them against column-level rules, which makes such issues much easier to spot.
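As an illustration (hypothetical, simplified names; not the actual implementation in this PR), checking override values against a column's own regex rule up front turns a long, fruitless sampling run into an immediate, actionable error:

```python
import re

# Column-level rules, reduced to a single regex per column for illustration.
COLUMN_REGEX = {"user_id": re.compile(r"u-\d{4}")}

def validate_overrides(overrides: dict[str, list[str]]) -> None:
    """Reject override values that violate their column's regex rule."""
    for column, values in overrides.items():
        pattern = COLUMN_REGEX.get(column)
        if pattern is None:
            continue
        bad = [value for value in values if not pattern.fullmatch(value)]
        if bad:
            raise ValueError(
                f"override values for column {column!r} do not match regex "
                f"{pattern.pattern!r}: {bad}"
            )

# "user_1" fails immediately with a clear message, before any sampling:
# validate_overrides({"user_id": ["u-0001", "user_1"]})
```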
Changes