-
Notifications
You must be signed in to change notification settings - Fork 106
fix(#1315): retry observation_flat row-count to avoid timing flake #1378
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
84102d0
to
69b25f5
Compare
Hi @bashir2, could you please review this PR when you have a moment? I've added a retry loop for checking observation_flat row-count to resolve intermittent E2E test failures caused by delays in view materialization (as outlined in issue #1315). I noticed the count was incorrect due to lag, triggering false negatives. ps: didnt seen activity from the original assignee for months, so I thought I’d go ahead and raise a PR to move it forward. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #1378 +/- ##
=========================================
Coverage ? 46.76%
Complexity ? 666
=========================================
Files ? 90
Lines ? 5808
Branches ? 799
=========================================
Hits ? 2716
Misses ? 2805
Partials ? 287 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pratt4 for your contribution; please take a look at my comments below:
done | ||
|
||
else | ||
# If no VIEWS_TIMESTAMP_*/observation_flat folder, check for a direct observation_flat (JDBC mode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this else
part needed? Flat views should go under VIEWS_TIMESTAMP_*
sub-dirs. BTW, when you extend this approach to other metrics (as I suggested above), you should get the location of the directory with Parquet files as an argument to that function you create.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this resolved? I think this is related to that :
separator feature, which I still think is not required.
Hey @bashir2
I’m still pretty new to the project, so please let me know if anything else looks off or if there’s a preferred way you’d like these changes structured. Happy to tweak as needed thanks again for your time! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pratt4 for the changes, this looks much better. I just made some minor suggestions. BTW, when you are done doing the changes, please make sure that all comments are resolved or replied if they need more clarification/discussion.
|
||
local total_obs_flat | ||
total_obs_flat=$(retry_rowcount \ | ||
"${output}/*/VIEWS_TIMESTAMP_*/observation_flat/":"${output}/*/observation_flat/" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to check ${output}/*/observation_flat/
too? All flat views (including observations) should be in a VIEWS_TIMESTAMP_*
subdir.
|
||
# ── 6. Sleep & retry | ||
retries=$((retries + 1)) | ||
echo "E2E TEST: [${label}] raw=${raw_count}, expected=${expected} — retry ${retries}/${max_retries} in ${sleep_secs}s" >&2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: please break long lines, here and everywhere else, unless there is good reason not to do so (style guide rule).
local raw_count=0 | ||
local final_count=0 | ||
|
||
IFS=':' read -r -a paths <<<"${globs}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I commented in the other file whether this multiple paths option is really needed. I think it is not, and if that is the case, then I suggest the we drop this :
separator option and simplify this function.
|
||
# ── 2. Normalise raw_count | ||
if [[ -z "${raw_count}" || ! "${raw_count}" =~ ^[0-9]+$ ]]; then | ||
final_count=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please log an error message in this case.
done | ||
|
||
else | ||
# If no VIEWS_TIMESTAMP_*/observation_flat folder, check for a direct observation_flat (JDBC mode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this resolved? I think this is related to that :
separator feature, which I still think is not required.
Added a retry loop around the observation_flat row-count in the validation script (e2e-tests/controller-spark/controller_spark_sql_validation.sh) to work around a timing flake (#1315). Now the script will attempt up to 5 retries (sleeping 5 seconds each) when the “observation_flat” parquet-tools count does not immediately match the expected number, before declaring failure.
fixes #1315