fix(#1315): retry observation_flat row-count to avoid timing flake #1378

pratt4 · 2025-06-01T14:00:29Z

Added a retry loop around the observation_flat row-count in the validation script (e2e-tests/controller-spark/controller_spark_sql_validation.sh) to work around a timing flake (#1315). When the parquet-tools count for “observation_flat” does not immediately match the expected number, the script now retries up to 5 times, sleeping 5 seconds between attempts, before declaring failure.

fixes #1315
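
The retry pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `retry_until_count` and the idea of passing the counting command as trailing arguments are assumptions made for the example (the real script obtains the count with parquet-tools).

```shell
# Hypothetical sketch: poll a row count until it matches, with bounded retries.
#   $1 = expected count, $2 = max retries, $3 = seconds to sleep between tries
#   remaining args = a command that prints the current count (assumed here;
#   in the real script this would wrap the parquet-tools row count).
retry_until_count() {
  local expected="$1"
  local max_retries="$2"
  local sleep_secs="$3"
  shift 3
  local attempt=0
  local actual
  while :; do
    actual="$("$@")"
    if [ "${actual}" = "${expected}" ]; then
      echo "${actual}"
      return 0
    fi
    # Give up once the initial attempt plus max_retries retries are used.
    if [ "${attempt}" -ge "${max_retries}" ]; then
      echo "${actual}"
      return 1
    fi
    attempt=$((attempt + 1))
    echo "count=${actual}, expected=${expected};" \
      "retry ${attempt}/${max_retries}" >&2
    sleep "${sleep_secs}"
  done
}
```

With the values from the description, a call would look like `retry_until_count "${expected}" 5 5 some_count_command`, which fails only after the sixth mismatching count.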

google-cla · 2025-06-01T14:00:32Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

pratt4 · 2025-06-02T16:53:01Z

Hi @bashir2, could you please review this PR when you have a moment?

I've added a retry loop for checking observation_flat row-count to resolve intermittent E2E test failures caused by delays in view materialization (as outlined in issue #1315). I noticed the count was incorrect due to lag, triggering false negatives.

P.S. I hadn’t seen any activity from the original assignee for months, so I thought I’d go ahead and raise a PR to move it forward.
Thanks!!!

codecov-commenter · 2025-06-11T19:24:19Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (master@6601031). Learn more about missing BASE report.
Report is 3 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff            @@
##             master    #1378   +/-   ##
=========================================
  Coverage          ?   46.76%           
  Complexity        ?      666           
=========================================
  Files             ?       90           
  Lines             ?     5808           
  Branches          ?      799           
=========================================
  Hits              ?     2716           
  Misses            ?     2805           
  Partials          ?      287


bashir2

Thanks @pratt4 for your contribution; please take a look at my comments below:

bashir2 · 2025-06-11T20:00:24Z

e2e-tests/controller-spark/controller_spark_sql_validation.sh

+      done
+
+    else
+      # If no VIEWS_TIMESTAMP_*/observation_flat folder, check for a direct observation_flat (JDBC mode)


Why is this else part needed? Flat views should go under VIEWS_TIMESTAMP_* sub-dirs. BTW, when you extend this approach to other metrics (as I suggested above), you should get the location of the directory with Parquet files as an argument to that function you create.

Is this resolved? I think this is related to that : separator feature, which I still think is not required.

pratt4 · 2025-06-16T19:15:21Z

Hey @bashir2
thanks for the thorough review on this PR!
I’ve pushed a fresh commit that addresses everything you flagged:

Reusable helper: e2e-tests/lib/parquet_utils.sh now houses a single retry_rowcount() function. All six metrics in controller_spark_sql_validation.sh call it, so the big inline loop is gone. (The ROWCOUNT_SLEEP_SECS and ROWCOUNT_MAX_RETRIES variables can be overridden in CI if needed.)
Style cleanup.
Next steps for the pipeline script: I opened issue #1394, “Refactor pipeline_validation.sh to use shared retry_rowcount() helper”, to port the same logic to pipeline_validation.sh. I’ll open that follow-up PR right after this one lands so we can keep the history tidy and maintain separation of concerns.
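
The CI override mentioned in the list above could work along these lines. This is only a sketch: the default values of 5 are taken from the PR description, and the function name `apply_rowcount_defaults` is hypothetical; the actual helper may set its defaults differently.

```shell
# Hypothetical sketch: fall back to defaults only when CI has not already
# set the variables. "${VAR:-default}" keeps any non-empty existing value.
apply_rowcount_defaults() {
  # Assumed defaults, taken from the PR description (5 retries, 5 s apart).
  ROWCOUNT_MAX_RETRIES="${ROWCOUNT_MAX_RETRIES:-5}"
  ROWCOUNT_SLEEP_SECS="${ROWCOUNT_SLEEP_SECS:-5}"
}
```

Under this scheme a CI job would export, for example, `ROWCOUNT_SLEEP_SECS=10` before running the validation script, and the helper would pick that value up instead of the default.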

I’m still pretty new to the project, so please let me know if anything else looks off or if there’s a preferred way you’d like these changes structured. Happy to tweak as needed.

Thanks again for your time!

bashir2

Thanks @pratt4 for the changes, this looks much better. I just made some minor suggestions. BTW, when you are done making the changes, please make sure that all comments are resolved, or replied to if they need more clarification/discussion.

bashir2 · 2025-06-19T19:05:26Z

e2e-tests/controller-spark/controller_spark_sql_validation.sh

+
+    local total_obs_flat
+    total_obs_flat=$(retry_rowcount \
+      "${output}/*/VIEWS_TIMESTAMP_*/observation_flat/":"${output}/*/observation_flat/" \


Why do we need to check ${output}/*/observation_flat/ too? All flat views (including observations) should be in a VIEWS_TIMESTAMP_* subdir.

bashir2 · 2025-06-19T19:08:13Z

e2e-tests/lib/parquet_utils.sh

+
+    # ── 6. Sleep & retry
+    retries=$((retries + 1))
+    echo "E2E TEST: [${label}] raw=${raw_count}, expected=${expected} — retry ${retries}/${max_retries} in ${sleep_secs}s" >&2


nit: please break long lines, here and everywhere else, unless there is good reason not to do so (style guide rule).
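
One way to follow this suggestion is to move the message into a small logging function and split the arguments across continuation lines. The sketch below is illustrative only: `log_retry` is a hypothetical name, and the message wording is similar to, not identical to, the line in the diff.

```shell
# Hypothetical sketch: a retry message similar to the one in the diff,
# written so that no source line exceeds 80 columns.
log_retry() {
  local label="$1"
  local raw_count="$2"
  local expected="$3"
  local retries="$4"
  local max_retries="$5"
  local sleep_secs="$6"
  # printf takes the format once; the values follow on continuation lines.
  printf 'E2E TEST: [%s] raw=%s, expected=%s; retry %s/%s in %ss\n' \
    "${label}" "${raw_count}" "${expected}" \
    "${retries}" "${max_retries}" "${sleep_secs}" >&2
}
```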

bashir2 · 2025-06-19T19:37:38Z

e2e-tests/lib/parquet_utils.sh

+  local raw_count=0
+  local final_count=0
+
+  IFS=':' read -r -a paths <<<"${globs}"


I commented in the other file on whether this multiple-paths option is really needed. I think it is not, and if that is the case, then I suggest that we drop this : separator option and simplify this function.
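
To illustrate the simplification being suggested, a helper that accepts a single directory glob needs no IFS splitting at all. The sketch below is an assumption about what that could look like, not code from the PR; `list_parquet_files` is a hypothetical name, and it assumes the paths contain no whitespace.

```shell
# Hypothetical sketch: list Parquet files under ONE directory glob, with no
# ':'-separated alternatives to split. Assumes paths without whitespace.
list_parquet_files() {
  local dir_glob="$1"
  local f
  # dir_glob is left unquoted on purpose so the shell expands the wildcards.
  for f in ${dir_glob}/*.parquet; do
    # An unmatched glob stays literal, so skip entries that do not exist.
    [ -e "${f}" ] && echo "${f}"
  done
  return 0
}
```

A caller would pass only the `VIEWS_TIMESTAMP_*` location, for example `list_parquet_files "${output}/*/VIEWS_TIMESTAMP_*/observation_flat"`.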

bashir2 · 2025-06-19T19:38:47Z

e2e-tests/lib/parquet_utils.sh

+
+    # ── 2. Normalise raw_count
+    if [[ -z "${raw_count}" || ! "${raw_count}" =~ ^[0-9]+$ ]]; then
+      final_count=0


Please log an error message in this case.
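
A sketch of what logging in this branch might look like is shown below. The function name `normalise_count` and the exact message text are hypothetical; the check uses a portable `case` pattern rather than the `[[ ... =~ ... ]]` test from the diff, but accepts the same inputs (non-empty strings of digits).

```shell
# Hypothetical sketch: print the count if it is a non-negative integer;
# otherwise log an error to stderr and fall back to 0.
normalise_count() {
  local raw_count="$1"
  case "${raw_count}" in
    '' | *[!0-9]*)
      # Empty, or contains a character that is not a digit.
      echo "E2E TEST: ERROR: non-numeric row count '${raw_count}'; using 0" >&2
      echo 0
      ;;
    *)
      echo "${raw_count}"
      ;;
  esac
}
```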

bashir2 · 2025-06-19T19:40:54Z

e2e-tests/controller-spark/controller_spark_sql_validation.sh

+      done
+
+    else
+      # If no VIEWS_TIMESTAMP_*/observation_flat folder, check for a direct observation_flat (JDBC mode)


Is this resolved? I think this is related to that : separator feature, which I still think is not required.

fix(google#1315): retry observation_flat row-count to avoid timing flake

69b25f5

pratt4 force-pushed the fix-1315-retry-obs branch from 84102d0 to 69b25f5 Compare June 1, 2025 14:48

fixed npe in the pipeline

1841ac7

pratt4 mentioned this pull request Jun 7, 2025

E2E flakiness due to syncing to a FHIR server in the FHIR-Search mode #1315

Open

bashir2 self-requested a review June 11, 2025 19:22

bashir2 reviewed Jun 11, 2025

pratt4 and others added 4 commits June 16, 2025 22:44

changes related to review comments

fac198a

Merge branch 'master' into fix-1315-retry-obs

52f66ec

changes related to review comments

da4be86

Merge remote-tracking branch 'origin/fix-1315-retry-obs' into fix-131…

d7624f8

…5-retry-obs

pratt4 mentioned this pull request Jun 16, 2025

Refactor pipeline_validation.sh to use shared retry_rowcount() helper #1394

Open

bashir2 reviewed Jun 19, 2025
