Skip to content

fix(#1315): retry observation_flat row-count to avoid timing flake #1378

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

pratt4
Copy link

@pratt4 pratt4 commented Jun 1, 2025

Added a retry loop around the observation_flat row-count in the validation script (e2e-tests/controller-spark/controller_spark_sql_validation.sh) to work around a timing flake (#1315). Now the script will attempt up to 5 retries (sleeping 5 seconds each) when the “observation_flat” parquet-tools count does not immediately match the expected number, before declaring failure.

fixes #1315

Copy link

google-cla bot commented Jun 1, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@pratt4 pratt4 force-pushed the fix-1315-retry-obs branch from 84102d0 to 69b25f5 Compare June 1, 2025 14:48
@pratt4
Copy link
Author

pratt4 commented Jun 2, 2025

Hi @bashir2, could you please review this PR when you have a moment?

I've added a retry loop for checking observation_flat row-count to resolve intermittent E2E test failures caused by delays in view materialization (as outlined in issue #1315). I noticed the count was incorrect due to lag, triggering false negatives.

ps: didnt seen activity from the original assignee for months, so I thought I’d go ahead and raise a PR to move it forward.
Thanks!!!

@codecov-commenter
Copy link

codecov-commenter commented Jun 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (master@6601031). Learn more about missing BASE report.
Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master    #1378   +/-   ##
=========================================
  Coverage          ?   46.76%           
  Complexity        ?      666           
=========================================
  Files             ?       90           
  Lines             ?     5808           
  Branches          ?      799           
=========================================
  Hits              ?     2716           
  Misses            ?     2805           
  Partials          ?      287           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pratt4 for your contribution; please take a look at my comments below:

done

else
# If no VIEWS_TIMESTAMP_*/observation_flat folder, check for a direct observation_flat (JDBC mode)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this else part needed? Flat views should go under VIEWS_TIMESTAMP_* sub-dirs. BTW, when you extend this approach to other metrics (as I suggested above), you should get the location of the directory with Parquet files as an argument to that function you create.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this resolved? I think this is related to that : separator feature, which I still think is not required.

@pratt4
Copy link
Author

pratt4 commented Jun 16, 2025

Hey @bashir2
thanks for the thorough review on this PR!
I’ve pushed a fresh commit that addresses everything you flagged:

  1. Reusable helper e2e-tests/lib/parquet_utils.sh now houses a single retry_rowcount() function. All six metrics in controller_spark_sql_validation.sh call it, so the big inline loop is gone.
    (ROWCOUNT_SLEEP_SECS, ROWCOUNT_MAX_RETRIES -- these variables can be overridden in CI if needed)

  2. Style cleanup 

  3. Next steps for the pipeline script: Opened issue Refactor pipeline_validation.sh to use shared retry_rowcount() helper #1394 to port the same logic to pipeline_validation.sh. I’ll open that follow-up PR right after this one land so we can keep the history tidy and maintain separation of concerns.

I’m still pretty new to the project, so please let me know if anything else looks off or if there’s a preferred way you’d like these changes structured. Happy to tweak as needed

thanks again for your time!

Copy link
Collaborator

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pratt4 for the changes, this looks much better. I just made some minor suggestions. BTW, when you are done doing the changes, please make sure that all comments are resolved or replied if they need more clarification/discussion.


local total_obs_flat
total_obs_flat=$(retry_rowcount \
"${output}/*/VIEWS_TIMESTAMP_*/observation_flat/":"${output}/*/observation_flat/" \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to check ${output}/*/observation_flat/ too? All flat views (including observations) should be in a VIEWS_TIMESTAMP_* subdir.


# ── 6. Sleep & retry
retries=$((retries + 1))
echo "E2E TEST: [${label}] raw=${raw_count}, expected=${expected} — retry ${retries}/${max_retries} in ${sleep_secs}s" >&2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please break long lines, here and everywhere else, unless there is good reason not to do so (style guide rule).

local raw_count=0
local final_count=0

IFS=':' read -r -a paths <<<"${globs}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented in the other file whether this multiple paths option is really needed. I think it is not, and if that is the case, then I suggest the we drop this : separator option and simplify this function.


# ── 2. Normalise raw_count
if [[ -z "${raw_count}" || ! "${raw_count}" =~ ^[0-9]+$ ]]; then
final_count=0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please log an error message in this case.

done

else
# If no VIEWS_TIMESTAMP_*/observation_flat folder, check for a direct observation_flat (JDBC mode)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this resolved? I think this is related to that : separator feature, which I still think is not required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

E2E flakiness due to syncing to a FHIR server in the FHIR-Search mode
3 participants