
test(research): record full-lifecycle Deep Research cassette reaching completed #1566

Merged
teng-lin merged 3 commits into main from feat/deep-research-completed-cassette on Jun 12, 2026

@teng-lin teng-lin commented Jun 12, 2026

Owner

Summary

The committed tests/cassettes/research_deep_poll_long.yaml was deliberately scoped to START_DEEP_RESEARCH + 6 in-progress POLL_RESEARCH calls and never reached a terminal state, so research wait-to-completion against a real deep outcome was uncovered. Two earlier full-lifecycle recording attempts (2026-05-15) died with httpx.PoolTimeout ~15-20 min into the idle poll window.

This PR re-records the full Deep Research lifecycle to completed and wires it into the replay test.

What changed

Poll-until-terminal restructure

_poll_until_terminal replaces the old fixed-count _poll_n_times: it polls research.poll every 30s until a terminal ResearchStatus (COMPLETED/FAILED), capped at 80 polls (~40 min) so a hung run can't spin forever. The replay test now asserts the final status is ResearchStatus.COMPLETED (enum membership, not a string), and a separate cassette-inspection test decodes the last POLL_RESEARCH body through the project's own decode_response + parse_research_task_models to confirm the terminal state independently of the live client.
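
The loop shape described above can be sketched as follows. This is a minimal illustration, not the helper from the test module: the `ResearchStatus` class here is a stand-in for the project's enum, and `poll_until_terminal`, its `poll` callable, and its keyword arguments are hypothetical names. Only the 30s interval, the 80-poll cap, and the COMPLETED/FAILED terminal set come from the description.

```python
import asyncio
import enum
from typing import Awaitable, Callable


class ResearchStatus(enum.Enum):
    # Stand-in for the project's enum; member names follow the PR text.
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


TERMINAL_STATUSES = frozenset({ResearchStatus.COMPLETED, ResearchStatus.FAILED})
POLL_INTERVAL_SECONDS = 30.0  # interval quoted in the PR description
MAX_POLLS = 80                # cap quoted in the PR description (~40 min)


async def poll_until_terminal(
    poll: Callable[[], Awaitable[ResearchStatus]],
    *,
    interval: float = POLL_INTERVAL_SECONDS,
    max_polls: int = MAX_POLLS,
) -> list[ResearchStatus]:
    """Poll until a terminal status is seen; raise once the cap is exhausted."""
    seen: list[ResearchStatus] = []
    for _ in range(max_polls):
        status = await poll()
        seen.append(status)
        if status in TERMINAL_STATUSES:  # enum membership, not a string compare
            return seen
        await asyncio.sleep(interval)
    raise TimeoutError(f"no terminal status after {max_polls} polls")


async def run_scripted(statuses, **kwargs):
    """Drive the loop with a scripted fake poll (illustration only)."""
    script = iter(statuses)

    async def fake_poll() -> ResearchStatus:
        return next(script)

    return await poll_until_terminal(fake_poll, **kwargs)
```

Calling `asyncio.run(run_scripted([...], interval=0))` with a scripted status sequence exercises the loop without waiting; a hard cap that raises keeps a hung run from spinning forever.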

PoolTimeout-resilient recording client (record mode only)

The previous PoolTimeout was stale-connection reuse: the default 30s keepalive let the pool hand back a half-dead connection on a later poll across the multi-minute idle gap. In record mode the client is now built with a generous 60s pool-acquire/read timeout and a short 10s keepalive_expiry, so an idle connection is closed proactively and a fresh one is opened on the next poll. Both knobs go through the public constructor (timeout= → httpx's pool acquire timeout; limits=ConnectionLimits(keepalive_expiry=...)), so no private seam is needed and replay is unaffected. A @pytest.mark.timeout(3600) overrides the global 60s pytest-timeout for the long live recording.
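
A rough sketch of the record-only gating, assuming the helper simply returns extra constructor keyword arguments. The `ConnectionLimits` dataclass below is a stand-in for the project's type (only `keepalive_expiry` is taken from the text above), and the `record_mode` parameter is a hypothetical simplification of however the test module detects record mode.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ConnectionLimits:
    # Stand-in for the project's ConnectionLimits; only this field is
    # mentioned in the PR description.
    keepalive_expiry: float


RECORD_TIMEOUT_SECONDS = 60.0           # generous pool-acquire/read timeout
RECORD_KEEPALIVE_EXPIRY_SECONDS = 10.0  # close idle connections quickly


def recording_client_kwargs(record_mode: bool) -> dict[str, Any]:
    """Return extra client kwargs while recording; nothing during replay."""
    if not record_mode:
        # Replay never touches the network, so the default client is kept.
        return {}
    return {
        "timeout": RECORD_TIMEOUT_SECONDS,
        "limits": ConnectionLimits(
            keepalive_expiry=RECORD_KEEPALIVE_EXPIRY_SECONDS
        ),
    }
```

Because the keepalive expiry (10s) is shorter than the poll interval (30s), every poll should open a fresh connection instead of reusing one that went stale during the idle gap.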

Task-id pinning fix (load-bearing)

Verified live: Deep Research's start() task_id does not equal the task_id the poll endpoint reports for the same task. Threading start's id made every poll return NOT_FOUND. The loop now mirrors wait_for_completion — first poll unfiltered, then pins the POLL-reported id forward.
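
The pinning pattern can be illustrated with a small synchronous sketch. Everything here is hypothetical scaffolding: `PollResult`, `poll_with_pinning`, and the fake backend are invented for illustration, and string statuses stand in for the real enum. The only behaviour taken from the PR is "first poll unfiltered, then pin the poll-reported id" and "filtering by start's id yields NOT_FOUND".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PollResult:
    # Hypothetical result shape; the real model is not shown in this PR.
    status: str
    task_id: Optional[str]


def poll_with_pinning(
    poll: Callable[[Optional[str]], PollResult], max_polls: int
) -> tuple[PollResult, Optional[str]]:
    """First poll is unfiltered; later polls filter by the poll-reported id."""
    pinned: Optional[str] = None
    for _ in range(max_polls):
        result = poll(pinned)
        if pinned is None and result.task_id:
            pinned = result.task_id  # pin the id the poll endpoint reports
        if result.status in {"completed", "failed"}:
            return result, pinned
    raise TimeoutError(f"no terminal status after {max_polls} polls")


def make_fake_poll(poll_side_id: str, statuses: list[str]):
    """Fake backend that only knows the task under its poll-side id."""
    remaining = iter(statuses)

    def fake_poll(task_id_filter: Optional[str]) -> PollResult:
        if task_id_filter not in (None, poll_side_id):
            return PollResult("not_found", None)  # e.g. start()'s id
        return PollResult(next(remaining), poll_side_id)

    return fake_poll
```

With this fake, passing the id returned by `start()` as the filter always produces `not_found`, while the unfiltered-then-pinned sequence reaches `completed`.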

Trimming + scrubbing

The full recording's accumulating in-progress report bodies grew to ~400 KB (the cassette scrubber is super-linear on large markdown payloads, which blew up replay time). Trimmed to the in_progress → completed skeleton (3 polls) plus the create / add_text / start / delete lifecycle. The cassette was scrubbed on record and verified clean — cassette guards pass and a manual grep found no live cookies, OAuth/Bearer tokens, API keys, account email, or SAPISIDHASH. The scratch notebook self-cleans via try/finally.
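
A manual credential grep of the kind mentioned above might look like the sketch below. The patterns are illustrative assumptions, not the project's scrubber rules or cassette guards, and a real check would need patterns tuned to the actual auth headers and cookies in use.

```python
import re

# Illustrative patterns only; these are assumptions, not the project's guards.
LEAK_PATTERNS = {
    "bearer_token": re.compile(r"Bearer\s+[A-Za-z0-9._\-]{20,}"),
    "sapisidhash": re.compile(r"SAPISIDHASH\s+\d+_[0-9a-f]{40}"),
    "email": re.compile(r"[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}"),
}


def find_leaks(text: str) -> dict[str, list[str]]:
    """Return every pattern name that matched, with the matched strings."""
    hits: dict[str, list[str]] = {}
    for name, pattern in LEAK_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits
```

An empty result from `find_leaks(cassette_text)` is only as strong as the patterns; it complements the name-anchored scrubber rather than replacing it.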

Results

  • Live recording reached completed (no_research/in_progress → completed; PoolTimeout never recurred).
  • Final cassette: 0.815 MB (well under the 5 MB cap), 11 interactions.
  • Replay: ~12s, 3 tests pass.
  • tests/_guardrails/ (1022 tests), mypy src/notebooklm, and pre-commit run --all-files all clean.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Rewrote Deep Research polling integration tests to record and replay the full START → polls → completed lifecycle.
    • Tests now assert the run reaches a terminal COMPLETED status and that the poll-reported task identifier stabilizes once present.
    • Updated cassette verification and guidance to support trimming redundant intermediate polls for replay size limits.

… completed

The prior `research_deep_poll_long.yaml` was scoped to START_DEEP_RESEARCH + 6
in-progress POLL_RESEARCH calls and never reached a terminal state, leaving
`research wait`-to-completion uncovered. Two earlier full-lifecycle recording
attempts died with `httpx.PoolTimeout` ~15-20 min into the idle poll window.

This re-records the full lifecycle to `completed`:

- Poll-until-terminal loop: polls `research.poll` every 30s until a terminal
  `ResearchStatus` (COMPLETED/FAILED), capped at 80 polls. Replay asserts the
  final status is `ResearchStatus.COMPLETED` via enum membership, and a
  cassette-inspection test decodes the last POLL body through the real
  decoder + parser to confirm the terminal state independently of the client.

- PoolTimeout-resilient recording client (record mode only): a generous
  60s pool-acquire/read timeout plus a short 10s keepalive_expiry so an
  idle connection across the multi-minute poll gap is closed proactively and
  a fresh one is opened on the next poll. Both knobs go through the public
  constructor (`timeout=` -> httpx pool timeout; `limits=ConnectionLimits(
  keepalive_expiry=...)`), so no private seam is needed and replay is
  unaffected. A `@pytest.mark.timeout(3600)` overrides the global 60s
  pytest-timeout for the long live recording.

- Task-id pinning fix: Deep Research's `start()` task_id does NOT equal the
  poll-reported task_id (verified live), so the loop mirrors
  `wait_for_completion` — first poll unfiltered, then pins the POLL-reported
  id. Threading start's id made every poll return NOT_FOUND.

- Trimming: the full recording's accumulating in_progress report bodies blew
  past a usable replay time (the cassette scrubber is super-linear on the
  large markdown payloads). Trimmed to the in_progress -> completed skeleton
  (3 polls) plus the create/add_text/start/delete lifecycle. Final cassette
  is 0.815 MB (well under the 5 MB cap); replay runs in ~12s.

Cassette scrubbed on record and verified clean: cassette guards pass and a
manual grep found no live cookies, OAuth/Bearer tokens, API keys, account
email, or SAPISIDHASH. The scratch notebook is self-cleaned via try/finally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@teng-lin

Owner Author

@claude review

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 19825892-35fd-40f9-9ef2-c042b473008a

📥 Commits

Reviewing files that changed from the base of the PR and between 1599f1c and 7627b68.

📒 Files selected for processing (2)
  • tests/cassettes/research_deep_poll_long.yaml
  • tests/integration/test_research_deep_poll_vcr.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/test_research_deep_poll_vcr.py

📝 Walkthrough

Walkthrough

This PR refactors a Deep Research integration test to capture the full poll-until-terminal lifecycle instead of fixed-iteration polling, adds record-mode client hardening, implements a _poll_until_terminal helper with task-id pinning and max caps, strengthens cassette decoding/validation, and updates related docs and fixtures.

Changes

Deep Research Polling Lifecycle Test

| Layer / File(s) | Summary |
| --- | --- |
| Module documentation and test foundation (tests/integration/test_research_deep_poll_vcr.py) | Module docstring and supporting documentation updated to describe the full START → terminal POLL_RESEARCH lifecycle. Sleep-mocking and recording/replay/trimming guidance clarified. Imports and types added for record-mode client configuration and ResearchStatus constants. |
| Polling configuration and record-mode helpers (tests/integration/test_research_deep_poll_vcr.py) | Terminal status constants, a maximum poll cap, and recording-only timeouts introduced. New _is_record_mode() and _recording_client_kwargs() helpers provide record-mode detection and client hardening (timeouts, connection limits, keepalive) during recording. |
| Poll-until-terminal helper and fixture updates (tests/integration/test_research_deep_poll_vcr.py) | New _poll_until_terminal() replaces the removed _poll_n_times(), polling until a terminal ResearchStatus is reached with first-appearance task ID pinning and hard cap enforcement. fast_sleep fixture now consults _is_record_mode() for replay vs record behavior. |
| Main test lifecycle and assertions (tests/integration/test_research_deep_poll_vcr.py) | Test now executes the full lifecycle: start Deep Research, poll until terminal using the new helper, verify task ID becomes stable once present, and assert the final decoded result is ResearchStatus.COMPLETED. Old fixed-poll-count logic removed. |
| Cassette inspection and strengthened validation (tests/integration/test_research_deep_poll_vcr.py) | Cassette parsing improved to track POLL_RESEARCH interactions and their bodies. Final POLL_RESEARCH response now decoded through the real decode_response + parse_research_task_models pipeline and asserted to be ResearchStatus.COMPLETED. Size-cap documentation updated to reference trimming redundant middle in_progress polls. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • teng-lin/notebooklm-py#645: Earlier refactor of the same test module that introduced _poll_n_times, which this PR replaces with _poll_until_terminal and enhanced cassette decoding validation.

Poem

🐰 I hopped from START through polls galore,
Pinned task IDs steady, till COMPLETED at the core,
Cassettes replay quickly, sleeps turned light,
The lifecycle now recorded and stitched up right. 🎬✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: recording a full-lifecycle Deep Research cassette reaching completed state. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@claude

claude Bot commented Jun 12, 2026

Claude finished @teng-lin's task in 3m 4s


Code Review — PR #1566: Full-lifecycle Deep Research cassette

  • Load tools
  • Read changed test file
  • Read related source files (_research.py, _research_task_parser.py, _types/research.py, vcr_config.py)
  • Analyze cassette structure
  • Post review findings

This is a well-designed PR that solves a genuine infrastructure problem (PoolTimeout during recording, task-id mismatch between start and poll). The approach — record-only resilience config, independent cassette-inspection test that decodes through the real pipeline, task-id pinning mirroring wait_for_completion — is all sound. The documentation is thorough. A few observations below, roughly in priority order.


_poll_until_terminal: NOT_FOUND and NO_RESEARCH are not terminal

# tests/integration/test_research_deep_poll_vcr.py:138
_TERMINAL_STATUSES: frozenset[ResearchStatus] = frozenset(
    {ResearchStatus.COMPLETED, ResearchStatus.FAILED}
)

After pinned_task_id is set, if the poll endpoint returns NOT_FOUND (task evicted / transient absence) or NO_RESEARCH, the loop silently continues until _MAX_POLLS rather than breaking or raising. This intentionally mirrors wait_for_completion (treating absence as replication lag), but the docstring only mentions the terminal-status break condition. A reader seeing NOT_FOUND come back mid-loop during a debug session won't immediately understand why the loop kept going.

Suggestion: Add one sentence to the _poll_until_terminal docstring: "NOT_FOUND and NO_RESEARCH are not treated as terminal — they continue the loop (mirrors wait_for_completion's replication-lag policy)." This is a docs-only nit; the logic is correct by design.


Redundant @pytest.mark.vcr on the test method

# line 124
pytestmark = [pytest.mark.vcr, skip_no_cassettes]

# line 356-358 — @pytest.mark.vcr is redundant here
@pytest.mark.timeout(_RECORD_TEST_TIMEOUT_SECONDS)
@pytest.mark.vcr
@pytest.mark.asyncio
@notebooklm_vcr.use_cassette(CASSETTE_NAME)
async def test_deep_research_polling_loop(self, fast_sleep: None) -> None:

pytest.mark.vcr is already on the module via pytestmark, so the per-method decorator is redundant. It does no harm (markers are additive), but it could mislead a reader into thinking the module-level mark is absent.


fast_sleep acquires monkeypatch unnecessarily in record mode

# line 271
def fast_sleep(monkeypatch: pytest.MonkeyPatch) -> None:
    if _is_record_mode():
        return  # monkeypatch acquired but never used
    ...
    monkeypatch.setattr(asyncio, "sleep", instant_sleep)

In record mode monkeypatch is resolved by pytest (a non-trivial fixture that installs its own finalizer) but then never used. Functionally harmless, but a minor waste.

One alternative: move the _is_record_mode() guard to the test itself by checking it in the test body before calling _poll_until_terminal, and making fast_sleep unconditionally patch. But since the existing code works correctly and the setup cost is trivial, this is a very low priority nit.


test_cassette_reaches_completed: cassette interaction ordering assumption

# lines 498-510
for interaction in cassette.get("interactions", []):
    ...
    if rpc_id == RPCMethod.POLL_RESEARCH.value:
        ...
        last_poll_body = string  # silently overwritten each iteration

This relies on the cassette's interactions list being in recording order, which is always true for VCR.py cassettes. But if a future hand-trimming accidentally reorders interactions (keeping completed before an in_progress entry by mistake), last_poll_body would silently land on the wrong response without any diagnostic.

Adding rpcids_sequence.index(RPCMethod.POLL_RESEARCH.value) offset tracking would catch this, but it's likely over-engineering for this use case. The minimum-count assertion (poll_count >= MIN_POLL_INTERACTIONS) combined with the terminal-status check already provides good coverage. No action required, just noting the dependency.


✅ Things that are explicitly correct

  • Task-id pinning logic (lines 333–339): The unfiltered first poll → pin the POLL-reported id pattern exactly mirrors wait_for_completion in _research.py:559–560. This is load-bearing for Deep Research and correct.
  • decode_response + parse_research_task_models in test_cassette_reaches_completed: Using the real decode pipeline to assert terminal status (rather than string-grepping the YAML) is the right approach — the wire format encodes status numerically.
  • _recording_client_kwargs() empty-in-replay guard: ConnectionLimits(keepalive_expiry=10.0) only applied in record mode; replay sees the default client config. Clean isolation.
  • try/finally cleanup: The DELETE_NOTEBOOK is in finally, ensuring scratch notebooks are deleted even on mid-recording failures.
  • Cassette scrubbing: Verified SCRUBBED placeholders on all cookies; no raw auth tokens visible. Notebook UUID in URLs is not a secret.
  • MIN_POLL_INTERACTIONS = 3 floor: Low enough for hand-trimmed skeletons (no_research → in_progress → completed) but still catches a broken recorder that captured only 1–2 polls.
  • test_cassette_under_size_cap with @pytest.mark.allow_no_vcr: Correctly skips when the cassette is absent rather than failing.

Summary

Two actionable items (both minor):

  1. Docs nit: Add a note to _poll_until_terminal's docstring that NOT_FOUND/NO_RESEARCH don't break the loop by design.
  2. Cosmetic cleanup: Remove the redundant @pytest.mark.vcr from test_deep_research_polling_loop.

Neither blocks merge — the core logic, scrubbing, and test coverage are all sound. Approving in spirit.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the integration test test_research_deep_poll_vcr.py to record and replay the full-lifecycle Deep Research polling loop until it reaches a terminal state, replacing the previous scoped-down approach. It addresses previous httpx.PoolTimeout issues during recording by applying resilient client configurations (generous timeouts and short keepalive expiry) only in record mode. Additionally, it updates the polling logic to pin the task ID and assert that the final status is COMPLETED, while providing guidelines for trimming cassettes to stay under the 5 MB size limit. The code reviewer suggested adding a defensive assertion in _poll_until_terminal to verify that the poll response is not None before accessing its attributes, ensuring type safety and fail-fast behavior.

Comment thread tests/integration/test_research_deep_poll_vcr.py

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration/test_research_deep_poll_vcr.py`:
- Around line 413-452: The test must assert the cassette exercised the
pinned-task-id path by verifying a non-empty poll.task_id appears in a non-final
poll and is reused later; update the loop that tracks seen_poll_task_id (and/or
introduce a counter like pinned_task_id_count) to record when a non-final poll
exposes a task_id and assert afterwards that there are at least two polls with
the same non-empty task_id (i.e. the seen_poll_task_id was observed on a prior
non-terminal poll and again on a later poll), while keeping the existing
terminal-status checks (final, _TERMINAL_STATUSES, ResearchStatus.COMPLETED)
intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 94e99ab5-2021-4df2-b17c-cd63799c47e1

📥 Commits

Reviewing files that changed from the base of the PR and between 928eab3 and 1599f1c.

📒 Files selected for processing (2)
  • tests/cassettes/research_deep_poll_long.yaml
  • tests/integration/test_research_deep_poll_vcr.py

Comment thread tests/integration/test_research_deep_poll_vcr.py
claude added 2 commits June 12, 2026 09:30
The first trim round-tripped the cassette through yaml.safe_dump, which
re-wrapped the long URL-encoded request/response scalars differently from
VCR's native serializer (max line 224167 vs 222932). On Windows CI the
re-wrapped scalars parsed to slightly different bytes, so the CREATE_NOTEBOOK
(CCqFvf) request no longer matched its recorded entry and the replay raised
CannotOverwriteExistingCassetteException (Linux/macOS were unaffected).

Re-trim by slicing the VCR-native recording at interaction boundaries (no
yaml re-emission), so every kept interaction is a byte-exact slice of the
original recording. Verified: kept create/completed-poll/delete interactions
are identical to the source recording, cassette guards pass, replay is green
(~12s, 0.855 MB), and the cassette has no live credentials.
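
A sketch of trimming by slicing at interaction boundaries, assuming the cassette uses the common VCR YAML layout in which each interaction begins with a column-0 `- request:` line and a top-level key such as `version:` follows the list. `split_cassette` and `trim_cassette` are hypothetical helpers written for illustration, not the script used for this PR.

```python
import re


def split_cassette(text: str) -> tuple[str, list[str], str]:
    """Split cassette text into (header, interaction chunks, footer).

    Nothing is parsed or re-emitted, so joining the pieces reproduces
    the input exactly.
    """
    # Split on "\n" only and keep the line endings attached.
    lines = re.findall(r"[^\n]*\n|[^\n]+", text)
    starts = [i for i, line in enumerate(lines) if line.startswith("- request:")]
    if not starts:
        raise ValueError("no interactions found")
    end = len(lines)
    for i in range(starts[-1] + 1, len(lines)):
        # A non-indented, non-list, non-blank line is the next top-level key.
        if lines[i][:1] not in (" ", "-", "\n", "\r"):
            end = i
            break
    bounds = starts + [end]
    header = "".join(lines[: starts[0]])
    chunks = ["".join(lines[a:b]) for a, b in zip(bounds, bounds[1:])]
    footer = "".join(lines[end:])
    return header, chunks, footer


def trim_cassette(text: str, keep: list[int]) -> str:
    """Keep only the interactions at the given indexes, in the given order."""
    header, chunks, footer = split_cassette(text)
    return header + "".join(chunks[i] for i in keep) + footer


SAMPLE = (
    "interactions:\n"
    "- request:\n    body: a\n  response:\n    body: one\n"
    "- request:\n    body: b\n  response:\n    body: two\n"
    "- request:\n    body: c\n  response:\n    body: three\n"
    "version: 1\n"
)
```

Reading and writing the file with `newline=""` (or in binary mode) would be needed to keep line endings untouched on Windows; that detail is an assumption about the workflow, not something stated in the commit.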

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Require the cassette to exercise the task-id pinning path: assert at least
  two polls surface the same non-empty task_id, so a later filtered poll
  reuses the captured id (CodeRabbit: a cassette where task_id only appears on
  the terminal poll would skip the pinned-filter path). The recorded cassette
  has 3 in_progress/completed polls all carrying the same task_id.
- Add a defensive `assert poll is not None` in `_poll_until_terminal` to keep
  the fail-fast invariant explicit before attribute access (gemini).
- Document in `_poll_until_terminal` that NOT_FOUND / NO_RESEARCH are not
  terminal and continue the loop by design (claude review).
- Drop the redundant `@pytest.mark.vcr` from the test method; it is already
  applied module-wide via `pytestmark` (claude review).
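
The pinned-path coverage check in the first bullet could be sketched like this. `assert_pinning_path_exercised` is a hypothetical helper operating on a plain list of poll-reported ids; the actual test counts polls inside its cassette-decoding loop.

```python
from collections import Counter
from typing import Optional, Sequence


def assert_pinning_path_exercised(poll_task_ids: Sequence[Optional[str]]) -> int:
    """Require at least two polls that report the same non-empty task_id."""
    counts = Counter(tid for tid in poll_task_ids if tid)
    top = counts.most_common(1)
    repeats = top[0][1] if top else 0
    assert repeats >= 2, (
        f"expected >= 2 polls with the same task_id, observed {repeats}; "
        "a cassette like this never replays a filtered (pinned) poll"
    )
    return repeats
```

A cassette whose task_id appears only on the terminal poll fails this check, which is the gap the review comment pointed at.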

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@teng-lin

Owner Author

Thanks for the review @claude. Addressed both actionable items in 7627b68:

  1. Added a docstring note to _poll_until_terminal that NOT_FOUND / NO_RESEARCH are not terminal and continue the loop by design (mirrors wait_for_completion's replication-lag policy).
  2. Removed the redundant per-method @pytest.mark.vcr (already applied module-wide via pytestmark).

The fast_sleep record-mode monkeypatch nit and the cassette-ordering note were both flagged as no-action-required, so I left them as-is.

@claude

claude Bot commented Jun 12, 2026

Claude finished @teng-lin's task in 46s


Follow-up verification — 7627b68

  • Read current test file
  • Confirm docstring note on _poll_until_terminal
  • Confirm @pytest.mark.vcr removal

Both items from the earlier review are correctly in place in 7627b68.

1. NOT_FOUND/NO_RESEARCH docstring note (test_research_deep_poll_vcr.py:327–331)

    Only ``COMPLETED`` and ``FAILED`` break the loop. ``NOT_FOUND`` /
    ``NO_RESEARCH`` are NOT treated as terminal; they continue polling
    (mirrors ``wait_for_completion``'s replication-lag policy: a pinned task
    temporarily absent from a poll is a transient condition, not an end
    state), bounded by :data:`_MAX_POLLS`.

Exactly what was requested. The wording explains the why (replication-lag policy, transient condition) rather than just stating the behaviour, which is what makes it useful in a debug session.

2. Redundant @pytest.mark.vcr removed (test_research_deep_poll_vcr.py:367–369)

    # ``pytest.mark.vcr`` is applied module-wide via ``pytestmark`` — no need
    # to repeat it here.
    @pytest.mark.timeout(_RECORD_TEST_TIMEOUT_SECONDS)
    @pytest.mark.asyncio
    @notebooklm_vcr.use_cassette(CASSETTE_NAME)
    async def test_deep_research_polling_loop(self, fast_sleep: None) -> None:

Gone, and the inline comment makes the intention explicit for future readers.

Bonus: gemini-code-assist and CodeRabbit items also landed cleanly

  • assert poll is not None (line 345) — the fail-fast narrowing gemini requested is in.
  • polls_with_task_id >= 2 assertion (lines 459–463) — the CodeRabbit pinned-task-id coverage assertion is in, with a clear message that shows the observed count on failure.

Everything from the review is addressed. The PR looks good to merge.

@teng-lin teng-lin merged commit 8bcbef9 into main Jun 12, 2026
32 checks passed
@teng-lin teng-lin deleted the feat/deep-research-completed-cassette branch June 12, 2026 13:49
teng-lin added a commit that referenced this pull request Jun 12, 2026
…cipe (#1567)

* docs: document the long-running (deep-research) cassette re-record recipe

Capture the non-obvious snags from the v0.8 full-lifecycle Deep Research
re-record (PR #1566) under docs/development.md "Long-running recordings":
record-mode httpx tuning to dodge httpx.PoolTimeout (timeout=60 +
ConnectionLimits(keepalive_expiry=10); async_client_factory is not public),
@pytest.mark.timeout(3600), the deep start() task_id != poll-reported id
NOT_FOUND trap, byte-exact YAML trimming (yaml.safe_dump breaks Windows
parsing), and credential-grep beyond the name-anchored scrubber.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2 participants