test(research): record full-lifecycle Deep Research cassette reaching completed#1566
Conversation
… completed The prior `research_deep_poll_long.yaml` was scoped to START_DEEP_RESEARCH + 6 in-progress POLL_RESEARCH calls and never reached a terminal state, leaving `research wait`-to-completion uncovered. Two earlier full-lifecycle recording attempts died with `httpx.PoolTimeout` ~15-20 min into the idle poll window. This re-records the full lifecycle to `completed`: - Poll-until-terminal loop: polls `research.poll` every 30s until a terminal `ResearchStatus` (COMPLETED/FAILED), capped at 80 polls. Replay asserts the final status is `ResearchStatus.COMPLETED` via enum membership, and a cassette-inspection test decodes the last POLL body through the real decoder + parser to confirm the terminal state independently of the client. - PoolTimeout-resilient recording client (record mode only): a generous 60s pool-acquire/read timeout plus a short 10s keepalive_expiry so an idle connection across the multi-minute poll gap is closed proactively and a fresh one is opened on the next poll. Both knobs go through the public constructor (`timeout=` -> httpx pool timeout; `limits=ConnectionLimits( keepalive_expiry=...)`), so no private seam is needed and replay is unaffected. A `@pytest.mark.timeout(3600)` overrides the global 60s pytest-timeout for the long live recording. - Task-id pinning fix: Deep Research's `start()` task_id does NOT equal the poll-reported task_id (verified live), so the loop mirrors `wait_for_completion` — first poll unfiltered, then pins the POLL-reported id. Threading start's id made every poll return NOT_FOUND. - Trimming: the full recording's accumulating in_progress report bodies blew past a usable replay time (the cassette scrubber is super-linear on the large markdown payloads). Trimmed to the in_progress -> completed skeleton (3 polls) plus the create/add_text/start/delete lifecycle. Final cassette is 0.815 MB (well under the 5 MB cap); replay runs in ~12s. Cassette scrubbed on record and verified clean: cassette guards pass and a manual grep found no live cookies, OAuth/Bearer tokens, API keys, account email, or SAPISIDHASH. The scratch notebook is self-cleaned via try/finally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@claude review |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis PR refactors a Deep Research integration test to capture the full poll-until-terminal lifecycle instead of fixed-iteration polling, adds record-mode client hardening, implements a ChangesDeep Research Polling Lifecycle Test
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Claude finished @teng-lin's task in 3m 4s —— View job Code Review — PR #1566: Full-lifecycle Deep Research cassette
This is a well-designed PR that solves a genuine infrastructure problem (PoolTimeout during recording, task-id mismatch between
|
There was a problem hiding this comment.
Code Review
This pull request updates the integration test test_research_deep_poll_vcr.py to record and replay the full-lifecycle Deep Research polling loop until it reaches a terminal state, replacing the previous scoped-down approach. It addresses previous httpx.PoolTimeout issues during recording by applying resilient client configurations (generous timeouts and short keepalive expiry) only in record mode. Additionally, it updates the polling logic to pin the task ID and assert that the final status is COMPLETED, while providing guidelines for trimming cassettes to stay under the 5 MB size limit. The code reviewer suggested adding a defensive assertion in _poll_until_terminal to verify that the poll response is not None before accessing its attributes, ensuring type safety and fail-fast behavior.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/integration/test_research_deep_poll_vcr.py`:
- Around line 413-452: The test must assert the cassette exercised the
pinned-task-id path by verifying a non-empty poll.task_id appears in a non-final
poll and is reused later; update the loop that tracks seen_poll_task_id (and/or
introduce a counter like pinned_task_id_count) to record when a non-final poll
exposes a task_id and assert afterwards that there are at least two polls with
the same non-empty task_id (i.e. the seen_poll_task_id was observed on a prior
non-terminal poll and again on a later poll), while keeping the existing
terminal-status checks (final, _TERMINAL_STATUSES, ResearchStatus.COMPLETED)
intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 94e99ab5-2021-4df2-b17c-cd63799c47e1
📒 Files selected for processing (2)
tests/cassettes/research_deep_poll_long.yamltests/integration/test_research_deep_poll_vcr.py
The first trim round-tripped the cassette through yaml.safe_dump, which re-wrapped the long URL-encoded request/response scalars differently from VCR's native serializer (max line 224167 vs 222932). On Windows CI the re-wrapped scalars parsed to slightly different bytes, so the CREATE_NOTEBOOK (CCqFvf) request no longer matched its recorded entry and the replay raised CannotOverwriteExistingCassetteException (Linux/macOS were unaffected). Re-trim by slicing the VCR-native recording at interaction boundaries (no yaml re-emission), so every kept interaction is a byte-exact slice of the original recording. Verified: kept create/completed-poll/delete interactions are identical to the source recording, cassette guards pass, replay is green (~12s, 0.855 MB), and the cassette has no live credentials. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Require the cassette to exercise the task-id pinning path: assert at least two polls surface the same non-empty task_id, so a later filtered poll reuses the captured id (CodeRabbit: a cassette where task_id only appears on the terminal poll would skip the pinned-filter path). The recorded cassette has 3 in_progress/completed polls all carrying the same task_id. - Add a defensive `assert poll is not None` in `_poll_until_terminal` to keep the fail-fast invariant explicit before attribute access (gemini). - Document in `_poll_until_terminal` that NOT_FOUND / NO_RESEARCH are not terminal and continue the loop by design (claude review). - Drop the redundant `@pytest.mark.vcr` from the test method; it is already applied module-wide via `pytestmark` (claude review). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for the review @claude. Addressed both actionable items in 7627b68:
The |
|
Claude finished @teng-lin's task in 46s —— View job Follow-up verification — 7627b68
Both items from the earlier review are correctly in place in 7627b68. 1. Only ``COMPLETED`` and ``FAILED`` break the loop. ``NOT_FOUND`` /
``NO_RESEARCH`` are NOT treated as terminal — they continue polling
(mirrors ``wait_for_completion``'s replication-lag policy: a pinned task
temporarily absent from a poll is a transient condition, not an end
state), bounded by :data:`_MAX_POLLS`.Exactly what was requested. The wording explains the why (replication-lag policy, transient condition) rather than just stating the behaviour, which is what makes it useful in a debug session. 2. Redundant # ``pytest.mark.vcr`` is applied module-wide via ``pytestmark`` — no need
# to repeat it here.
@pytest.mark.timeout(_RECORD_TEST_TIMEOUT_SECONDS)
@pytest.mark.asyncio
@notebooklm_vcr.use_cassette(CASSETTE_NAME)
async def test_deep_research_polling_loop(self, fast_sleep: None) -> None:Gone, and the inline comment makes the intention explicit for future readers. Bonus: gemini-code-assist and CodeRabbit items also landed cleanly
Everything from the review is addressed. The PR looks good to merge. |
…cipe (#1567) * docs: document the long-running (deep-research) cassette re-record recipe Capture the non-obvious snags from the v0.8 full-lifecycle Deep Research re-record (PR #1566) under docs/development.md "Long-running recordings": record-mode httpx tuning to dodge httpx.PoolTimeout (timeout=60 + ConnectionLimits(keepalive_expiry=10); async_client_factory is not public), @pytest.mark.timeout(3600), the deep start() task_id != poll-reported id NOT_FOUND trap, byte-exact YAML trimming (yaml.safe_dump breaks Windows parsing), and credential-grep beyond the name-anchored scrubber. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Apply suggestion from @gemini-code-assist[bot] Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Summary
The committed
tests/cassettes/research_deep_poll_long.yamlwas deliberately scoped toSTART_DEEP_RESEARCH+ 6 in-progressPOLL_RESEARCHcalls and never reached a terminal state, soresearch wait-to-completion against a real deep outcome was uncovered. Two earlier full-lifecycle recording attempts (2026-05-15) died withhttpx.PoolTimeout~15-20 min into the idle poll window.This PR re-records the full Deep Research lifecycle to
completedand wires it into the replay test.What changed
Poll-until-terminal restructure
_poll_until_terminalreplaces the old fixed-count_poll_n_times: it pollsresearch.pollevery 30s until a terminalResearchStatus(COMPLETED/FAILED), capped at 80 polls (~40 min) so a hung run can't spin forever. The replay test now asserts the final status isResearchStatus.COMPLETED(enum membership, not a string), and a separate cassette-inspection test decodes the lastPOLL_RESEARCHbody through the project's owndecode_response+parse_research_task_modelsto confirm the terminal state independently of the live client.PoolTimeout-resilient recording client (record mode only)
The previous
PoolTimeoutwas stale-connection reuse: the default 30s keepalive let the pool hand back a half-dead connection on a later poll across the multi-minute idle gap. In record mode the client is now built with a generous 60s pool-acquire/read timeout and a short 10skeepalive_expiry, so an idle connection is closed proactively and a fresh one is opened on the next poll. Both knobs go through the public constructor (timeout=→ httpx'spoolacquire timeout;limits=ConnectionLimits(keepalive_expiry=...)), so no private seam is needed and replay is unaffected. A@pytest.mark.timeout(3600)overrides the global 60spytest-timeoutfor the long live recording.Task-id pinning fix (load-bearing)
Verified live: Deep Research's
start()task_id does not equal the task_id the poll endpoint reports for the same task. Threading start's id made every poll returnNOT_FOUND. The loop now mirrorswait_for_completion— first poll unfiltered, then pins the POLL-reported id forward.Trimming + scrubbing
The full recording's accumulating in-progress report bodies grew to ~400 KB (the cassette scrubber is super-linear on large markdown payloads, which blew up replay time). Trimmed to the
in_progress → completedskeleton (3 polls) plus thecreate / add_text / start / deletelifecycle. The cassette was scrubbed on record and verified clean — cassette guards pass and a manual grep found no live cookies, OAuth/Bearer tokens, API keys, account email, orSAPISIDHASH. The scratch notebook self-cleans viatry/finally.Results
completed(no_research/in_progress → completed; PoolTimeout never recurred).tests/_guardrails/(1022 tests),mypy src/notebooklm, andpre-commit run --all-filesall clean.🤖 Generated with Claude Code
Summary by CodeRabbit