
docs: Add guide about integrating Stagehand #1290

Merged on Jul 21, 2025 (10 commits)

Conversation

@Mantisus (Collaborator) commented Jul 8, 2025

Description

  • Add a guide on integrating stagehand-python v0.4.0

Issues

@Mantisus Mantisus requested review from vdusek and Pijukatel July 8, 2025 02:31
@Mantisus (Collaborator, Author) commented Jul 8, 2025

I had to use cast to avoid bloating the guide with code that exists only for the sake of typing.
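For readers unfamiliar with the trade-off mentioned here, the following is a minimal sketch of how typing.cast keeps an example short. It is not the code from the guide: PageLike, FakeStagehandPage, and make_page are hypothetical names used only for illustration.

```python
from typing import Protocol, cast


class PageLike(Protocol):
    """Hypothetical interface the rest of the code expects."""

    def title(self) -> str: ...


class FakeStagehandPage:
    """Hypothetical object that behaves like a page but is not declared as one."""

    def title(self) -> str:
        return 'Example Domain'


def make_page() -> object:
    # The return type is deliberately loose, as it might be in a third-party API.
    return FakeStagehandPage()


def page_title(page: PageLike) -> str:
    return page.title()


raw = make_page()
# cast() performs no runtime check or conversion; it only tells the type
# checker to treat `raw` as PageLike, avoiding extra adapter code.
page = cast(PageLike, raw)
print(page_title(page))
```

The cost of this shortcut is that the type checker can no longer catch a real mismatch, which is why it suits a guide better than library code.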

@Mantisus Mantisus self-assigned this Jul 8, 2025

@Pijukatel (Collaborator) left a comment

Cool tool and nice guide. I just have a few small comments about the CrawleeStagehandPage wrapper.

@vdusek (Collaborator) left a comment

Good job Max!

The integration itself is not as easy as I expected. Maybe this could show us the direction in which we could improve/simplify the browsers/Playwright-related interface.

And/or we could introduce a dedicated crawler for this directly in Crawlee, something like PlaywrightStagehandCrawler. Then the guide could focus solely on its usage, showing how to use AI-based selectors for web scraping.

Let's further discuss it with @B4nan and maybe @janbuchar once they're back from their vacations.

@@ -0,0 +1,66 @@
---

Collaborator commented:

I suppose we can't use "Run on Apify" for these examples, as they contain more than one file.

Collaborator, Author replied:

That's right.

And squeezed into one file, it would look very cumbersome.

@Mantisus (Collaborator, Author) commented Jul 8, 2025

> The integration itself is not as easy as I expected. Maybe this could show us the direction in which we could improve/simplify the browsers/Playwright-related interface.

I think the integration comes out more complicated because of the current Stagehand API. It's a wrapper around Playwright, and the documentation says it's the same Playwright but with AI capabilities, yet the current code doesn't match that.

I hope that they will improve their API, and then the guide can be simplified.
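To illustrate the kind of mismatch described here, below is a rough sketch of a delegating wrapper that exposes AI methods and plain page methods through a single object. It assumes an AI-enabled page that keeps the underlying browser page in a private `_page` attribute. All classes (FakePlaywrightPage, FakeAIPage, DelegatingPage) and the `act` method on the fake are stand-ins for illustration, not the actual Stagehand or Crawlee code.

```python
import asyncio


class FakePlaywrightPage:
    """Stand-in for a browser page (hypothetical, for illustration only)."""

    async def goto(self, url: str) -> str:
        return f'navigated to {url}'


class FakeAIPage:
    """Stand-in for an AI-enabled page that hides the real page in `_page`."""

    def __init__(self, page: FakePlaywrightPage) -> None:
        self._page = page

    async def act(self, instruction: str) -> str:
        return f'performed: {instruction}'


class DelegatingPage:
    """Expose AI methods and plain page methods through one object."""

    def __init__(self, ai_page: FakeAIPage) -> None:
        self._ai_page = ai_page

    def __getattr__(self, name: str):
        # Invoked only when normal attribute lookup fails.
        # Private names are never delegated, which also avoids recursion.
        if name.startswith('_'):
            raise AttributeError(name)
        try:
            # Prefer the AI page's own methods.
            return getattr(self._ai_page, name)
        except AttributeError:
            # Fall back to the wrapped browser page.
            return getattr(self._ai_page._page, name)


async def main() -> tuple:
    page = DelegatingPage(FakeAIPage(FakePlaywrightPage()))
    navigated = await page.goto('https://example.com')
    acted = await page.act('click the button')
    return navigated, acted


results = asyncio.run(main())
print(results)
```

Dynamic delegation keeps the wrapper short, but a type checker cannot see the forwarded methods, which is one reason such a wrapper may end up paired with cast.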

@janbuchar (Collaborator) left a comment

Looks OK to me, just some minor things to address at will.


self._total_opened_pages += 1

# Wrap StagehandPage to provide Playwright Page interface

Collaborator commented:

This comment seems inaccurate.

Comment on lines +59 to +68
pw_page = page._page # noqa: SLF001

# Handle page close event
pw_page.on(event='close', f=self._on_page_close)

# Update internal state
self._pages.append(pw_page)
self._last_page_opened_at = datetime.now(timezone.utc)

self._total_opened_pages += 1

Collaborator commented:

There is quite a bit of code copied over from PlaywrightBrowserController, isn't there? Any chance we could improve the PlaywrightBrowserController internal API so that integrating libraries that extend Playwright is easier?

Collaborator, Author replied:

Perhaps context creation should be moved into a separate public method, as should the state updates. That would make this kind of integration a bit cleaner.

But I would say that the main problem with this integration is that you have to do things like pw_page = page._page.
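The refactoring idea floated in this thread could look roughly like the sketch below: page creation and state updates become separate public methods, so a subclass overrides only the creation step instead of copying the bookkeeping. Everything here is hypothetical (BaseController, AIController, FakePage, AIPage, and the method names); it is not the real PlaywrightBrowserController API.

```python
import asyncio
from datetime import datetime, timezone
from typing import List, Optional


class FakePage:
    """Stand-in for a browser page (hypothetical)."""


class AIPage(FakePage):
    """Stand-in for a page type provided by an extension library (hypothetical)."""


class BaseController:
    """Hypothetical controller; not the real PlaywrightBrowserController."""

    def __init__(self) -> None:
        self.pages: List[FakePage] = []
        self.total_opened_pages = 0
        self.last_page_opened_at: Optional[datetime] = None

    async def create_page(self) -> FakePage:
        """Hook: subclasses override only this step."""
        return FakePage()

    def register_page(self, page: FakePage) -> None:
        """Shared bookkeeping, kept in one public method."""
        self.pages.append(page)
        self.last_page_opened_at = datetime.now(timezone.utc)
        self.total_opened_pages += 1

    async def new_page(self) -> FakePage:
        # Template method: create the page, then update internal state.
        page = await self.create_page()
        self.register_page(page)
        return page


class AIController(BaseController):
    """Integrates a different page type without duplicating the bookkeeping."""

    async def create_page(self) -> FakePage:
        return AIPage()


controller = AIController()
page = asyncio.run(controller.new_page())
print(type(page).__name__, controller.total_opened_pages)
```

Note that this only addresses the duplicated bookkeeping; it does not remove the need to reach into a private attribute such as page._page, which the comment above identifies as the main problem.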

@vdusek (Collaborator) left a comment

It's great, just a few comments 🙂

Mantisus and others added 5 commits July 21, 2025 15:38
@Mantisus Mantisus requested a review from vdusek July 21, 2025 13:05
@vdusek vdusek merged commit 439d81e into apify:master Jul 21, 2025
19 checks passed
Successfully merging this pull request may close these issues.

Integrate Stagehand into PlaywrightCrawler
4 participants