Is there a more efficient way to block unwanted downloads in PlaywrightCrawler? #1534

loic-bellinger · 2025-11-06T13:24:04Z

Hi team

I’m using PlaywrightCrawler (Python) and trying to prevent unwanted PDF/media downloads from URLs that cannot be filtered by the exclude parameter of enqueue_links (or that are only reached after redirects).

Currently, I’m thinking of doing this in a pre_navigation_hook by setting up a route handler:

@crawler.pre_navigation_hook
async def block_download(context: PlaywrightPreNavCrawlingContext) -> None:
    async def block(route):
        if route.request.resource_type == "document":
            await route.abort()
            context.request.no_retry = True
        else:
            await route.continue_()

    await context.page.route("**/*", block)

This seems to work, but I’m not sure it’s ideal:

Is resource_type really reliable?
It defines a new function for every request and requires context.page.route which feels somewhat inefficient.
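
On the first question: as far as I can tell, `resource_type` is assigned when the request is issued, so a navigation that ends in a file download is still reported as `"document"`, the same as an ordinary HTML page. Telling the two apart seems to need the response headers. Below is a minimal sketch of such a check, assuming the headers are available as a plain mapping. `looks_like_download` and `BLOCKED_TYPE_PREFIXES` are hypothetical names, not part of Crawlee or Playwright, and the list of content types is only an illustration.

```python
from collections.abc import Mapping

# Hypothetical list of content types treated as unwanted downloads.
BLOCKED_TYPE_PREFIXES = (
    'application/pdf',
    'application/zip',
    'application/octet-stream',
    'audio/',
    'video/',
)


def looks_like_download(headers: Mapping[str, str]) -> bool:
    """Guess from response headers whether a navigation would trigger a download."""
    # Header names are case-insensitive, so normalise names and values first.
    lowered = {name.lower(): value.strip().lower() for name, value in headers.items()}

    # An explicit "attachment" disposition asks the browser to save the body.
    if lowered.get('content-disposition', '').startswith('attachment'):
        return True

    # Otherwise fall back to the declared content type.
    return lowered.get('content-type', '').startswith(BLOCKED_TYPE_PREFIXES)
```

Where the headers come from is left open here: a `HEAD` probe before navigation or a response listener are both candidates, and each adds its own cost.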

I already pass browser_new_context_options={"accept_downloads": False} when instantiating my crawler so the files are never saved, but I don't want the crawler to spend time attempting the download and then retrying it.

Any help would be appreciated

Below is a script with an unwanted URL that reproduces my issue and makes prototyping faster:

import asyncio

from crawlee.crawlers import (
    PlaywrightCrawler,
    PlaywrightCrawlingContext,
    PlaywrightPreNavCrawlingContext,
)


async def main() -> None:
    crawler = PlaywrightCrawler(
        headless=True,
        browser_type='chromium',
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    @crawler.pre_navigation_hook
    async def block_download(context: PlaywrightPreNavCrawlingContext) -> None:
        context.log.info(f'Navigating to {context.request.url} ...')

        async def block(route):
            if route.request.resource_type == "document":
                await route.abort()
                context.request.no_retry = True
            else:
                await route.continue_()

        await context.page.route("**/*", block)


    await crawler.run(['https://www.enseignementsup-recherche.gouv.fr/media/32517/download'])


if __name__ == '__main__':
    asyncio.run(main())

Pijukatel (Maintainer) · 2025-11-06T15:30:27Z

Hello, I think that using the pre-navigation hooks is the common way to deal with similar problems.
Please take a look at this example code that shows using context.block_requests:
https://crawlee.dev/python/docs/examples/playwright-crawler-with-block-requests
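
For readers who have not seen that example, here is a rough sketch of what such a hook could look like. It assumes `context.block_requests` accepts an `extra_url_patterns` keyword, as in the linked example. `MEDIA_URL_PATTERNS` and `block_media` are hypothetical names, and the context is typed as `Any` only to keep the snippet free of third-party imports.

```python
from typing import Any

# Hypothetical URL fragments to block in addition to the defaults.
MEDIA_URL_PATTERNS = ['.pdf', '.mp4', '.mp3', '.zip']


async def block_media(context: Any) -> None:
    """Pre-navigation hook that blocks requests whose URL matches a pattern."""
    # In a real crawler, `context` would be a PlaywrightPreNavCrawlingContext.
    await context.block_requests(extra_url_patterns=MEDIA_URL_PATTERNS)


# Registration on an existing crawler would then be:
#     crawler.pre_navigation_hook(block_media)
```

This only matches on the URL, so it cannot catch downloads served from URLs without a recognisable pattern.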


loic-bellinger Nov 7, 2025
Author

I should have specified that context.block_requests does not help, since the URLs triggering the downloads do not follow any specific pattern.

Since I cannot properly block navigation to the download, I'm following the docs and disabling retries instead:

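from crawlee.crawlers import BasicCrawlingContext
from crawlee.errors import HttpStatusCodeError, SessionError

@crawler.error_handler
async def retry_handler(context: BasicCrawlingContext, error: Exception) -> None:
    # Retry only session and HTTP status errors; give up on anything else.
    if not isinstance(error, (SessionError, HttpStatusCodeError)):
        context.request.no_retry = True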
