Commit 025f5b1

Merge pull request #22 from scrapinghub/additional-requests
implementation of additional requests
2 parents 33dbdb5 + 753e6ad commit 025f5b1

16 files changed: +1709 -86 lines changed

CHANGELOG.rst

+4
@@ -5,6 +5,10 @@ Changelog
 TBR
 ------------------

+* Added support for performing additional requests using
+  ``web_poet.HttpClient``.
+* Introduced ``web_poet.Meta`` to pass arbitrary information
+  inside a Page Object.
 * added a ``PageObjectRegistry`` class which has the ``handle_urls`` decorator
   to conveniently declare and collect ``OverrideRule``.
 * removed support for Python 3.6

docs/advanced/additional-requests.rst

+921
Large diffs are not rendered by default.

docs/advanced/meta.rst

+134
@@ -0,0 +1,134 @@
.. _`advanced-meta`:

============================
Passing information via Meta
============================

In some cases, Page Objects might require additional information to be passed to
them. Such information can dictate the behavior of the Page Object or affect the
data it returns, depending on the needs of the developer.

As you may recall from the basic tutorials, Page Objects that inherit from
:class:`~.WebPage` or :class:`~.ItemWebPage` require an :class:`~.HttpResponse`.
It holds the HTTP response information that the Page Object represents.

To standardize how arbitrary information is passed into Page Objects, use
:class:`~.Meta`, similar to how :class:`~.HttpResponse` is used as a requirement
to instantiate Page Objects:

.. code-block:: python

    import attrs
    import web_poet

    @attrs.define
    class SomePage(web_poet.ItemWebPage):
        # The HttpResponse attribute is inherited from ItemWebPage
        meta: web_poet.Meta

    # Assume it's constructed with the necessary arguments obtained elsewhere.
    response = web_poet.HttpResponse(...)

    # It uses Python's dict interface.
    meta = web_poet.Meta({"arbitrary_value": 1234, "cool": True})

    page = SomePage(response=response, meta=meta)

However, as with :class:`~.HttpResponse`, developers using :class:`~.Meta`
shouldn't need to care how it is passed into Page Objects. That depends on the
framework that uses **web-poet**.

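As a rough illustration of the framework's side of this contract, the sketch below shows one way a framework could build a Page Object and hand it a Meta. Everything in it is a stand-in rather than web-poet's actual API: ``Meta`` is modeled as a plain ``dict`` subclass, ``FakeResponse``, ``SomePage`` and ``build_page`` are hypothetical names, and ``dataclasses`` substitutes for ``attrs`` so the snippet runs with only the standard library.

```python
from dataclasses import dataclass
from typing import Optional


class Meta(dict):
    """Stand-in for web_poet.Meta, modeled here as a plain dict subclass."""


@dataclass
class FakeResponse:
    """Hypothetical minimal response; the real HttpResponse carries more."""
    url: str
    body: bytes = b""


@dataclass
class SomePage:
    response: FakeResponse
    meta: Meta


def build_page(page_cls, response, user_meta: Optional[dict] = None):
    """Hypothetical framework hook: the framework, not the Page Object,
    decides where the meta values come from (settings, request data, ...)."""
    return page_cls(response=response, meta=Meta(user_meta or {}))


page = build_page(
    SomePage,
    FakeResponse(url="https://example.com/item/1"),
    {"arbitrary_value": 1234, "cool": True},
)
# When the caller supplies nothing, the Page Object still gets an empty Meta.
empty_page = build_page(SomePage, FakeResponse(url="https://example.com/item/2"))
```

The Page Object only declares ``meta`` as a dependency and reads from it. Whether the values come from a spider argument, a settings file or a request attribute is left to the framework.
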
Let's check out some examples of how to use it inside a Page Object.

Controlling item values
-----------------------

.. code-block:: python

    import attrs
    import web_poet


    @attrs.define
    class ProductPage(web_poet.ItemWebPage):
        meta: web_poet.Meta

        default_tax_rate = 0.10

        def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "price": self.css("#main .price ::text").get(),
            }
            self.calculate_price_with_tax(item)
            return item

        def calculate_price_with_tax(self, item):
            # A regular method rather than a staticmethod, since it reads
            # self.meta and self.default_tax_rate.
            tax_rate = self.meta.get("tax_rate", self.default_tax_rate)
            # The extracted price is text; this assumes a bare number like "19.99".
            item["price_with_tax"] = float(item["price"]) * (1 + tax_rate)


In the example above, we provide optional information about the **tax rate** of
the product. This could be useful when supporting different tax rates for each
state or territory. Since **tax_rate** is optional, the Page Object also keeps a
``default_tax_rate`` as a fallback in case the value isn't provided.


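The fallback arithmetic can be checked in isolation. The following is a minimal sketch, not part of web-poet: ``price_with_tax`` and ``DEFAULT_TAX_RATE`` are hypothetical names, a plain ``dict`` stands in for the Meta instance, and the price text is assumed to be a bare number such as ``"100.00"``.

```python
DEFAULT_TAX_RATE = 0.10  # mirrors default_tax_rate in the example above


def price_with_tax(price_text: str, meta: dict) -> float:
    """Apply meta["tax_rate"] when present, otherwise the default rate."""
    # dict.get with a default keeps an explicit rate of 0, whereas
    # `meta.get("tax_rate") or DEFAULT_TAX_RATE` would silently replace it.
    tax_rate = meta.get("tax_rate", DEFAULT_TAX_RATE)
    return round(float(price_text) * (1 + tax_rate), 2)


with_default = price_with_tax("100.00", {})                # default 10% applied
with_custom = price_with_tax("100.00", {"tax_rate": 0.25})  # caller's 25% applied
tax_free = price_with_tax("100.00", {"tax_rate": 0})        # explicit 0 is respected
```

The choice between ``get(key, default)`` and ``get(key) or default`` matters whenever a falsy value such as ``0`` is a legitimate input, which is plausible for territories without sales tax.
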
Controlling Page Object behavior
--------------------------------

Let's try an example in which :class:`~.Meta` controls how
:ref:`advanced-requests` are used. Specifically, we'll use :class:`~.Meta` to
limit the number of pages requested during pagination.

.. code-block:: python

    from typing import List

    import attrs
    import web_poet


    @attrs.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient
        meta: web_poet.Meta

        default_max_pages = 5

        async def to_item(self):
            return {"product_urls": await self.get_product_urls()}

        async def get_product_urls(self) -> List[str]:
            # Simulates scrolling to the bottom of the page to load the next
            # set of items in an "Infinite Scrolling" category list page.
            max_pages = self.meta.get("max_pages") or self.default_max_pages
            requests = [
                self.create_next_page_request(page_num)
                for page_num in range(2, max_pages + 1)
            ]
            # The client is an attribute of the Page Object, so use self.
            responses = await self.http_client.batch_execute(*requests)
            # parse_product_urls() returns a flat list of URL strings per response.
            return [
                url
                for response in responses
                for url in self.parse_product_urls(response)
            ]

        @staticmethod
        def create_next_page_request(page_num):
            next_page_url = f"https://example.com/category/products?page={page_num}"
            return web_poet.Request(url=next_page_url)

        @staticmethod
        def parse_product_urls(response: web_poet.HttpResponse):
            return response.css("#main .products a.link ::attr(href)").getall()

In the example above, :class:`~.Meta` limits the pagination by passing an
optional **max_pages** value. Note that the Page Object also defines a
``default_max_pages`` value in case the :class:`~.Meta` instance doesn't
provide one.
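
To see the page-limit logic run end to end without a network, here is a self-contained sketch under stated assumptions. ``fetch_page`` is a hypothetical stub that fabricates URLs, ``asyncio.gather`` stands in for the HTTP client's batch call, and a plain ``dict`` stands in for the Meta instance. None of this is web-poet's own API.

```python
import asyncio
from typing import Dict, List

DEFAULT_MAX_PAGES = 5  # mirrors default_max_pages in the example above


async def fetch_page(page_num: int) -> List[str]:
    """Hypothetical stand-in for an HTTP request; returns fake product URLs."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [f"https://example.com/product/{page_num}-{i}" for i in range(2)]


async def get_product_urls(meta: Dict[str, int]) -> List[str]:
    max_pages = meta.get("max_pages") or DEFAULT_MAX_PAGES
    # Page 1 is assumed to be the already-downloaded response, so the
    # additional requests start at page 2.
    pages = await asyncio.gather(
        *(fetch_page(n) for n in range(2, max_pages + 1))
    )
    # Flatten one level: each page yields a list of URL strings.
    return [url for urls in pages for url in urls]


limited = asyncio.run(get_product_urls({"max_pages": 3}))  # pages 2 and 3
default = asyncio.run(get_product_urls({}))                # pages 2 through 5
```

With two fake URLs per page, the limited run covers two pages and the default run covers four, so the only thing the caller changes is the value stored under ``max_pages``.
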

docs/api_reference.rst

+17 -2
@@ -8,8 +8,8 @@ Page Inputs
 ===========

 .. automodule:: web_poet.page_inputs
-   :members:
-   :undoc-members:
+    :members:
+    :undoc-members:

 Pages
 =====
@@ -48,6 +48,21 @@ Mixins
    :members:
    :no-special-members:

+Requests
+========
+
+.. automodule:: web_poet.requests
+    :members:
+    :undoc-members:
+
+Exceptions
+==========
+
+.. automodule:: web_poet.exceptions.core
+    :members:
+
+.. automodule:: web_poet.exceptions.http
+    :members:

 .. _`api-overrides`:

docs/conf.py

+1
@@ -194,4 +194,5 @@
     'scrapy': ('https://docs.scrapy.org/en/latest', None, ),
     'url-matcher': ('https://url-matcher.readthedocs.io/en/stable/', None, ),
     'parsel': ('https://parsel.readthedocs.io/en/latest/', None, ),
+    'multidict': ('https://multidict.readthedocs.io/en/latest/', None, ),
 }

docs/index.rst

+7
@@ -35,6 +35,13 @@ and the motivation behind ``web-poet``, start with :ref:`from-ground-up`.
    intro/from-ground-up
    intro/overrides

+.. toctree::
+   :caption: Advanced
+   :maxdepth: 1
+
+   advanced/additional-requests
+   advanced/meta
+
 .. toctree::
    :caption: Reference
    :maxdepth: 1
