
Commit 396ab8e

Merge pull request #23 from scrapinghub/meta
Introduce Meta as a way to pass information inside a PO
2 parents 355294e + 079ccc1 commit 396ab8e

7 files changed: 159 additions and 3 deletions
CHANGELOG.rst

Lines changed: 2 additions & 0 deletions
@@ -9,6 +9,8 @@ TBR
 * Added support for Python 3.10
 * Added support for performing additional requests using
   ``web_poet.HttpClient``.
+* Introduced ``web_poet.Meta`` to pass arbitrary information
+  inside a Page Object.
 
 
 0.1.1 (2021-06-02)

docs/advanced/additional_requests.rst

Lines changed: 2 additions & 0 deletions
@@ -77,6 +77,8 @@ to extract more images in a product page that might not otherwise be possible.
 This is because in order to do so, an additional button needs to be clicked
 which fetches the complete set of product images via AJAX.
 
+.. _`request-post-example`:
+
 A ``POST`` request with `header` and `body`
 -------------------------------------------
 

docs/advanced/meta.rst

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
+.. _`advanced-meta`:
+
+============================
+Passing information via Meta
+============================
+
+In some cases, Page Objects might require additional information to be passed to
+them. Such information can dictate the behavior of the Page Object or affect its
+data entirely depending on the needs of the developer.
+
+If you recall from the previous basic tutorials, one essential requirement of
+Page Objects that inherit from :class:`~.WebPage` or :class:`~.ItemWebPage` is
+:class:`~.ResponseData`. This holds the HTTP response information that the
+Page Object is trying to represent.
+
+In order to standardize how to pass arbitrary information inside Page Objects,
+we'll need to use :class:`~.Meta`, similar to how we use :class:`~.ResponseData`
+as a requirement to instantiate Page Objects:
+
+.. code-block:: python
+
+    import attr
+    import web_poet
+
+    @attr.define
+    class SomePage(web_poet.ItemWebPage):
+        # ResponseData is inherited from ItemWebPage
+        meta: web_poet.Meta
+
+    response = web_poet.ResponseData(...)
+    meta = web_poet.Meta({"arbitrary_value": 1234, "cool": True})
+
+    page = SomePage(response=response, meta=meta)
+
+However, as with :class:`~.ResponseData`, developers using :class:`~.Meta`
+shouldn't care about how it is passed into Page Objects. This will
+depend on the framework that uses **web-poet**.
+
+Let's check out some examples of how to use it inside a Page Object.
+
+Controlling item values
+-----------------------
+
+.. code-block:: python
+
+    import attr
+    import web_poet
+
+
+    @attr.define
+    class ProductPage(web_poet.ItemWebPage):
+        meta: web_poet.Meta
+
+        default_tax_rate = 0.10
+
+        def to_item(self):
+            item = {
+                "url": self.url,
+                "name": self.css("#main h3.name ::text").get(),
+                "price": self.css("#main .price ::text").get(),
+            }
+            self.calculate_price_with_tax(item)
+            return item
+
+        def calculate_price_with_tax(self, item):
+            # Assumes the scraped price text is a plain number such as "9.99".
+            tax_rate = self.meta.get("tax_rate") or self.default_tax_rate
+            item["price_with_tax"] = float(item["price"]) * (1 + tax_rate)
+
+
+In the example above, we provide optional information about
+the **tax rate** of the product. This could be useful when trying to support
+the different tax rates for each state or territory. However, since we're treating
+the **tax_rate** as optional information, notice that we also have
+``default_tax_rate`` as a backup value just in case it's not available.
+
+
+Controlling Page Object behavior
+--------------------------------
+
+Let's try an example wherein :class:`~.Meta` is able to control how
+:ref:`advanced-requests` are being used. Specifically, we are going to use
+:class:`~.Meta` to control the number of pages requested.
+
+.. code-block:: python
+
+    from typing import List
+
+    import attr
+    import web_poet
+
+
+    @attr.define
+    class ProductPage(web_poet.ItemWebPage):
+        http_client: web_poet.HttpClient
+        meta: web_poet.Meta
+
+        default_max_pages = 5
+
+        async def to_item(self):
+            return {"product_urls": await self.get_product_urls()}
+
+        async def get_product_urls(self) -> List[str]:
+            # Simulates scrolling to the bottom of the page to load the next
+            # set of items in an "Infinite Scrolling" category list page.
+            max_pages = self.meta.get("max_pages") or self.default_max_pages
+            requests = [
+                self.create_next_page_request(page_num)
+                for page_num in range(2, max_pages + 1)
+            ]
+            responses = await self.http_client.batch_requests(*requests)
+            pages = [self] + list(map(web_poet.WebPage, responses))
+            return [
+                product_url
+                for page in pages
+                for product_url in self.parse_product_urls(page)
+            ]
+
+        @staticmethod
+        def create_next_page_request(page_num):
+            next_page_url = f"https://example.com/category/products?page={page_num}"
+            return web_poet.Request(url=next_page_url)
+
+        @staticmethod
+        def parse_product_urls(page):
+            return page.css("#main .products a.link ::attr(href)").getall()
+
+In the example above, we can see how :class:`~.Meta` is able to
+limit the pagination behavior by passing an optional **max_pages** value. Take
+note that a ``default_max_pages`` value is also present in the Page Object in
+case the :class:`~.Meta` instance did not provide it.
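
Both examples in this file resolve an optional value with `self.meta.get(key) or default`. The following is a minimal standalone sketch of that fallback, not part of the commit. It uses a stand-in `Meta` class defined locally so that it runs without web-poet installed, and the helper names `resolve_tax_rate`, `resolve_tax_rate_strict`, and `next_page_numbers` are hypothetical. It illustrates one edge case worth knowing about: with `or`, a falsy value such as a tax rate of `0` is treated as if it were missing.

```python
class Meta(dict):
    """Stand-in for web_poet.Meta so this sketch runs without web-poet."""


DEFAULT_TAX_RATE = 0.10
DEFAULT_MAX_PAGES = 5


def resolve_tax_rate(meta):
    # Same fallback as the docs: any falsy value (None, 0, 0.0) uses the default.
    return meta.get("tax_rate") or DEFAULT_TAX_RATE


def resolve_tax_rate_strict(meta):
    # Hypothetical alternative: only a missing key falls back to the default.
    return meta.get("tax_rate", DEFAULT_TAX_RATE)


def next_page_numbers(meta):
    # Page 1 is the page already loaded, so extra requests start at page 2.
    max_pages = meta.get("max_pages") or DEFAULT_MAX_PAGES
    return list(range(2, max_pages + 1))


if __name__ == "__main__":
    tax_free = Meta({"tax_rate": 0})
    print(resolve_tax_rate(tax_free))         # default wins over the explicit 0
    print(resolve_tax_rate_strict(tax_free))  # explicit 0 is respected
    print(next_page_numbers(Meta({"max_pages": 3})))
```

Whether an explicit `0` should be honored depends on the use case. If it should, `meta.get(key, default)` is the safer form.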

docs/api_reference.rst

Lines changed: 13 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,8 +6,19 @@ Page Inputs
66
===========
77

88
.. automodule:: web_poet.page_inputs
9-
:members:
10-
:undoc-members:
9+
10+
.. autoclass:: ResponseData
11+
:show-inheritance:
12+
:members:
13+
:undoc-members:
14+
:inherited-members:
15+
:no-special-members:
16+
17+
.. autoclass:: Meta
18+
:show-inheritance:
19+
:members:
20+
:no-special-members:
21+
1122

1223
Pages
1324
=====

docs/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,7 @@ and the motivation behind ``web-poet``, start with :ref:`from-ground-up`.
3939
:maxdepth: 1
4040

4141
advanced/additional_requests
42+
advanced/meta
4243

4344
.. toctree::
4445
:caption: Reference

web_poet/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 from .pages import WebPage, ItemPage, ItemWebPage, Injectable
-from .page_inputs import ResponseData
+from .page_inputs import ResponseData, Meta
 from .requests import request_backend_var, Request, HttpClient

web_poet/page_inputs.py

Lines changed: 9 additions & 0 deletions
@@ -24,7 +24,16 @@ class ResponseData:
 
     ``headers`` should contain the HTTP response headers.
     """
+
     url: str
     html: str
     status: Optional[int] = None
     headers: Optional[Dict[Union[str, ByteString], Any]] = None
+
+
+class Meta(dict):
+    """Container class that could contain any arbitrary data to be passed into
+    a Page Object.
+    """
+
+    pass
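
Because `Meta` is declared as a bare `dict` subclass, it should accept the same constructor forms as `dict` and behave like one afterwards. The sketch below redefines the class locally, mirroring the diff above, so that it runs without web-poet installed. It is an illustration under that assumption, not a statement about later versions of the library.

```python
class Meta(dict):
    """Local copy of the class added in this commit (a plain dict subclass)."""


# The usual dict constructor forms all work.
from_mapping = Meta({"arbitrary_value": 1234, "cool": True})
from_kwargs = Meta(arbitrary_value=1234, cool=True)
from_pairs = Meta([("arbitrary_value", 1234), ("cool", True)])

# All three hold the same data and compare equal to a plain dict.
assert from_mapping == from_kwargs == from_pairs
assert from_mapping == {"arbitrary_value": 1234, "cool": True}

# Standard dict access is available, including defaults for missing keys.
print(from_mapping["arbitrary_value"])
print(from_mapping.get("missing", "fallback"))
print(isinstance(from_mapping, dict))
```

Note that `Meta("arbitrary_value": 1234)` is not valid Python. The key/value pairs have to be wrapped in a mapping or passed as keyword arguments.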
