Commit b02b2dd

add documentation
1 parent 74e5c89 commit b02b2dd

4 files changed

+361
-48
lines changed

docs/advanced/additional_requests.rst

+278
@@ -0,0 +1,278 @@
.. _`advanced-requests`:

===================
Additional Requests
===================

Websites nowadays need many page interactions to display or load some key
information. In most cases, these are done via AJAX requests. Some examples are:

* Clicking a button on a page to reveal other similar products.
* Clicking the `"Load More"` button to retrieve more images of a given item.
* Scrolling to the bottom of the page to load more items `(i.e. infinite scrolling)`.
* Hovering over an element to reveal a tooltip containing additional page info.

As such, performing additional requests inside Page Objects is inevitable to
properly extract data from some websites.

.. warning::

    Additional requests made inside a Page Object aren't meant to represent
    the **Crawling Logic** at all. They are simply a low-level way to interact
    with today's websites, which rely on many page interactions to display
    their contents.


HttpClient
==========

The main interface for executing additional requests is :class:`~.HttpClient`.
It fully supports :mod:`asyncio`, enabling developers to perform
the additional requests asynchronously.

Let's look at a few quick examples to see how it's used in practice.

A simple ``GET`` request
------------------------

.. code-block:: python

    import attr
    import web_poet


    @attr.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient

        async def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "product_id": self.css("#product ::attr(product-id)").get(),
            }

            # Simulates clicking on a button that says "View All Images"
            response: web_poet.ResponseData = await self.http_client.get(
                f"https://api.example.com/v2/images?id={item['product_id']}"
            )
            page = web_poet.WebPage(response)

            item["images"] = page.css(".product-images img::attr(src)").getall()
            return item

There are a few things to note in this example:

* A ``GET`` request can be made via :class:`~.HttpClient`'s
  :meth:`~.HttpClient.get` method.
* ``to_item()`` is now declared with ``async def`` so that it can ``await``
  the additional request.
* The response is of type :class:`~.ResponseData`.

  * To use :meth:`~.ResponseShortcutsMixin.css`
    `(and other shortcut methods)`, we need to feed the response into
    :class:`~.WebPage`.

As the example suggests, we're performing an additional request that allows us
to extract more images from a product page than would otherwise be possible.
This is because an additional button needs to be clicked, which fetches the
complete set of product images via AJAX.

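The request-then-parse flow described above can be tried without a network
connection. The following is a minimal sketch and not web-poet code:
``FakeHttpClient``, ``ImageLinkParser``, and ``fetch_image_urls`` are hypothetical
stand-ins that only mimic the shape of an awaited ``get()`` call followed by
parsing of the returned HTML.

```python
import asyncio
from html.parser import HTMLParser
from typing import List


class FakeHttpClient:
    """Hypothetical stand-in for an HTTP client; returns canned HTML."""

    async def get(self, url: str) -> str:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return (
            '<div class="product-images">'
            '<img src="/img/1.jpg"><img src="/img/2.jpg">'
            "</div>"
        )


class ImageLinkParser(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


async def fetch_image_urls(client: FakeHttpClient, product_id: str) -> List[str]:
    # The additional request: in a browser this would be triggered by a click.
    html = await client.get(f"https://api.example.com/v2/images?id={product_id}")
    parser = ImageLinkParser()
    parser.feed(html)
    return parser.sources


image_urls = asyncio.run(fetch_image_urls(FakeHttpClient(), "123"))
print(image_urls)  # ['/img/1.jpg', '/img/2.jpg']
```

Because the client is passed in as an argument, the same function could be
exercised with a canned client in tests and a real one in production.
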
A ``POST`` request with ``headers`` and ``body``
------------------------------------------------

Let's see another example, which needs ``headers`` and ``body`` data to perform
the additional request.

In this example, we'll paginate related items in a carousel. Websites usually
load these lazily to avoid rendering information in the DOM that many users
would never view anyway.

Thus, an additional request inside the Page Object is typically needed:

.. code-block:: python

    import json
    from typing import List

    import attr
    import web_poet


    @attr.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient

        async def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "product_id": self.css("#product ::attr(product-id)").get(),
                "related_product_ids": self.parse_related_product_ids(self),
            }

            # Simulates "scrolling" through a carousel that loads related product items
            response: web_poet.ResponseData = await self.http_client.post(
                url="https://www.api.example.com/related-products/",
                headers={
                    'Host': 'www.example.com',
                    'Content-Type': 'application/json; charset=UTF-8',
                },
                body=json.dumps(
                    {
                        "Page": 2,
                        "ProductID": item["product_id"],
                    }
                ),
            )
            second_page = web_poet.WebPage(response)

            # Append the second page's IDs to those found on the first page
            related_product_ids = self.parse_related_product_ids(second_page)
            item["related_product_ids"].extend(related_product_ids)
            return item

        @staticmethod
        def parse_related_product_ids(page: web_poet.WebPage) -> List[str]:
            return page.css("#main .related-products ::attr(product-id)").getall()

Here's the key takeaway from this example:

* Similar to :class:`~.HttpClient`'s :meth:`~.HttpClient.get` method,
  a :meth:`~.HttpClient.post` method is also available. It is
  typically used to submit forms.

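The ``headers`` and ``body`` arguments have to agree with each other: a JSON
``Content-Type`` calls for a body that is serialized JSON text. Here is a small
sketch of assembling both, using only the standard library.
``build_related_products_payload`` is a hypothetical helper, not part of
web-poet.

```python
import json
from typing import Dict, Tuple


def build_related_products_payload(product_id: str, page: int) -> Tuple[Dict[str, str], str]:
    """Return the (headers, body) pair for one carousel page."""
    headers = {"Content-Type": "application/json; charset=UTF-8"}
    # json.dumps() returns a str, so the body is sent as serialized JSON text.
    body = json.dumps({"Page": page, "ProductID": product_id})
    return headers, body


headers, body = build_related_products_payload("123", page=2)
print(body)  # {"Page": 2, "ProductID": "123"}
```

Keeping this in one helper means every page of the carousel is requested with
the same headers and the same body layout.
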
Batch requests
--------------

We can also choose to process requests in **batches** instead of sequentially.
Let's modify the example in the previous section to see how it can be done:

.. code-block:: python

    import json
    from typing import List

    import attr
    import web_poet


    @attr.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient

        default_pagination_limit = 10

        async def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "product_id": self.css("#product ::attr(product-id)").get(),
                "related_product_ids": self.parse_related_product_ids(self),
            }

            # Build all the requests first; nothing is downloaded at this point
            requests: List[web_poet.Request] = [
                self.create_request(item["product_id"], page_num=page_num)
                for page_num in range(2, self.default_pagination_limit)
            ]
            responses: List[web_poet.ResponseData] = await self.http_client.batch_requests(*requests)
            pages = map(web_poet.WebPage, responses)
            related_product_ids = [
                product_id
                for page in pages
                for product_id in self.parse_related_product_ids(page)
            ]

            item["related_product_ids"].extend(related_product_ids)
            return item

        def create_request(self, product_id, page_num=2):
            # Simulates "scrolling" through a carousel that loads related product items
            return web_poet.Request(
                url="https://www.api.example.com/product-pagination/",
                method="POST",
                headers={
                    'Host': 'www.example.com',
                    'Content-Type': 'application/json; charset=UTF-8',
                },
                body=json.dumps(
                    {
                        "Page": page_num,
                        "ProductID": product_id,
                    }
                ),
            )

        @staticmethod
        def parse_related_product_ids(page: web_poet.WebPage) -> List[str]:
            return page.css("#main .related-products ::attr(product-id)").getall()

The key takeaways from this example are:

* A :class:`~.Request` can be instantiated to represent a generic HTTP request.
  It only holds the request information and isn't executed on creation.
  This is useful for writing factory methods that build requests without
  triggering any download.
* :class:`~.HttpClient` has a :meth:`~.HttpClient.batch_requests` method that
  can process a series of :class:`~.Request` instances.

  * Note that it can accept different types of :class:`~.Request` that might
    not be related *(e.g. a mixture of* ``GET`` *and* ``POST`` *requests)*.
    Processing them in a batch takes advantage of async execution.

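How ``batch_requests`` schedules its work isn't described in this section. To
illustrate why batching can help, here is a sketch that runs several simulated
downloads concurrently with :func:`asyncio.gather`. ``FakeRequest``,
``fake_download``, and ``run_batch`` are hypothetical names, and this should not
be read as web-poet's implementation.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import List


@dataclass
class FakeRequest:
    """Hypothetical request description; creating it downloads nothing."""

    url: str
    method: str = "GET"


async def fake_download(request: FakeRequest) -> str:
    await asyncio.sleep(0.1)  # stands in for network latency
    return f"{request.method} {request.url}"


async def run_batch(requests: List[FakeRequest]) -> List[str]:
    # gather() runs the downloads concurrently and keeps the input order.
    return await asyncio.gather(*(fake_download(r) for r in requests))


requests = [
    FakeRequest("https://example.com/a"),
    FakeRequest("https://example.com/b", method="POST"),
    FakeRequest("https://example.com/c"),
]
started = time.perf_counter()
responses = asyncio.run(run_batch(requests))
elapsed = time.perf_counter() - started

print(responses)
print(f"took {elapsed:.2f}s for {len(requests)} requests")
```

Since the three simulated downloads wait at the same time, the total duration
is close to that of a single download rather than the sum of all three.
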
.. _advanced-downloader-impl:

Downloader Implementation
=========================

Please note that on its own, :class:`~.HttpClient` doesn't do anything: it doesn't
know how to execute a request. Thus, frameworks or projects
that want to use additional requests in Page Objects need to provide the
implementation that downloads a :class:`~.Request`.

For more info on this, see the API specification for :class:`~.HttpClient`.

Frameworks that wish to support **web-poet** can provide the
HTTP downloader implementation in two ways:

.. _setup-contextvars:

1. Context Variable
-------------------

:mod:`contextvars` is natively supported in :mod:`asyncio` for setting and
accessing context-aware values. This means that the framework using **web-poet**
can easily assign the implementation using the readily available
:class:`contextvars.ContextVar` instance named ``web_poet.request_backend_var``.

This can be set using:

.. code-block:: python

    import web_poet
    from web_poet import request_backend_var


    def request_implementation(r: web_poet.Request) -> web_poet.ResponseData:
        ...


    request_backend_var.set(request_implementation)

Setting this up allows a :class:`~.HttpClient` instance to access the request
implementation, which it uses by default.

.. warning::

    If no value for ``web_poet.request_backend_var`` was set, then a
    :class:`~.RequestBackendError` is raised. However, no exception is
    raised if **option 2** below is used.


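The mechanism behind this option can be sketched with the standard library
alone. In the example below, ``backend_var``, ``BackendNotSetError``,
``execute``, and ``fake_backend`` are hypothetical names chosen for
illustration; they only mirror the behaviour described above and are not
web-poet's own code.

```python
import contextvars


class BackendNotSetError(Exception):
    """Hypothetical error raised when no backend has been configured."""


# Hypothetical variable, analogous in spirit to web_poet.request_backend_var.
backend_var = contextvars.ContextVar("backend_var")


def execute(url: str) -> str:
    try:
        backend = backend_var.get()
    except LookupError:
        # ContextVar.get() raises LookupError when no value and no default exist.
        raise BackendNotSetError("no request backend configured") from None
    return backend(url)


def fake_backend(url: str) -> str:
    return f"downloaded {url}"


# Before the framework sets a backend, executing a request fails.
try:
    execute("https://example.com")
    failed = False
except BackendNotSetError:
    failed = True

# After the framework sets it, the same call succeeds.
backend_var.set(fake_backend)
result = execute("https://example.com")
print(failed, result)  # True downloaded https://example.com
```
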
2. Dependency Injection
-----------------------

The framework using **web-poet** might be using other libraries that don't
fully support :mod:`contextvars` `(e.g. Twisted)`. In that case, an
alternative approach is to supply the request implementation when creating
an :class:`~.HttpClient` instance:

.. code-block:: python

    import web_poet
    from web_poet import HttpClient


    def request_implementation(r: web_poet.Request) -> web_poet.ResponseData:
        ...


    http_client = HttpClient(request_downloader=request_implementation)
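
To show how the two options might coexist, here is a sketch of a client that
uses an injected downloader when one is given and otherwise falls back to a
context variable. ``MiniClient``, ``default_downloader_var``, and both downloader
functions are hypothetical, and the precedence shown (injection first) is an
assumption made for this example rather than documented web-poet behaviour.

```python
import contextvars
from typing import Callable, Optional

Downloader = Callable[[str], str]

# Hypothetical context variable holding a framework-wide downloader.
default_downloader_var = contextvars.ContextVar("default_downloader_var")


class MiniClient:
    """Hypothetical client: an injected downloader wins over the context variable."""

    def __init__(self, request_downloader: Optional[Downloader] = None) -> None:
        self._injected = request_downloader

    def fetch(self, url: str) -> str:
        # Fall back to the context variable only when nothing was injected.
        downloader = self._injected or default_downloader_var.get()
        return downloader(url)


def injected_downloader(url: str) -> str:
    return f"injected: {url}"


def context_downloader(url: str) -> str:
    return f"context: {url}"


default_downloader_var.set(context_downloader)

via_injection = MiniClient(request_downloader=injected_downloader).fetch("https://example.com")
via_context = MiniClient().fetch("https://example.com")
print(via_injection)  # injected: https://example.com
print(via_context)    # context: https://example.com
```

With injection, the client never reads the context variable, which is why this
route also suits libraries without full :mod:`contextvars` support.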

docs/api_reference.rst

+7
@@ -45,3 +45,10 @@ Mixins
 .. autoclass:: web_poet.mixins.ResponseShortcutsMixin
    :members:
    :no-special-members:
+
+Requests
+========
+
+.. automodule:: web_poet.requests
+   :members:
+   :undoc-members:

docs/index.rst

+6
@@ -34,6 +34,12 @@ and the motivation behind ``web-poet``, start with :ref:`from-ground-up`.
    intro/tutorial
    intro/from-ground-up
 
+.. toctree::
+   :caption: Advanced
+   :maxdepth: 1
+
+   advanced/additional_requests
+
 .. toctree::
    :caption: Reference
    :maxdepth: 1
