|
| 1 | +.. _`advanced-requests`: |
| 2 | + |
| 3 | +=================== |
| 4 | +Additional Requests |
| 5 | +=================== |
| 6 | + |
| 7 | +Websites nowadays often need page interactions to display or load some key |
| 8 | +information. In most cases, these are done via AJAX requests. Some examples are: |
| 9 | + |
| 10 | + * Clicking a button on a page to reveal other similar products. |
| 11 | + * Clicking the `"Load More"` button to retrieve more images of a given item. |
| 12 | + * Scrolling to the bottom of the page to load more items `(i.e. infinite scrolling)`. |
| 13 | + * Hovering over an element to reveal a tooltip containing additional page info. |
| 14 | + |
| 15 | +As such, performing additional requests inside Page Objects is inevitable to |
| 16 | +properly extract data from some websites. |
| 17 | + |
| 18 | +.. warning:: |
| 19 | + |
| 20 | + Additional requests made inside a Page Object aren't meant to represent |
| 21 | + the **Crawling Logic** at all. They are simply a low-level way to interact |
| 22 | + with today's websites, which rely on a lot of page interactions to display |
| 23 | + their contents. |
| 24 | + |
| 25 | + |
| 26 | +HttpClient |
| 27 | +========== |
| 28 | + |
| 29 | +The main interface for executing additional requests is :class:`~.HttpClient`. |
| 30 | +It has full support for :mod:`asyncio`, enabling developers to perform |
| 31 | +the additional requests asynchronously. |
| 32 | + |
| 33 | +Let's look at a few quick examples of how it's used. |
| 34 | + |
| 35 | +A simple ``GET`` request |
| 36 | +------------------------ |
| 37 | + |
| 38 | +.. code-block:: python |
| 39 | +
|
| 40 | + import attr |
| 41 | + import web_poet |
| 42 | +
|
| 43 | +
|
| 44 | + @attr.define |
| 45 | + class ProductPage(web_poet.ItemWebPage): |
| 46 | + http_client: web_poet.HttpClient |
| 47 | +
|
| 48 | + async def to_item(self): |
| 49 | + item = { |
| 50 | + "url": self.url, |
| 51 | + "name": self.css("#main h3.name ::text").get(), |
| 52 | + "product_id": self.css("#product ::attr(product-id)").get(), |
| 53 | + } |
| 54 | +
|
| 55 | + # Simulates clicking on a button that says "View All Images" |
| 56 | + response: web_poet.ResponseData = await self.http_client.get( |
| 57 | + f"https://api.example.com/v2/images?id={item['product_id']}" |
| 58 | + ) |
| 59 | + page = web_poet.WebPage(response) |
| 60 | +
|
| 61 | + item["images"] = page.css(".product-images img::attr(src)").getall() |
| 62 | + return item |
| 63 | +
|
| 64 | +There are a few things to take note of in this example: |
| 65 | + |
| 66 | + * A ``GET`` request can be done via :class:`~.HttpClient`'s |
| 67 | + :meth:`~.HttpClient.get` method. |
| 68 | + * We're now using the ``async/await`` syntax. |
| 69 | + * The response is of type :class:`~.ResponseData`. |
| 70 | + |
| 71 | + * Though in order to use :meth:`~.ResponseShortcutsMixin.css` |
| 72 | + `(and other shortcut methods)` we'll need to feed it into |
| 73 | + :class:`~.WebPage`. |
| 74 | + |
| 75 | +As the example suggests, we're performing an additional request that allows us |
| 76 | +to extract more images from a product page than would otherwise be possible. |
| 77 | +This is because in order to do so, an additional button needs to be clicked |
| 78 | +which fetches the complete set of product images via AJAX. |
| 79 | + |
| 80 | +A ``POST`` request with `header` and `body` |
| 81 | +------------------------------------------- |
| 82 | + |
| 83 | +Let's see another example which needs ``headers`` and ``body`` data to process |
| 84 | +additional requests. |
| 85 | + |
| 86 | +In this example, we'll paginate related items in a carousel. These are |
| 87 | +usually lazily loaded by the website to reduce the amount of information |
| 88 | +rendered in the DOM that might not otherwise be viewed by all users anyway. |
| 89 | + |
| 90 | +Thus, additional requests inside the Page Object are typically needed: |
| 91 | + |
| 92 | +.. code-block:: python |
| 93 | +
|
| | + import json |
| | + from typing import List |
| | +
|
| 94 | + import attr |
| 95 | + import web_poet |
| 96 | +
|
| 97 | +
|
| 98 | + @attr.define |
| 99 | + class ProductPage(web_poet.ItemWebPage): |
| 100 | + http_client: web_poet.HttpClient |
| 101 | +
|
| 102 | + async def to_item(self): |
| 103 | + item = { |
| 104 | + "url": self.url, |
| 105 | + "name": self.css("#main h3.name ::text").get(), |
| 106 | + "product_id": self.css("#product ::attr(product-id)").get(), |
| 107 | + "related_product_ids": self.parse_related_product_ids(self), |
| 108 | + } |
| 109 | +
|
| 110 | + # Simulates "scrolling" through a carousel that loads related product items |
| 111 | + response: web_poet.ResponseData = await self.http_client.post( |
| 112 | + url="https://www.api.example.com/related-products/", |
| 113 | + headers={ |
| 114 | + 'Host': 'www.example.com', |
| 115 | + 'Content-Type': 'application/json; charset=UTF-8', |
| 116 | + }, |
| 117 | + body=json.dumps( |
| 118 | + { |
| 119 | + "Page": 2, |
| 120 | + "ProductID": item["product_id"], |
| 121 | + } |
| 122 | + ), |
| 123 | + ) |
| 124 | + second_page = web_poet.WebPage(response) |
| 125 | +
|
| 126 | + related_product_ids = self.parse_related_product_ids(second_page) |
| 127 | + item["related_product_ids"].extend(related_product_ids) |
| 128 | + return item |
| 129 | +
|
| 130 | + @staticmethod |
| 131 | + def parse_related_product_ids(page: web_poet.WebPage) -> List[str]: |
| 132 | + return page.css("#main .related-products ::attr(product-id)").getall() |
| 133 | +
|
| 134 | +Here's the key takeaway in this example: |
| 135 | + |
| 136 | + * Similar to :class:`~.HttpClient`'s :meth:`~.HttpClient.get` method, |
| 137 | + a :meth:`~.HttpClient.post` method is also available that's |
| 138 | + typically used to submit forms. |
| 139 | + |
| 140 | +Batch requests |
| 141 | +-------------- |
| 142 | + |
| 143 | +We can also choose to process requests as a **batch** instead of sequentially. |
| 144 | +Let's modify the example in the previous section to see how it can be done: |
| 145 | + |
| 146 | +.. code-block:: python |
| 147 | +
|
| | + import json |
| 148 | + from typing import List |
| 149 | +
|
| 150 | + import attr |
| 151 | + import web_poet |
| 152 | +
|
| 153 | +
|
| 154 | + @attr.define |
| 155 | + class ProductPage(web_poet.ItemWebPage): |
| 156 | + http_client: web_poet.HttpClient |
| 157 | +
|
| 158 | + default_pagination_limit = 10 |
| 159 | +
|
| 160 | + async def to_item(self): |
| 161 | + item = { |
| 162 | + "url": self.url, |
| 163 | + "name": self.css("#main h3.name ::text").get(), |
| 164 | + "product_id": self.css("#product ::attr(product-id)").get(), |
| 165 | + "related_product_ids": self.parse_related_product_ids(self), |
| 166 | + } |
| 167 | +
|
| 168 | + requests: List[web_poet.Request] = [ |
| 169 | + self.create_request(page_num=page_num) |
| 170 | + for page_num in range(2, self.default_pagination_limit) |
| 171 | + ] |
| 172 | + responses: List[web_poet.ResponseData] = await self.http_client.batch_requests(*requests) |
| 173 | + pages = map(web_poet.WebPage, responses) |
| 174 | + related_product_ids = [ |
| 175 | + product_id |
| 176 | + for page in pages |
| 177 | + for product_id in self.parse_related_product_ids(page) |
| 178 | + ] |
| 179 | +
|
| 180 | + item["related_product_ids"].extend(related_product_ids) |
| 181 | + return item |
| 182 | +
|
| 183 | + def create_request(self, page_num=2): |
| 184 | + # Simulates "scrolling" through a carousel that loads related product items |
| 185 | + return web_poet.Request( |
| 186 | + url="https://www.api.example.com/product-pagination/", |
| 187 | + method="POST", |
| 188 | + headers={ |
| 189 | + 'Host': 'www.example.com', |
| 190 | + 'Content-Type': 'application/json; charset=UTF-8', |
| 191 | + }, |
| 192 | + body=json.dumps( |
| 193 | + { |
| 194 | + "Page": page_num, |
| 195 | + "ProductID": self.css("#product ::attr(product-id)").get(), |
| 196 | + } |
| 197 | + ), |
| 198 | + ) |
| 199 | +
|
| 200 | + @staticmethod |
| 201 | + def parse_related_product_ids(page: web_poet.WebPage) -> List[str]: |
| 202 | + return page.css("#main .related-products ::attr(product-id)").getall() |
| 203 | +
|
| 204 | +The key takeaways for this example are: |
| 205 | + |
| 206 | + * A :class:`~.Request` can be instantiated to represent a generic HTTP request. |
| 207 | + It only holds the request information and isn't executed when it's created. |
| 208 | + This is useful for writing factory methods that build requests without |
| 209 | + triggering any download. |
| 210 | + * :class:`~.HttpClient` has a :meth:`~.HttpClient.batch_requests` method that |
| 211 | + can process a series of :class:`~.Request` instances. |
| 212 | + |
| 213 | + * Note that it can accept different types of :class:`~.Request` that might |
| 214 | + not be related *(e.g. a mixture of* ``GET`` *and* ``POST`` *requests)*. |
| 215 | + Processing them as a batch takes advantage of async |
| 216 | + execution. |
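To make the benefit of batching concrete, here is a minimal standalone sketch that uses only :mod:`asyncio`. It does not use **web-poet** and makes no claim about how ``batch_requests`` works internally. ``fake_download``, ``fetch_sequentially`` and ``fetch_in_batch`` are hypothetical names, and the short sleep merely stands in for network latency.

```python
import asyncio
from typing import List


async def fake_download(page_num: int) -> str:
    """Hypothetical stand-in for a real HTTP call; sleeps to mimic latency."""
    await asyncio.sleep(0.05)
    return f"response for page {page_num}"


async def fetch_sequentially(page_nums: List[int]) -> List[str]:
    # Each download only starts after the previous one has finished.
    return [await fake_download(n) for n in page_nums]


async def fetch_in_batch(page_nums: List[int]) -> List[str]:
    # All downloads are scheduled together; gather() returns the results
    # in the same order as the inputs.
    return list(await asyncio.gather(*(fake_download(n) for n in page_nums)))
```

Both functions return the same list for the same input. The batched version lets the simulated waits overlap, so its total waiting time stays close to that of a single download instead of growing with the number of pages.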
| 217 | + |
| 218 | +.. _advanced-downloader-impl: |
| 219 | + |
| 220 | +Downloader Implementation |
| 221 | +========================= |
| 222 | + |
| 223 | +Note that :class:`~.HttpClient` doesn't know how to execute a request |
| 224 | +on its own. Frameworks or projects that want to use additional requests |
| 225 | +in Page Objects therefore need to provide the implementation that |
| 226 | +downloads a :class:`~.Request`. |
| 227 | + |
| 228 | +For more info on this, see the API specification for :class:`~.HttpClient`. |
| 229 | + |
| 230 | +In any case, frameworks that wish to support **web-poet** could provide the |
| 231 | +HTTP downloader implementation in two ways: |
| 232 | + |
| 233 | +.. _setup-contextvars: |
| 234 | + |
| 235 | +1. Context Variable |
| 236 | +------------------- |
| 237 | + |
| 238 | +:mod:`contextvars` is natively supported in :mod:`asyncio` for setting and |
| 239 | +accessing context-aware values. This means that the framework using **web-poet** |
| 240 | +can assign the implementation through the readily available |
| 241 | +:class:`contextvars.ContextVar` instance named ``web_poet.request_backend_var``. |
| 242 | + |
| 243 | +This can be set using: |
| 244 | + |
| 245 | +.. code-block:: python |
| 246 | +
|
| | + import web_poet |
| | +
|
| 247 | + def request_implementation(r: web_poet.Request) -> web_poet.ResponseData: |
| 248 | + ... |
| 249 | +
|
| 250 | + from web_poet import request_backend_var |
| 251 | + request_backend_var.set(request_implementation) |
| 252 | +
|
| 253 | +Once this is set, a :class:`~.HttpClient` instance uses this request |
| 254 | +implementation by default. |
| 255 | + |
| 256 | +.. warning:: |
| 257 | + |
| 258 | + If no value for ``web_poet.request_backend_var`` is set, a |
| 259 | + :class:`~.RequestBackendError` is raised. However, no exception is |
| 260 | + raised if **option 2** below is used instead. |
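The behaviour described above can be sketched with the standard library alone. This is an illustration under assumptions, not **web-poet**'s actual code: ``backend_var``, ``BackendNotSetError``, ``execute`` and ``fake_backend`` are hypothetical stand-ins for ``request_backend_var``, :class:`~.RequestBackendError`, the client's request execution and a framework-supplied downloader. Requests are plain URL strings and everything is synchronous to keep the sketch short.

```python
import contextvars

# Hypothetical stand-in for ``web_poet.request_backend_var``.
backend_var = contextvars.ContextVar("backend_var")


class BackendNotSetError(Exception):
    """Hypothetical error playing the role of RequestBackendError."""


def execute(url: str) -> str:
    """Look up the backend in the current context and run the request."""
    try:
        backend = backend_var.get()
    except LookupError:
        # ContextVar.get() raises LookupError when no value (and no default)
        # has been set in the current context.
        raise BackendNotSetError("no request backend has been set") from None
    return backend(url)


def fake_backend(url: str) -> str:
    """Hypothetical downloader that a framework would provide."""
    return f"downloaded {url}"
```

Calling ``execute()`` before ``backend_var.set(fake_backend)`` raises the error, while calling it afterwards returns the backend's result. ``set()`` returns a token that can be passed to ``backend_var.reset()`` to restore the previous state.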
| 261 | + |
| 262 | + |
| 263 | +2. Dependency Injection |
| 264 | +----------------------- |
| 265 | + |
| 266 | +The framework using **web-poet** might be using other libraries that don't |
| 267 | +fully support :mod:`contextvars` `(e.g. Twisted)`. In that case, an |
| 268 | +alternative approach is to supply the request implementation when creating |
| 269 | +an :class:`~.HttpClient` instance: |
| 270 | + |
| 271 | + |
| 272 | +.. code-block:: python |
| 273 | +
|
| | + import web_poet |
| | +
|
| 274 | + def request_implementation(r: web_poet.Request) -> web_poet.ResponseData: |
| 275 | + ... |
| 276 | +
|
| 277 | + from web_poet import HttpClient |
| 278 | + http_client = HttpClient(request_downloader=request_implementation) |
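How an injected implementation might take priority over the context variable can be sketched as follows. This is a simplified, synchronous illustration and not the real :class:`~.HttpClient`: ``SketchClient``, ``fallback_var`` and the string-based downloader signature are assumptions made for this example only.

```python
import contextvars
from typing import Callable, Optional

# Hypothetical downloader type: takes a URL, returns the response body.
Downloader = Callable[[str], str]

# Hypothetical context variable used only when nothing is injected.
fallback_var = contextvars.ContextVar("fallback_var")


class SketchClient:
    """Hypothetical client that prefers an injected downloader."""

    def __init__(self, request_downloader: Optional[Downloader] = None):
        self._downloader = request_downloader

    def get(self, url: str) -> str:
        # Use the injected downloader if one was given; otherwise fall back
        # to the context variable (raises LookupError if that is unset too).
        downloader = self._downloader or fallback_var.get()
        return downloader(url)
```

With this design, ``SketchClient(request_downloader=...)`` never touches the context variable, which mirrors the note above that no exception is raised when the implementation is injected directly.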