From 7f9a995aa221c75c586510443da9aa4ddedfb9a3 Mon Sep 17 00:00:00 2001 From: Kevin Lloyd Bernal Date: Tue, 1 Mar 2022 22:11:56 +0800 Subject: [PATCH 1/4] add docs for Page Object Project (POP) --- docs/index.rst | 1 + docs/intro/pop.rst | 175 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 176 insertions(+) create mode 100644 docs/intro/pop.rst diff --git a/docs/index.rst b/docs/index.rst index db4d852d..f0e024e9 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -34,6 +34,7 @@ and the motivation behind ``web-poet``, start with :ref:`from-ground-up`. intro/tutorial intro/from-ground-up intro/overrides + intro/pop .. toctree:: :caption: Reference diff --git a/docs/intro/pop.rst b/docs/intro/pop.rst new file mode 100644 index 00000000..a26dd962 --- /dev/null +++ b/docs/intro/pop.rst @@ -0,0 +1,175 @@ +.. _`intro-pop`: + +Page Object Projects (POP) +========================== + +**POPs** are way to package up a group of Page Objects together so they +can be used in other projects as well. This improves code reusability since +the extraction logic in some web pages are easily shareable. More importantly, +**POPs** could be built using other **POPs**. This allows for continuous +improvements by building **POPs** on top of one another. + +Organizing POP +-------------- + +Developers have complete freedom on how to organize their Page Objects +in their project. Here are some options that developers could use. + +Flat Hierarchy +~~~~~~~~~~~~~~ + +A good default option for organizing Page Objects would be to simply have +their respective modules in a flat hierarchy as seen in the example below. + +.. 
code-block:: + + ecommerce-page-objects + ├── ecommerce_page_objects + | ├── __init__.py + | ├── cool_gadget_site_us_products.py + | ├── cool_gadget_site_us_product_listings.py + | ├── cool_gadget_site_fr_products.py + | ├── cool_gadget_site_fr_product_listings.py + | ├── furniture_shop_products.py + | └── furniture_shop_product_listings.py + └── setup.py + +However, when your Page Object Project grows, it may be difficult to manage +a flat structure like this. + +Hierarchical Directories +~~~~~~~~~~~~~~~~~~~~~~~~ + +One key advantage for organizing the Page Objects into a hierarchy +of subpackages is that large websites could be broken further into +its more granular form. + +A quick example would be websites having multiple country-specific +domains. This could easily be grouped as something like: + +.. code-block:: + + ecommerce-page-objects + ├── ecommerce_page_objects + | ├── cool_gadget_site + | | ├── us + | | | ├── __init__.py + | | | ├── products.py + | | | └── product_listings.py + | | ├── fr + | | | ├── __init__.py + | | | ├── products.py + | | | └── product_listings.py + | | └── __init__.py + | └── furniture_shop + | ├── __init__.py + | ├── products.py + | └── product_listings.py + └── setup.py + +Requirements for POP +-------------------- + +Minimum Requirements +~~~~~~~~~~~~~~~~~~~~ + +This covers the basic use case: + + - Installation of **POP** either from public or private repositories: + + - PyPI + - Git + + - Version specifiers that can be used to accommodate the various parser patches + that come along in any web data extraction project. + - Importing the Page Objects directly from the installed package in a project. + + +This translates into **POPs** needing to have: + + - The ``setup.py`` script which is the standard way of distributing Python packages. + +Thus, the most basic way of packaging **POPs** would be: + +.. 
code-block:: python + + from setuptools import setup, find_packages + + setup( + name='ecommerce-page-objects', + version='1.0.0', + packages=find_packages(), + install_requires=["web-poet"] + ) + +This allows the **POP** to be installable via ``pip install ecommerce-page-objects==1.0.0`` +`(assuming it's deployed in PyPI)` or via a Git repo like +``pip install git+https://github.com/some-org/ecommerce-page-objects.git@1.0.0`` +`(assuming the repo is public)`. + +After installing the **POP**, anyone could access the Page Objects in it +by simply importing them: + +.. code-block:: python + + from ecommerce_page_objects.furniture_shop.products import FurnitureProductPage + + response_data = download_response("https://www.furnitureshop.com/product/xyz") + page = FurnitureProductPage() + item = page.to_item() + +Recommended Requirements +~~~~~~~~~~~~~~~~~~~~~~~~ + +This covers these use use cases: + + - The minimum requirements and use cases stated above + - The ability to retrieve the declared :class:`~.OverrideRule` + inside the **POP** + +This means that a list of :class:`~.OverrideRule` must be explicitly +declared in the **POP**. This enables projects using the **POP** to know: + + - which URL Patterns a given Page Object is expected to work + - what it's trying to override `(or replace)` + +This could be done by declaring a ``RULES`` variable that can be +imported as a top-level variable from the package. + +For example, suppose our project is named **ecommerce_page_objects** +and is using either of the project structure options discussed in the +previous sections, then we can define the ``RULES`` as the following +inside ``ecommerce_page_objects/ecommerce_page_objects/__init__.py``. + +.. 
code-block:: python + + from web_poet import default_registry, consume_modules + + consume_modules("ecommerce_page_objects") + RULES = default_registry.get_overrides() + +This allows any developer using a **POP** to easily get the list of +:class:`~.OverrideRule` using the convention of accessing it via the +``RULES`` variable as a top-level variable: + +.. code-block:: python + + from ecommerce_page_objects import RULES + +There may be some circumstances that needs other ways of declaring this. +For such cases, developers/maintainers of **POPs** must reflect that +clearly in the documentation. + + +Conventions and Best Practices +------------------------------ + +1. Page Objects should have its classname end with a **Page** suffix. + + - This allows for easy identification when used by other developers. + +2. The list of :class:`~.OverrideRule` must be declared as a top-level + variable from the package named ``RULES``. + + - This enables other developers to easily retrieve the list of + :class:`~.OverrideRule` to be used in their own projects. From e44e39970d19a84c0ed490864e7d0ba9aef7fbe2 Mon Sep 17 00:00:00 2001 From: Kevin Lloyd Bernal Date: Wed, 2 Mar 2022 10:55:33 +0800 Subject: [PATCH 2/4] update pop docs to include how to properly define and use the entry point rules --- docs/intro/overrides.rst | 12 ++++++++++++ docs/intro/pop.rst | 26 +++++++++++++++++++------- 2 files changed, 31 insertions(+), 7 deletions(-) diff --git a/docs/intro/overrides.rst b/docs/intro/overrides.rst index 43819ee4..2cacf6aa 100644 --- a/docs/intro/overrides.rst +++ b/docs/intro/overrides.rst @@ -213,6 +213,18 @@ This can be done something like: duration. Calling :func:`~.web_poet.overrides.consume_modules` again makes no difference unless a new set of modules are provided. +.. tip:: + + If you're using External Packages which conform to the **POP** + standards as described in the :ref:`intro-pop` section, then retrieving + the rules should be as easy as: + + .. 
code-block:: python + + import ecommerce_page_objects, gadget_sites_page_objects + + rules = ecommerce_page_objects.RULES + gadget_sites_page_objects.RULES + .. _`intro-rule-subset`: Using only a subset of the available OverrideRules diff --git a/docs/intro/pop.rst b/docs/intro/pop.rst index a26dd962..169cf135 100644 --- a/docs/intro/pop.rst +++ b/docs/intro/pop.rst @@ -145,7 +145,9 @@ inside ``ecommerce_page_objects/ecommerce_page_objects/__init__.py``. from web_poet import default_registry, consume_modules - consume_modules("ecommerce_page_objects") + # This allows all of the OverrideRules declared inside the package + # using @handle_urls to be properly discovered and loaded. + consume_modules(__package__) RULES = default_registry.get_overrides() This allows any developer using a **POP** to easily get the list of @@ -165,11 +167,21 @@ Conventions and Best Practices ------------------------------ 1. Page Objects should have its classname end with a **Page** suffix. - - - This allows for easy identification when used by other developers. + This allows for easy identification when used by other developers. 2. The list of :class:`~.OverrideRule` must be declared as a top-level - variable from the package named ``RULES``. - - - This enables other developers to easily retrieve the list of - :class:`~.OverrideRule` to be used in their own projects. + variable from the package named ``RULES``. This enables other developers + to easily retrieve the list of :class:`~.OverrideRule` to be used in + their own projects. + +3. It is recommended to use the ``web_poet.default_registry`` by default + instead of creating your own custom registries by instantiating + :class:`~.PageObjectRegistry`. This provides a default expectation + for developers on which registry to use right from the start. + +4. 
When building a new **POP** based of on existing **POPs**, it is + recommended to use an **inclusion** strategy rather than **exclusion** + when selecting the list of :class:`~.OverrideRule` to export. + This is due to the latter having the risk of being brittle when the + underlying source **POPs** change. This could lead to a few + :class:`~.OverrideRule` that are unintentionally included. From 8e805a7ab227467739ccae8c688e8fa051152f31 Mon Sep 17 00:00:00 2001 From: Kevin Lloyd Bernal Date: Wed, 23 Mar 2022 20:50:42 +0800 Subject: [PATCH 3/4] update POP docs after PageObjectRegistry became a dict subclass --- docs/intro/overrides.rst | 8 +++-- docs/intro/pop.rst | 69 +++++++++++++++++++++++++--------------- 2 files changed, 49 insertions(+), 28 deletions(-) diff --git a/docs/intro/overrides.rst b/docs/intro/overrides.rst index 2cacf6aa..03fcaa7d 100644 --- a/docs/intro/overrides.rst +++ b/docs/intro/overrides.rst @@ -217,13 +217,17 @@ This can be done something like: If you're using External Packages which conform to the **POP** standards as described in the :ref:`intro-pop` section, then retrieving - the rules should be as easy as: + the rules could also be done as: .. code-block:: python import ecommerce_page_objects, gadget_sites_page_objects - rules = ecommerce_page_objects.RULES + gadget_sites_page_objects.RULES + # If on Python 3.9+ + rules = ecommerce_page_objects.REGISTRY | gadget_sites_page_objects.REGISTRY + + # If on lower Python versions + rules = {**ecommerce_page_objects.REGISTRY, **gadget_sites_page_objects.REGISTRY} .. _`intro-rule-subset`: diff --git a/docs/intro/pop.rst b/docs/intro/pop.rst index 169cf135..85d08820 100644 --- a/docs/intro/pop.rst +++ b/docs/intro/pop.rst @@ -3,9 +3,8 @@ Page Object Projects (POP) ========================== -**POPs** are way to package up a group of Page Objects together so they -can be used in other projects as well. 
This improves code reusability since -the extraction logic in some web pages are easily shareable. More importantly, +**POPs** are a way to standardize how a group of Page Objects are packaged +together so they can be uniformly used in other projects. More importantly, **POPs** could be built using other **POPs**. This allows for continuous improvements by building **POPs** on top of one another. @@ -13,7 +12,7 @@ Organizing POP -------------- Developers have complete freedom on how to organize their Page Objects -in their project. Here are some options that developers could use. +in their projects. Here are some of the options that developers could use. Flat Hierarchy ~~~~~~~~~~~~~~ @@ -61,10 +60,11 @@ domains. This could easily be grouped as something like: | | | ├── products.py | | | └── product_listings.py | | └── __init__.py - | └── furniture_shop - | ├── __init__.py - | ├── products.py - | └── product_listings.py + | ├── furniture_shop + | | ├── __init__.py + | | ├── products.py + | | └── product_listings.py + | └── __init__.py └── setup.py Requirements for POP @@ -85,7 +85,7 @@ This covers the basic use case: - Importing the Page Objects directly from the installed package in a project. -This translates into **POPs** needing to have: +This means that **POPs** need to have: - The ``setup.py`` script which is the standard way of distributing Python packages. 
@@ -114,32 +114,32 @@ by simply importing them: from ecommerce_page_objects.furniture_shop.products import FurnitureProductPage - response_data = download_response("https://www.furnitureshop.com/product/xyz") - page = FurnitureProductPage() + response = download_response("https://www.furnitureshop.com/product/xyz") + page = FurnitureProductPage(response) item = page.to_item() Recommended Requirements ~~~~~~~~~~~~~~~~~~~~~~~~ -This covers these use use cases: +This covers these use cases: - - The minimum requirements and use cases stated above + - The minimum requirements and its use cases - The ability to retrieve the declared :class:`~.OverrideRule` inside the **POP** -This means that a list of :class:`~.OverrideRule` must be explicitly +This means that a collection of :class:`~.OverrideRule` must be explicitly declared in the **POP**. This enables projects using the **POP** to know: - which URL Patterns a given Page Object is expected to work - what it's trying to override `(or replace)` -This could be done by declaring a ``RULES`` variable that can be +This could be done by declaring a ``REGISTRY`` variable that can be imported as a top-level variable from the package. For example, suppose our project is named **ecommerce_page_objects** -and is using either of the project structure options discussed in the -previous sections, then we can define the ``RULES`` as the following -inside ``ecommerce_page_objects/ecommerce_page_objects/__init__.py``. +and is using any of the project structure options discussed in the +previous sections, we can then define the ``REGISTRY`` variable as the following +inside of ``ecommerce-page-objects/ecommerce_page_objects/__init__.py``: .. code-block:: python @@ -148,19 +148,32 @@ inside ``ecommerce_page_objects/ecommerce_page_objects/__init__.py``. # This allows all of the OverrideRules declared inside the package # using @handle_urls to be properly discovered and loaded. 
consume_modules(__package__) - RULES = default_registry.get_overrides() -This allows any developer using a **POP** to easily get the list of + REGISTRY = default_registry + +This allows any developer using a **POP** to easily access all of the :class:`~.OverrideRule` using the convention of accessing it via the -``RULES`` variable as a top-level variable: +``REGISTRY`` variable. For example: .. code-block:: python - from ecommerce_page_objects import RULES + from ecommerce_page_objects import REGISTRY + +.. tip:: + + The ``default_registry`` is an instance of :class:`~.PageObjectRegistry`, + which in turn is simply a subclass of a ``dict``. This means that you don't + necessarily have to use an instance of :class:`~.PageObjectRegistry` as long + as it has a ``dict``-like interface. -There may be some circumstances that needs other ways of declaring this. -For such cases, developers/maintainers of **POPs** must reflect that -clearly in the documentation. + The :class:`~.PageObjectRegistry` is simply a mapping where the **key** is + the Page Object to use and the **value** is the :class:`~.OverrideRule` it + operates on. This means you can simply use a plain ``dict`` for the + ``REGISTRY`` variable. + + However, it is **recommended** to use the instances of + :class:`~.PageObjectRegistry` to leverage the validation logic for its + contents. Conventions and Best Practices @@ -170,7 +183,7 @@ Conventions and Best Practices This allows for easy identification when used by other developers. 2. The list of :class:`~.OverrideRule` must be declared as a top-level - variable from the package named ``RULES``. This enables other developers + variable from the package named ``REGISTRY``. This enables other developers to easily retrieve the list of :class:`~.OverrideRule` to be used in their own projects. @@ -179,6 +192,10 @@ Conventions and Best Practices :class:`~.PageObjectRegistry`. 
This provides a default expectation for developers on which registry to use right from the start. + * However, there will be some cases where creating a new instance of + :class:`~.PageObjectRegistry` is inevitably needed. Here's an + :ref:`example ` in the tutorial section. + 4. When building a new **POP** based of on existing **POPs**, it is recommended to use an **inclusion** strategy rather than **exclusion** when selecting the list of :class:`~.OverrideRule` to export. From e256b962577548e7d4772c2711fb0065311178b2 Mon Sep 17 00:00:00 2001 From: Kevin Lloyd Bernal Date: Tue, 29 Mar 2022 19:34:19 +0800 Subject: [PATCH 4/4] simplify and make the POP doc clearer --- docs/intro/overrides.rst | 16 ----------- docs/intro/pop.rst | 61 ++++++++++++++++++++++------------------ 2 files changed, 33 insertions(+), 44 deletions(-) diff --git a/docs/intro/overrides.rst b/docs/intro/overrides.rst index 03fcaa7d..43819ee4 100644 --- a/docs/intro/overrides.rst +++ b/docs/intro/overrides.rst @@ -213,22 +213,6 @@ This can be done something like: duration. Calling :func:`~.web_poet.overrides.consume_modules` again makes no difference unless a new set of modules are provided. -.. tip:: - - If you're using External Packages which conform to the **POP** - standards as described in the :ref:`intro-pop` section, then retrieving - the rules could also be done as: - - .. code-block:: python - - import ecommerce_page_objects, gadget_sites_page_objects - - # If on Python 3.9+ - rules = ecommerce_page_objects.REGISTRY | gadget_sites_page_objects.REGISTRY - - # If on lower Python versions - rules = {**ecommerce_page_objects.REGISTRY, **gadget_sites_page_objects.REGISTRY} - .. 
_`intro-rule-subset`: Using only a subset of the available OverrideRules diff --git a/docs/intro/pop.rst b/docs/intro/pop.rst index 85d08820..1353844f 100644 --- a/docs/intro/pop.rst +++ b/docs/intro/pop.rst @@ -123,57 +123,62 @@ Recommended Requirements This covers these use cases: - - The minimum requirements and its use cases + - The `minimum requirements` and their use cases - The ability to retrieve the declared :class:`~.OverrideRule` - inside the **POP** + available inside the **POP** -This means that a collection of :class:`~.OverrideRule` must be explicitly -declared in the **POP**. This enables projects using the **POP** to know: +This means that a collection of :class:`~.OverrideRule` must be properly +discovered within the **POP**. This enables projects using the **POP** to know: - which URL Patterns a given Page Object is expected to work - what it's trying to override `(or replace)` -This could be done by declaring a ``REGISTRY`` variable that can be -imported as a top-level variable from the package. - -For example, suppose our project is named **ecommerce_page_objects** +To give an example, suppose our **POP** is named **ecommerce_page_objects** and is using any of the project structure options discussed in the -previous sections, we can then define the ``REGISTRY`` variable as the following -inside of ``ecommerce-page-objects/ecommerce_page_objects/__init__.py``: +previous sections. We can then define the entry point for discovering +all :class:`~.OverrideRule` by writing the following code inside +``ecommerce-page-objects/ecommerce_page_objects/__init__.py``: .. code-block:: python - from web_poet import default_registry, consume_modules + from web_poet import consume_modules # This allows all of the OverrideRules declared inside the package # using @handle_urls to be properly discovered and loaded. consume_modules(__package__) - REGISTRY = default_registry +.. 
note:: + + Remember that Python code, including decorators, is only executed + when the module it belongs to is imported. Thus, for every + ``@handle_urls`` decorator to register its rule, the modules using it need to + be imported recursively via :func:`~.consume_modules`. + +This allows developers to properly access all of the :class:`~.OverrideRule` +declared using the ``@handle_urls`` decorator inside the **POP**. In turn, +this also allows **POPs** which use ``web_poet.default_registry`` to have all +their rules discovered, provided they follow Convention **#3** +(see :ref:`best-practices`). -This allows any developer using a **POP** to easily access all of the -:class:`~.OverrideRule` using the convention of accessing it via the -``REGISTRY`` variable. For example: +In other words, importing the ``ecommerce_page_objects`` **POP** into a +project immediately loads all of its rules into **web-poet's** +``default_registry``: .. code-block:: python - from ecommerce_page_objects import REGISTRY + from web_poet import default_registry -.. tip:: + import ecommerce_page_objects - The ``default_registry`` is an instance of :class:`~.PageObjectRegistry`, - which in turn is simply a subclass of a ``dict``. This means that you don't - necessarily have to use an instance of :class:`~.PageObjectRegistry` as long - as it has a ``dict``-like interface. + # All the rules are now available. + rules = default_registry.get_overrides() - The :class:`~.PageObjectRegistry` is simply a mapping where the **key** is - the Page Object to use and the **value** is the :class:`~.OverrideRule` it - operates on. This means you can simply use a plain ``dict`` for the - ``REGISTRY`` variable. +If this recommended requirement is followed properly, there's no need to +call ``consume_modules("ecommerce_page_objects")`` before calling +:meth:`~.PageObjectRegistry.get_overrides`, since all the :class:`~.OverrideRule` +were already discovered when the **POP** was imported. 
- However, it is **recommended** to use the instances of - :class:`~.PageObjectRegistry` to leverage the validation logic for its - contents. +.. _`best-practices`: Conventions and Best Practices