GH-41105: [Python][Docs] Update PyArrow installation docs for conda package split #41135

Merged May 16, 2024 (12 commits)
1 change: 1 addition & 0 deletions docs/source/python/flight.rst
@@ -17,6 +17,7 @@

.. currentmodule:: pyarrow.flight
.. highlight:: python
.. _flight:

================
Arrow Flight RPC
46 changes: 36 additions & 10 deletions docs/source/python/install.rst
@@ -107,17 +107,41 @@ a custom path to the database from Python:
Differences between conda-forge packages
----------------------------------------

PyArrow is packaged on `conda-forge <https://conda-forge.org/>`_ as three
On `conda-forge <https://conda-forge.org/>`_, PyArrow is published as three
separate packages, each providing varying levels of functionality. This is in
contrast to PyPI, where only a single PyArrow package is provided.

The purpose of this split is to minimize the size of the installed package for
most users (``pyarrow``), provide a smaller, minimal package for specialized use
cases (``pyarrow-core``), while still providing a complete package for users who
require it (``pyarrow-all``).
require it (``pyarrow-all``). What was historically ``pyarrow`` on
`conda-forge <https://conda-forge.org/>`_ is now ``pyarrow-all``, though most
users can continue using ``pyarrow``.

The table below lists the functionality provided by each package and may be
useful when deciding to use one package over another:
The ``pyarrow-core`` package includes the following functionality:

- :ref:`data`
- :ref:`compute` (i.e., ``pyarrow.compute``)
- :ref:`io`
- :ref:`ipc` (i.e., ``pyarrow.ipc``)
- :ref:`filesystem` (HDFS, S3, GCS, etc.)
jorisvandenbossche (Member), May 14, 2024:

If we list that here, I think we should also say that those cloud filesystems are planned to be moved out of pyarrow-core in the next release, so you should install pyarrow if you want to rely on them being present.

Member:

pyarrow.fs itself is always available, though (with, at a minimum, just the LocalFileSystem).

amoeba (Member Author), May 15, 2024:

Good point, thanks. Changed in 51f2fde.

- File formats: :ref:`Arrow/Feather<feather>`, :ref:`JSON<json>`, :ref:`CSV<py-csv>`, :ref:`ORC<orc>` (but not Parquet)

The ``pyarrow`` package adds the following:

- Acero
- :ref:`dataset` (i.e., ``pyarrow.dataset``)
- :ref:`Parquet<parquet>` (i.e., ``pyarrow.parquet``)
- Substrait

Finally, ``pyarrow-all`` adds:

- :ref:`flight` and Flight SQL (i.e., ``pyarrow.flight``)
- Gandiva

The following table lists the functionality provided by each package and may be
useful when deciding to use one package over another or when
:ref:`python-conda-custom-selection`.

+------------+---------------------+--------------+---------+-------------+
| Component  | Package             | pyarrow-core | pyarrow | pyarrow-all |
@@ -139,20 +163,22 @@ useful when deciding to use one package over another:
| Gandiva    | libarrow-gandiva    |              |         | ✓           |
+------------+---------------------+--------------+---------+-------------+

Creating Custom Selections
^^^^^^^^^^^^^^^^^^^^^^^^^^
.. _python-conda-custom-selection:

Creating A Custom Selection
^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you know which components you need and want to control what's installed, you
can create a custom selection of packages to include just the extra features you
need. For example, to install ``pyarrow-core`` and just support for reading and
can create a custom selection of packages to include only the extra features you
need. For example, to install ``pyarrow-core`` and add support for reading and
writing Parquet, install ``libparquet`` alongside ``pyarrow-core``:

.. code-block:: shell

   conda install pyarrow-core libparquet
   conda install -c conda-forge pyarrow-core libparquet

Or if you wish to use ``pyarrow`` but need support for Flight RPC:

.. code-block:: shell

   conda install pyarrow libarrow-flight
   conda install -c conda-forge pyarrow libarrow-flight
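
After installing a custom selection, it can be useful to confirm which optional components are actually importable. The following is a minimal sketch and not part of the PyArrow documentation: the helper name ``available_modules`` is hypothetical, and the module list simply reuses the submodule names mentioned above (``pyarrow.parquet``, ``pyarrow.dataset``, ``pyarrow.flight``). It assumes that a missing component surfaces as an ``ImportError`` when its Python module is imported.

```python
import importlib


def available_modules(names):
    """Return a dict mapping each module name to True if it imports cleanly.

    Hypothetical helper for illustration; not part of PyArrow.
    """
    result = {}
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Covers both a missing Python module and a module whose
            # import fails because a required library is not installed.
            result[name] = False
        else:
            result[name] = True
    return result


if __name__ == "__main__":
    # Submodule names taken from the package lists above.
    optional = ["pyarrow.parquet", "pyarrow.dataset", "pyarrow.flight"]
    for name, ok in available_modules(optional).items():
        print(f"{name}: {'available' if ok else 'missing'}")
```

Probing by import keeps the check independent of how the packages were installed, so it should behave the same for ``pyarrow-core``, ``pyarrow``, ``pyarrow-all``, or a custom selection.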