Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41105: [Python][Docs] Update PyArrow installation docs for conda package split #41135

Merged
merged 12 commits into from
May 16, 2024

Conversation

amoeba
Copy link
Member

@amoeba amoeba commented Apr 10, 2024

Do not merge until some discussion is had about how to time this relative to conda-forge/arrow-cpp-feedstock#1376. Additionally, consider hot-patching this into arrow-site if appropriate.

What changes are included in this PR?

Updates to the Python installation docs to reflect the in-progress change splitting PyArrow on conda-forge into three separate packages. Specifically:

  1. Add a note in the conda section highlighting that there are three packages and linking to a new section (2) in order to provide more information
  2. Add a new section, linked from (1), providing a comparison of each package as a table

Are these changes tested?

These are just docs changes. I have built them locally and they look fine.

Are there any user-facing changes?

Just docs.

+------------+------------------------------+------------------------------+------------------------------+
| Component | pyarrow | pyarrow-core | pyarrow-all |
+============+==============================+==============================+==============================+
| Core | :fas:`check;sd-text-success` | :fas:`check;sd-text-success` | :fas:`check;sd-text-success` |
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: I used "Core" here for lack of a better word. I wasn't aware of any common way of referring to the functionality libarrow provides to PyArrow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe Arrow C++ minimal or Arrow minimal? if it isn't too long (mentioned here: https://arrow.apache.org/docs/dev/developers/cpp/building.html#optional-components)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out, it might work.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 10, 2024
| Flight SQL | | | :fas:`check;sd-text-success` |
+------------+------------------------------+------------------------------+------------------------------+
| Gandiva | | | :fas:`check;sd-text-success` |
+------------+------------------------------+------------------------------+------------------------------+
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rows in the table above are based on @raulcd's breakdown on the mailing list but I think we want to be careful about what's listed here. I think the best thing would be for this table to fully align with the submodules exposed in PyArrow since that's what the user is most familiar with. And we might even consider renaming the "Component" column to "Module" and using the literal module names. i.e., parquet instead of Parquet so it's clear we're talking about being able to run import pyarrow.parquet or not.

This would mean the table isn't complete yet (json, csv, filesystems, orc, more?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with aligning with PyArrow on the names 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In looking closer, I think it makes sense to align on conda package names too so in cccc7ff I updated the table to include the names of the conda package for each component.

I need to look at where the modules I mentioned above (JSON, etc) live because I think that should be clear.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to look at where the modules I mentioned above (JSON, etc) live because I think that should be clear.

Right now all the things you mentioned (json, csv, filesystems, orc) are included in the main libarrow.so, and so can only be either enabled or disabled (cannot be installed separately).
For the filesystems, there is work under way to actually build them as separate libraries, so we will have to update this doc section once that is done. For ORC I am actually wondering if we should open an issue about splitting that in a separate library (although liborc is not that big of a dependency, so it might not be super important)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks both of you. The latest draft now tries to make it clear which Python modules each package provides. It'd be good to get reviews on the overall style and also the correctnes/completeness.

@raulcd
Copy link
Member

raulcd commented Apr 11, 2024

@github-actions crossbow submit preview-docs

Copy link

Revision: 4fc75b7

Submitted crossbow builds: ursacomputing/crossbow @ actions-eab855c0d2

Task Status
preview-docs GitHub Actions

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Apr 11, 2024
@jorisvandenbossche
Copy link
Member

We might want to add an example about how to create an env with a custom selection by installing the core package plus just one additional module. I think that will be something like conda install pyarrow-core libparquet

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 1, 2024
@amoeba
Copy link
Member Author

amoeba commented May 1, 2024

Oh, that's really neat. I added a subsection about custom selections but let me know if you think it needs higher billing.

Next steps on this PR are:

@amoeba
Copy link
Member Author

amoeba commented May 1, 2024

@github-actions crossbow submit preview-docs

Copy link

github-actions bot commented May 1, 2024

Revision: be37b52

Submitted crossbow builds: ursacomputing/crossbow @ actions-c33978f0b1

Task Status
preview-docs GitHub Actions

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 2, 2024
@ianmcook
Copy link
Member

Thanks for doing this @amoeba. If possible, can we include some more information about which parts of pyarrow.compute are available / usable in the pyarrow-core package? See the discussion about that here: conda-forge/arrow-cpp-feedstock#1255 (comment)

@amoeba
Copy link
Member Author

amoeba commented May 13, 2024

Thanks @ianmcook, I debated how to present this. The two options were to either list the Python modules or the shared libraries and I went with the latter in the current revision, though this may have been mostly out of convenience. Your suggestion would be to go with the former and that's probably better.

I'll take a look a that and see about updating the table.

@ianmcook
Copy link
Member

ianmcook commented May 13, 2024

Presenting the table showing how the three packages map onto the shared libraries seems worthwhile regardless, because it's a clean mapping.

I don't know whether the Python modules map as cleanly onto these three packages. It sounds like maybe they don't. The comment from @h-vetinari suggests that some pyarrow.compute functions require libarrow-acero while others don't.

So maybe a longer prose explanation or a bulleted list will be required to explain the latter.

@amoeba
Copy link
Member Author

amoeba commented May 13, 2024

Good points, thanks.

@raulcd
Copy link
Member

raulcd commented May 14, 2024

I thought pyarrow.compute required acero but that might not be the case. We will have to investigate if that's the case or in the other hand compute does not require acero at all. Probably someone with more knowledge on pyarrow like @jorisvandenbossche can help answer that one.

@jorisvandenbossche
Copy link
Member

I just answered on the feedstock issue (conda-forge/arrow-cpp-feedstock#1255 (comment)) before seeing the discussion here, but indeed, the individual compute kernels (what is in pyarrow.compute) are fully usable without acero being present. You only cannot compose them / run then in parallel on batches / ... for which you need am Acero plan.

We have an issue about splitting the compute kernels into its own shared library (#25025), and that would allow it to be installed separately as well like a libarrow-compute, but until then it's either all included in the main libarrow.so or not (if you would build libarrow.so without compute support)

- :ref:`compute` (i.e., ``pyarrow.compute``)
- :ref:`io`
- :ref:`ipc` (i.e., ``pyarrow.ipc``)
- :ref:`filesystem` (HDFS, S3, GCS, etc.)
Copy link
Member

@jorisvandenbossche jorisvandenbossche May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we list that here, I think we should also say that those cloud filesystem are planned to moved out of pyarrow-core in the next release, and so you should install pyarrow if you want to rely on those being present

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pyarrow.fs itself is always available, though (with at a minimum just the LocalFileSystem)

Copy link
Member Author

@amoeba amoeba May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, thanks. Changed in 51f2fde.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 14, 2024
amoeba and others added 3 commits May 14, 2024 16:32
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 15, 2024
@amoeba
Copy link
Member Author

amoeba commented May 15, 2024

@github-actions crossbow submit preview-docs

Copy link

Revision: 51f2fde

Submitted crossbow builds: ursacomputing/crossbow @ actions-a5a94423db

Task Status
preview-docs GitHub Actions

@amoeba
Copy link
Member Author

amoeba commented May 15, 2024

Thanks everyone for the reviews. I've addressed what comments have been left (thank you!) and I think this is very close to merge-able. The preview for the latest changeset is at http://crossbow.voltrondata.com/pr_docs/41135/python/install.html#using-conda.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels May 16, 2024
@amoeba amoeba merged commit f5ac05c into apache:main May 16, 2024
8 checks passed
@amoeba amoeba removed the awaiting merge Awaiting merge label May 16, 2024
@amoeba
Copy link
Member Author

amoeba commented May 16, 2024

Thanks for the reviews all.

I think we'll want this to appear on the website sooner than the 17 release I've filed apache/arrow-site#520.

Copy link

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit f5ac05c.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

jorisvandenbossche pushed a commit to apache/arrow-site that referenced this pull request May 17, 2024
My attempt at updating docs/python/install.html with teh changes in
apache/arrow#41135.

I generated the docs locally, copied the generated install.html into
arrow-site, and then only committed the hunks I know changed. I didn't
commit the entire changed file since the diff included many more
changes, some of which looked like they'd break the page.
@jorisvandenbossche
Copy link
Member

I think we'll want this to appear on the website sooner than the 17 release I've filed apache/arrow-site#520.

Thanks for dong that! I also labeled the issue here with backport-candidate label, so that in case we do a 16.1.1 release, it will get included

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…onda package split (apache#41135)

Do not merge until some discussion is had about how to time this relative to conda-forge/arrow-cpp-feedstock#1376. Additionally, consider hot-patching this into arrow-site if appropriate.

### What changes are included in this PR?

Updates to the [Python installation docs](https://arrow.apache.org/docs/python/install.html) to reflect the in-progress change splitting PyArrow on conda-forge into three separate packages. Specifically:

1. Add a note in the conda section highlighting that there are three packages and linking to a new section (2) in order to provide more information
2. Add a new section, linked from (1), providing a comparison of each package as a table

### Are these changes tested?

These are just docs changes. I have built them locally and they look fine.

### Are there any user-facing changes?

Just docs.
* GitHub Issue: apache#41105

Lead-authored-by: Bryce Mecum <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Bryce Mecum <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants