-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41105: [Python][Docs] Update PyArrow installation docs for conda package split #41135
Conversation
docs/source/python/install.rst
Outdated
+------------+------------------------------+------------------------------+------------------------------+ | ||
| Component | pyarrow | pyarrow-core | pyarrow-all | | ||
+============+==============================+==============================+==============================+ | ||
| Core | :fas:`check;sd-text-success` | :fas:`check;sd-text-success` | :fas:`check;sd-text-success` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to reviewers: I used "Core" here for lack of a better word. I wasn't aware of any common way of referring to the functionality libarrow provides to PyArrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe Arrow C++ minimal
or Arrow minimal
? if it isn't too long (mentioned here: https://arrow.apache.org/docs/dev/developers/cpp/building.html#optional-components)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing that out, it might work.
docs/source/python/install.rst
Outdated
| Flight SQL | | | :fas:`check;sd-text-success` | | ||
+------------+------------------------------+------------------------------+------------------------------+ | ||
| Gandiva | | | :fas:`check;sd-text-success` | | ||
+------------+------------------------------+------------------------------+------------------------------+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rows in the table above are based on @raulcd's breakdown on the mailing list but I think we want to be careful about what's listed here. I think the best thing would be for this table to fully align with the submodules exposed in PyArrow since that's what the user is most familiar with. And we might even consider renaming the "Component" column to "Module" and using the literal module names. i.e., parquet
instead of Parquet so it's clear we're talking about being able to run import pyarrow.parquet
or not.
This would mean the table isn't complete yet (json, csv, filesystems, orc, more?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with aligning with PyArrow on the names 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In looking closer, I think it makes sense to align on conda package names too so in cccc7ff I updated the table to include the names of the conda package for each component.
I need to look at where the modules I mentioned above (JSON, etc) live because I think that should be clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to look at where the modules I mentioned above (JSON, etc) live because I think that should be clear.
Right now all the things you mentioned (json, csv, filesystems, orc) are included in the main libarrow.so, and so can only be either enabled or disabled (cannot be installed separately).
For the filesystems, there is work under way to actually build them as separate libraries, so we will have to update this doc section once that is done. For ORC I am actually wondering if we should open an issue about splitting that in a separate library (although liborc is not that big of a dependency, so it might not be super important)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks both of you. The latest draft now tries to make it clear which Python modules each package provides. It'd be good to get reviews on the overall style and also the correctnes/completeness.
@github-actions crossbow submit preview-docs |
Revision: 4fc75b7 Submitted crossbow builds: ursacomputing/crossbow @ actions-eab855c0d2
|
We might want to add an example about how to create an env with a custom selection by installing the core package plus just one additional module. I think that will be something like |
Oh, that's really neat. I added a subsection about custom selections but let me know if you think it needs higher billing. Next steps on this PR are:
|
@github-actions crossbow submit preview-docs |
Revision: be37b52 Submitted crossbow builds: ursacomputing/crossbow @ actions-c33978f0b1
|
Thanks for doing this @amoeba. If possible, can we include some more information about which parts of |
Thanks @ianmcook, I debated how to present this. The two options were to either list the Python modules or the shared libraries and I went with the latter in the current revision, though this may have been mostly out of convenience. Your suggestion would be to go with the former and that's probably better. I'll take a look a that and see about updating the table. |
Presenting the table showing how the three packages map onto the shared libraries seems worthwhile regardless, because it's a clean mapping. I don't know whether the Python modules map as cleanly onto these three packages. It sounds like maybe they don't. The comment from @h-vetinari suggests that some So maybe a longer prose explanation or a bulleted list will be required to explain the latter. |
Good points, thanks. |
I thought |
I just answered on the feedstock issue (conda-forge/arrow-cpp-feedstock#1255 (comment)) before seeing the discussion here, but indeed, the individual compute kernels (what is in We have an issue about splitting the compute kernels into its own shared library (#25025), and that would allow it to be installed separately as well like a libarrow-compute, but until then it's either all included in the main libarrow.so or not (if you would build libarrow.so without compute support) |
docs/source/python/install.rst
Outdated
- :ref:`compute` (i.e., ``pyarrow.compute``) | ||
- :ref:`io` | ||
- :ref:`ipc` (i.e., ``pyarrow.ipc``) | ||
- :ref:`filesystem` (HDFS, S3, GCS, etc.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we list that here, I think we should also say that those cloud filesystem are planned to moved out of pyarrow-core in the next release, and so you should install pyarrow
if you want to rely on those being present
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pyarrow.fs
itself is always available, though (with at a minimum just the LocalFileSystem)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, thanks. Changed in 51f2fde.
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
@github-actions crossbow submit preview-docs |
Revision: 51f2fde Submitted crossbow builds: ursacomputing/crossbow @ actions-a5a94423db
|
Thanks everyone for the reviews. I've addressed what comments have been left (thank you!) and I think this is very close to merge-able. The preview for the latest changeset is at http://crossbow.voltrondata.com/pr_docs/41135/python/install.html#using-conda. |
Thanks for the reviews all. I think we'll want this to appear on the website sooner than the 17 release I've filed apache/arrow-site#520. |
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit f5ac05c. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them. |
My attempt at updating docs/python/install.html with teh changes in apache/arrow#41135. I generated the docs locally, copied the generated install.html into arrow-site, and then only committed the hunks I know changed. I didn't commit the entire changed file since the diff included many more changes, some of which looked like they'd break the page.
Thanks for dong that! I also labeled the issue here with backport-candidate label, so that in case we do a 16.1.1 release, it will get included |
…onda package split (apache#41135) Do not merge until some discussion is had about how to time this relative to conda-forge/arrow-cpp-feedstock#1376. Additionally, consider hot-patching this into arrow-site if appropriate. ### What changes are included in this PR? Updates to the [Python installation docs](https://arrow.apache.org/docs/python/install.html) to reflect the in-progress change splitting PyArrow on conda-forge into three separate packages. Specifically: 1. Add a note in the conda section highlighting that there are three packages and linking to a new section (2) in order to provide more information 2. Add a new section, linked from (1), providing a comparison of each package as a table ### Are these changes tested? These are just docs changes. I have built them locally and they look fine. ### Are there any user-facing changes? Just docs. * GitHub Issue: apache#41105 Lead-authored-by: Bryce Mecum <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Bryce Mecum <[email protected]>
Do not merge until some discussion is had about how to time this relative to conda-forge/arrow-cpp-feedstock#1376. Additionally, consider hot-patching this into arrow-site if appropriate.
What changes are included in this PR?
Updates to the Python installation docs to reflect the in-progress change splitting PyArrow on conda-forge into three separate packages. Specifically:
Are these changes tested?
These are just docs changes. I have built them locally and they look fine.
Are there any user-facing changes?
Just docs.