[SPARK-52561][PYTHON][INFRA] Upgrade the minimum version of Python to 3.10 #51259


Closed
zhengruifeng wants to merge 16 commits into apache:master from zhengruifeng:py_min_310

Conversation

@zhengruifeng zhengruifeng commented Jun 24, 2025

What changes were proposed in this pull request?

Upgrade the minimum version of Python to 3.10

Why are the changes needed?

Python 3.9 is reaching its end of life (EOL).

Does this PR introduce any user-facing change?

Yes, documentation changes.

How was this patch tested?

PR builder with upgraded image

https://github.com/zhengruifeng/spark/actions/runs/16064529566/job/45340924656

Was this patch authored or co-authored using generative AI tooling?

No
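
For context, a bump like this typically lands as a runtime guard plus packaging-metadata changes. A minimal sketch of the guard, assuming the same message style as the `python/run-tests` check quoted later in this thread (this is illustrative, not the PR's actual diff):

    # Minimal sketch of a Python-floor guard; not the PR's actual diff.
    import sys

    if sys.version_info < (3, 10):
        raise RuntimeError(
            "Python versions prior to 3.10 are not supported; "
            f"found {sys.version.split()[0]}"
        )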

@HyukjinKwon (Member) left a comment


We would need to fix more places, listed below (a sketch of the corresponding `setup.py` change follows the list). This can be done in a separate PR:

.github/workflows/build_infra_images_cache.yml:      - name: Build and push (PySpark with Python 3.9)
.github/workflows/build_infra_images_cache.yml:      - name: Image digest (PySpark with Python 3.9)
.github/workflows/build_python_3.9.yml:name: "Build / Python-only (master, Python 3.9)"
dev/create-release/spark-rm/Dockerfile:# Install Python 3.9
dev/infra/Dockerfile:# Install Python 3.9
dev/spark-test-image/python-309/Dockerfile:LABEL org.opencontainers.image.ref.name="Apache Spark Infra Image For PySpark with Python 3.09"
dev/spark-test-image/python-309/Dockerfile:# Install Python 3.9
dev/spark-test-image/python-309/Dockerfile:# Python deps for Spark Connect
dev/spark-test-image/python-309/Dockerfile:# Install Python 3.9 packages
dev/spark-test-image/python-minimum/Dockerfile:# Install Python 3.9
dev/spark-test-image/python-minimum/Dockerfile:# Install Python 3.9 packages
dev/spark-test-image/python-ps-minimum/Dockerfile:# Install Python 3.9
dev/spark-test-image/python-ps-minimum/Dockerfile:# Install Python 3.9 packages
docs/index.md:Spark runs on Java 17/21, Scala 2.13, Python 3.9+, and R 3.5+ (Deprecated).
docs/rdd-programming-guide.md:Spark {{site.SPARK_VERSION}} works with Python 3.9+. It can use the standard CPython interpreter,
python/docs/source/development/contributing.rst:    # Python 3.9+ is required
python/docs/source/development/contributing.rst:With Python 3.9+, pip can be used as below to install and set up the development environment.
python/docs/source/getting_started/install.rst:Python 3.9 and above.
python/docs/source/tutorial/pandas_on_spark/typehints.rst:With Python 3.9+, you can specify the type hints by using pandas instances as follows:
python/packaging/classic/setup.py:            "Programming Language :: Python :: 3.9",
python/packaging/client/setup.py:            "Programming Language :: Python :: 3.9",
python/packaging/connect/setup.py:            "Programming Language :: Python :: 3.9",
python/pyspark/cloudpickle/cloudpickle.py:        # "nogil" Python: modified attributes from 3.9
python/pyspark/pandas/typedef/typehints.py:# TODO: Remove this variadic-generic hack by tuple once ww drop Python up to 3.9.
python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py:        # SPARK-30921: We should not pushdown predicates of PythonUDFs through Aggregate.
python/pyspark/sql/udf.py:    # Note: Python 3.9.15, Pandas 1.5.2 and PyArrow 10.0.1 are used.
python/run-tests:  echo "Python versions prior to 3.9 are not supported."
.github/workflows/build_and_test.yml:            python3.9 ./dev/structured_logging_style.py
.github/workflows/build_and_test.yml:        python3.9 -m pip install 'flake8==3.9.0' pydata_sphinx_theme 'mypy==0.982' 'pytest==7.1.3' 'pytest-mypy-plugins==1.9.3' numpydoc 'jinja2<3.0.0' 'black==22.6.0'
.github/workflows/build_and_test.yml:        python3.9 -m pip install 'pandas-stubs==1.2.0.53' ipython 'grpcio==1.56.0' 'grpc-stubs==1.24.11' 'googleapis-common-protos-stubs==2.2.0'
.github/workflows/build_and_test.yml:      run: python3.9 -m pip list
.github/workflows/build_and_test.yml:      run: PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
.github/workflows/build_and_test.yml:        python3.9 -m pip install 'protobuf==4.25.1' 'mypy-protobuf==3.3.0'
.github/workflows/build_and_test.yml:      run: if test -f ./dev/connect-check-protos.py; then PATH=$PATH:$HOME/buf/bin PYTHON_EXECUTABLE=python3.9 ./dev/connect-check-protos.py; fi
.github/workflows/build_and_test.yml:      PYSPARK_DRIVER_PYTHON: python3.9
.github/workflows/build_and_test.yml:      PYSPARK_PYTHON: python3.9
.github/workflows/build_and_test.yml:        python3.9 -m pip install 'sphinx==4.5.0' mkdocs 'pydata_sphinx_theme>=0.13' sphinx-copybutton nbsphinx numpydoc jinja2 markupsafe 'pyzmq<24.0.0' 'sphinxcontrib-applehelp==1.0.4' 'sphinxcontrib-devhelp==1.0.2' 'sphinxcontrib-htmlhelp==2.0.1' 'sphinxcontrib-qthelp==1.0.3' 'sphinxcontrib-serializinghtml==1.1.5'
.github/workflows/build_and_test.yml:        python3.9 -m pip install ipython_genutils # See SPARK-38517
.github/workflows/build_and_test.yml:        python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' pyarrow pandas 'plotly<6.0.0'
.github/workflows/build_and_test.yml:        python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421
.github/workflows/build_and_test.yml:      run: python3.9 -m pip list
.github/workflows/build_and_test.yml:        # We need this link to make sure `python3` points to `python3.9` which contains the prerequisite packages.
.github/workflows/build_and_test.yml:        ln -s "$(which python3.9)" "/usr/local/bin/python3"
.github/workflows/build_and_test.yml:          pyspark_modules=`cd dev && python3.9 -c "import sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if m.name.startswith('pyspark')))"`
.github/workflows/build_infra_images_cache.yml:    - 'dev/spark-test-image/python-309/Dockerfile'
.github/workflows/build_infra_images_cache.yml:        if: hashFiles('dev/spark-test-image/python-309/Dockerfile') != ''
.github/workflows/build_infra_images_cache.yml:        id: docker_build_pyspark_python_309
.github/workflows/build_infra_images_cache.yml:          context: ./dev/spark-test-image/python-309/
.github/workflows/build_infra_images_cache.yml:          tags: ghcr.io/apache/spark/apache-spark-github-action-image-pyspark-python-309-cache:${{ github.ref_name }}-static
.github/workflows/build_infra_images_cache.yml:          cache-from: type=registry,ref=ghcr.io/apache/spark/apache-spark-github-action-image-pyspark-python-309-cache:${{ github.ref_name }}
.github/workflows/build_infra_images_cache.yml:          cache-to: type=registry,ref=ghcr.io/apache/spark/apache-spark-github-action-image-pyspark-python-309-cache:${{ github.ref_name }},mode=max
.github/workflows/build_infra_images_cache.yml:        if: hashFiles('dev/spark-test-image/python-309/Dockerfile') != ''
.github/workflows/build_infra_images_cache.yml:        run: echo ${{ steps.docker_build_pyspark_python_309.outputs.digest }}
.github/workflows/build_python_3.9.yml:          "PYSPARK_IMAGE_TO_TEST": "python-309",
.github/workflows/build_python_3.9.yml:          "PYTHON_TO_TEST": "python3.9"
.github/workflows/build_python_minimum.yml:          "PYTHON_TO_TEST": "python3.9"
.github/workflows/build_python_ps_minimum.yml:          "PYTHON_TO_TEST": "python3.9"
README.md:|            | [![GitHub Actions Build](https://github.com/apache/spark/actions/workflows/build_python_3.9.yml/badge.svg)](https://github.com/apache/spark/actions/workflows/build_python_3.9.yml)                             |
dev/create-release/spark-rm/Dockerfile:    python3.9 python3.9-distutils \
dev/create-release/spark-rm/Dockerfile:RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.9
dev/create-release/spark-rm/Dockerfile:RUN python3.9 -m pip install --ignore-installed blinker>=1.6.2 # mlflow needs this
dev/create-release/spark-rm/Dockerfile:RUN python3.9 -m pip install --force $BASIC_PIP_PKGS unittest-xml-reporting $CONNECT_PIP_PKGS && \
dev/create-release/spark-rm/Dockerfile:    python3.9 -m pip install 'torch<2.6.0' torchvision --index-url https://download.pytorch.org/whl/cpu && \
dev/create-release/spark-rm/Dockerfile:    python3.9 -m pip install torcheval && \
dev/create-release/spark-rm/Dockerfile:    python3.9 -m pip cache purge
dev/create-release/spark-rm/Dockerfile:RUN python3.9 -m pip install 'sphinx==4.5.0' mkdocs 'pydata_sphinx_theme>=0.13' sphinx-copybutton nbsphinx numpydoc jinja2 markupsafe 'pyzmq<24.0.0' \
dev/create-release/spark-rm/Dockerfile:RUN python3.9 -m pip list
dev/create-release/spark-rm/Dockerfile:RUN ln -s "$(which python3.9)" "/usr/local/bin/python"
dev/create-release/spark-rm/Dockerfile:RUN ln -s "$(which python3.9)" "/usr/local/bin/python3"
dev/infra/Dockerfile:    python3.9 python3.9-distutils \
dev/infra/Dockerfile:RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.9
dev/infra/Dockerfile:RUN python3.9 -m pip install --ignore-installed blinker>=1.6.2 # mlflow needs this
dev/infra/Dockerfile:RUN python3.9 -m pip install --force $BASIC_PIP_PKGS unittest-xml-reporting $CONNECT_PIP_PKGS && \
dev/infra/Dockerfile:    python3.9 -m pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu && \
dev/infra/Dockerfile:    python3.9 -m pip install torcheval && \
dev/infra/Dockerfile:    python3.9 -m pip cache purge
dev/spark-test-image-util/docs/run-in-container:# We need this link to make sure `python3` points to `python3.9` which contains the prerequisite packages.
dev/spark-test-image-util/docs/run-in-container:ln -s "$(which python3.9)" "/usr/local/bin/python3"
dev/spark-test-image/python-309/Dockerfile:    libpython3-dev \
dev/spark-test-image/python-309/Dockerfile:    python3.9 \
dev/spark-test-image/python-309/Dockerfile:    python3.9-distutils \
dev/spark-test-image/python-309/Dockerfile:RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.9
dev/spark-test-image/python-309/Dockerfile:RUN python3.9 -m pip install --ignore-installed blinker>=1.6.2 # mlflow needs this
dev/spark-test-image/python-309/Dockerfile:RUN python3.9 -m pip install --force $BASIC_PIP_PKGS unittest-xml-reporting $CONNECT_PIP_PKGS && \
dev/spark-test-image/python-309/Dockerfile:    python3.9 -m pip install 'torch<2.6.0' torchvision --index-url https://download.pytorch.org/whl/cpu && \
dev/spark-test-image/python-309/Dockerfile:    python3.9 -m pip install torcheval && \
dev/spark-test-image/python-309/Dockerfile:    python3.9 -m pip cache purge
dev/spark-test-image/python-minimum/Dockerfile:    python3.9 \
dev/spark-test-image/python-minimum/Dockerfile:    python3.9-distutils \
dev/spark-test-image/python-minimum/Dockerfile:RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.9
dev/spark-test-image/python-minimum/Dockerfile:RUN python3.9 -m pip install --force $BASIC_PIP_PKGS $CONNECT_PIP_PKGS && \
dev/spark-test-image/python-minimum/Dockerfile:    python3.9 -m pip cache purge
dev/spark-test-image/python-ps-minimum/Dockerfile:    python3.9 \
dev/spark-test-image/python-ps-minimum/Dockerfile:    python3.9-distutils \
dev/spark-test-image/python-ps-minimum/Dockerfile:RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.9
dev/spark-test-image/python-ps-minimum/Dockerfile:RUN python3.9 -m pip install --force $BASIC_PIP_PKGS $CONNECT_PIP_PKGS && \
dev/spark-test-image/python-ps-minimum/Dockerfile:    python3.9 -m pip cache purge
python/docs/source/development/contributing.rst:    conda create --name pyspark-dev-env python=3.9
python/docs/source/getting_started/install.rst:    conda install -c conda-forge pyspark  # can also add "python=3.9 some_package [etc.]" here
python/packaging/classic/setup.py:        python_requires=">=3.9",
python/packaging/client/setup.py:        python_requires=">=3.9",
python/packaging/connect/setup.py:        python_requires=">=3.9",
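
For the `python/packaging/*/setup.py` entries at the end of this list, the fix amounts to bumping `python_requires` and the trove classifiers. A hedged sketch of the updated stanza (an excerpt only: the surrounding `setup()` arguments are omitted, and the exact classifier list shown is illustrative):

    # Excerpt-style sketch of the setup() keywords after the bump; other
    # arguments (name, version, packages, ...) are omitted for brevity.
    from setuptools import setup

    setup(
        python_requires=">=3.10",
        classifiers=[
            "Programming Language :: Python :: 3.10",
            "Programming Language :: Python :: 3.11",
            "Programming Language :: Python :: 3.12",
        ],
    )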

@xinrong-meng (Member)

Spark in yarn-client mode seems to be failing.

@zhengruifeng zhengruifeng marked this pull request as draft June 25, 2025 03:17
@zhengruifeng (Contributor, Author)

Converting to draft since the jobs get stuck.

@dongjoon-hyun (Member) left a comment


Thank you for working on this.

@allisonwang-db (Contributor) left a comment


Thanks for the fix 🎉

@zhengruifeng zhengruifeng force-pushed the py_min_310 branch 2 times, most recently from 2f06a3e to fa6dc2d, July 3, 2025 08:39


- ARG BASIC_PIP_PKGS="numpy==1.21 pyarrow==11.0.0 pandas==2.0.0 six==1.16.0 scipy scikit-learn coverage unittest-xml-reporting"
+ ARG BASIC_PIP_PKGS="numpy==1.22.4 pyarrow==11.0.0 pandas==2.2.0 six==1.16.0 scipy scikit-learn coverage unittest-xml-reporting"
@zhengruifeng (Contributor, Author) commented Jul 3, 2025


The previous pins numpy==1.21 and pandas==2.0.0 no longer work with Python 3.10; after testing a few combinations, I think we need to upgrade them as well.
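
A quick way to sanity-check a candidate pin set locally is a floor assertion like the one below; the floors mirror the updated ARG line above, and the check itself is illustrative rather than part of the PR:

    # Illustrative floor check for the candidate pins discussed here.
    from packaging.version import Version

    import numpy
    import pandas
    import pyarrow

    for mod, floor in [(numpy, "1.22.4"), (pandas, "2.2.0"), (pyarrow, "11.0.0")]:
        assert Version(mod.__version__) >= Version(floor), (
            f"{mod.__name__} {mod.__version__} is below the assumed floor {floor}"
        )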

dongjoon-hyun added a commit that referenced this pull request Jul 4, 2025
### What changes were proposed in this pull request?

This PR aims to remove Python 3.9 GitHub Action CI for `Apache Spark 4.1.0`.
- https://github.com/apache/spark/actions/workflows/build_python_3.9.yml

This PR doesn't aim to delete Python 3.9 infra image because it can be used by `branch-4.0`.

### Why are the changes needed?

Python 3.9 will reach the end of support in October.
- https://devguide.python.org/versions/#supported-versions

We are already moving the minimum requirement to Python 3.10.
- #51259

### Does this PR introduce _any_ user-facing change?

No, this is an infra change.

### How was this patch tested?

Manual review because this is a removal of test coverage.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #51371 from dongjoon-hyun/SPARK-52680.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@zhengruifeng zhengruifeng marked this pull request as ready for review July 4, 2025 04:40
@zhengruifeng (Contributor, Author)

Hi @HyukjinKwon @dongjoon-hyun @allisonwang-db @xinrong-meng
After upgrading the minimum version of Python from 3.9 to 3.10, I found that pandas==2.0/2.1 no longer works (a batch of Python tests hang or fail); 2.2.0 is the minimum version that passes all Python tests.
So shall we also upgrade the minimum version of pandas to 2.2.0 in 4.1? (We can do it in a separate PR.)
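
PySpark enforces dependency floors at import time (see `require_minimum_pandas_version` in `python/pyspark/sql/pandas/utils.py`); a minimal sketch of that pattern, assuming the floor moves to 2.2.0 as proposed here (the constant name and error wording are hypothetical, not Spark's exact code):

    # Sketch of an import-time pandas floor check with the proposed 2.2.0
    # minimum; not the exact Spark implementation.
    MINIMUM_PANDAS_VERSION = "2.2.0"  # hypothetical constant for this sketch

    def require_minimum_pandas_version() -> None:
        from packaging.version import Version

        try:
            import pandas
        except ImportError as err:
            raise ImportError(
                f"pandas >= {MINIMUM_PANDAS_VERSION} must be installed"
            ) from err
        if Version(pandas.__version__) < Version(MINIMUM_PANDAS_VERSION):
            raise ImportError(
                f"pandas >= {MINIMUM_PANDAS_VERSION} must be installed; "
                f"found {pandas.__version__}"
            )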

@zhengruifeng zhengruifeng changed the title [SPARK-52561][PYTHON][INFRA] Upgrade the minimum version of Python to 3.10 [WIP][SPARK-52561][PYTHON][INFRA] Upgrade the minimum version of Python to 3.10 Jul 4, 2025
@zhengruifeng zhengruifeng marked this pull request as draft July 4, 2025 04:51
@zhengruifeng zhengruifeng changed the title [WIP][SPARK-52561][PYTHON][INFRA] Upgrade the minimum version of Python to 3.10 [SPARK-52561][PYTHON][INFRA] Upgrade the minimum version of Python to 3.10 Jul 4, 2025
@zhengruifeng zhengruifeng marked this pull request as ready for review July 4, 2025 05:58
@dongjoon-hyun (Member) left a comment


+1, LGTM.

@dongjoon-hyun (Member)

cc @peter-toth

@zhengruifeng zhengruifeng deleted the py_min_310 branch July 7, 2025 01:45
asl3 pushed a commit to asl3/spark that referenced this pull request Jul 14, 2025 (the SPARK-52680 commit message quoted above, with apache#51371 / apache#51259 references).
asl3 pushed a commit to asl3/spark that referenced this pull request Jul 14, 2025: this PR's own commit, [SPARK-52561][PYTHON][INFRA] Upgrade the minimum version of Python to 3.10, whose message matches the PR description above. Closes apache#51259 from zhengruifeng/py_min_310. Authored-by: Ruifeng Zheng <[email protected]>. Signed-off-by: Dongjoon Hyun <[email protected]>.
haoyangeng-db pushed a commit to haoyangeng-db/apache-spark that referenced this pull request Jul 22, 2025 (the same SPARK-52680 commit message quoted above).
haoyangeng-db pushed a commit to haoyangeng-db/apache-spark that referenced this pull request Jul 22, 2025 (this PR's commit message, as above).