
Add support for Spark Connect (SQL models) #899

Closed
wants to merge 2 commits

Conversation

@vakarisbk vakarisbk commented Oct 3, 2023

partially resolves dbt-labs/dbt-adapters#493
docs dbt-labs/docs.getdbt.com/#

Problem

dbt-spark has limited options for open-source Spark integrations. Currently, the only available method for running dbt against open-source Spark in production is a Thrift connection. However, a Thrift connection isn't suitable for all use cases; for instance, it doesn't support Thrift over HTTP. In addition, the PyHive project, which dbt's Thrift connection relies on, is unmaintained (at least according to its GitHub page).

Solution

This PR proposes introducing support for Spark Connect (for SQL models only).

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

How to test locally?

  1. Follow the instructions in the Spark Connect documentation to download a Spark distribution: https://spark.apache.org/docs/latest/spark-connect-overview.html
  2. Start the Spark Connect server with the Hive metastore enabled: ./start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.0 --conf spark.sql.catalogImplementation=hive
  3. Add the Spark Connect configuration to your profiles.yml:
spark_connect:
  outputs:
    dev:
      host: localhost
      method: connect
      port: 15002
      schema: default
      type: spark
  target: dev
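The connect method above ultimately points clients at a Spark Connect endpoint of the form sc://host:port. As a rough sketch of how the profile fields map to that endpoint (connect_url is a hypothetical helper for illustration, not part of dbt-spark):

```python
def connect_url(host: str, port: int = 15002) -> str:
    """Build the sc:// remote URL that Spark Connect clients expect.

    Hypothetical helper for illustration only; dbt-spark's actual
    implementation may construct the endpoint differently.
    """
    return f"sc://{host}:{port}"

# The profile above (host: localhost, port: 15002) resolves to:
print(connect_url("localhost", 15002))  # sc://localhost:15002
```

With the server from step 2 running, that URL can be handed to pyspark directly, e.g. SparkSession.builder.remote("sc://localhost:15002").getOrCreate(), to sanity-check connectivity before running dbt.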

Known issues: dbt-labs/dbt-adapters#487


cla-bot bot commented Oct 3, 2023

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Vakaris.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check whether your git client is configured with an email to sign commits: git config --list | grep email
  2. If not, set it up using: git config --global user.email [email protected]
  3. Make sure the commit email is configured in your GitHub account settings: https://github.com/settings/emails

1 similar comment
@cla-bot cla-bot bot added the cla:yes label Oct 3, 2023
@vakarisbk vakarisbk changed the title [WIP] Add support for Spark Connect (SQL models) Add support for Spark Connect (SQL models) Oct 4, 2023
@vakarisbk vakarisbk marked this pull request as ready for review October 4, 2023 16:23
@vakarisbk vakarisbk requested a review from a team as a code owner October 4, 2023 16:23
@vakarisbk vakarisbk requested a review from VersusFacit October 4, 2023 16:23
setup.py Outdated
@@ -59,7 +59,16 @@ def _get_dbt_core_version():
"thrift>=0.11.0,<0.17.0",
]
session_extras = ["pyspark>=3.0.0,<4.0.0"]
all_extras = odbc_extras + pyhive_extras + session_extras
connect_extras = [
"pyspark==3.5.0",
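The hunk above is truncated; as a rough reconstruction of the extras wiring it implies (entries marked "assumed" are illustrative guesses, not visible in the diff):

```python
# Reconstruction from the truncated setup.py hunk; entries marked
# "assumed" are illustrative guesses, not confirmed by the diff.
pyhive_extras = [
    "PyHive[hive]>=0.6.0,<0.7.0",  # assumed
    "thrift>=0.11.0,<0.17.0",      # from the hunk
]
odbc_extras = ["pyodbc>=4.0.30"]            # assumed
session_extras = ["pyspark>=3.0.0,<4.0.0"]  # from the hunk
connect_extras = ["pyspark==3.5.0"]         # from the hunk
all_extras = odbc_extras + pyhive_extras + session_extras + connect_extras
```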
Contributor
Can we support pyspark>=3.4.0,<4, or at least pyspark>=3.5.0,<4?

Author

@vakarisbk vakarisbk Feb 17, 2024

pyspark>=3.5.0,<4 added.
The 3.4.0 connect module has an issue where temporary views are not shared between queries: if one dbt query creates a temp view, another query cannot see it. I can't find the Spark issue number now.
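The pin discussed here can be checked mechanically. A minimal pure-Python sketch of the pyspark>=3.5.0,<4 constraint (a simplified tuple comparison, not how pip actually resolves version specifiers):

```python
def satisfies_connect_pin(version: str) -> bool:
    """Check a dotted version string against the pyspark>=3.5.0,<4 pin.

    Simplified sketch: assumes plain numeric versions and ignores
    pre-release tags, unlike a real specifier resolver.
    """
    parts = tuple(int(p) for p in version.split("."))
    return (3, 5, 0) <= parts < (4,)

# 3.5.x is accepted; 3.4.x (temp-view issue) and 4.x are rejected.
assert satisfies_connect_pin("3.5.0")
assert not satisfies_connect_pin("3.4.0")
assert not satisfies_connect_pin("4.0.0")
```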

@vakarisbk
Author

Seeing that there is some recent activity on issue dbt-labs/dbt-adapters#493, and knowing that at least a couple of people are actively using this fork, I've updated it. Looking forward to any insights regarding the implementation, as well as the likelihood of this PR getting merged.

Contributor

This PR has been marked as Stale because it has been open with no activity as of late. If you would like the PR to remain open, please comment on the PR or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Feb 14, 2025
Contributor

Although we are closing this PR as stale, it can still be reopened to continue development. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this Feb 21, 2025
Development

Successfully merging this pull request may close these issues.

[ADAP-658] [Feature] Spark Connect as connection method
2 participants