Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change DB dependencies to allow async #110

Merged
merged 10 commits into from
Apr 7, 2025

Conversation

leo-mazzone
Copy link
Collaborator

@leo-mazzone leo-mazzone commented Apr 3, 2025

Context

Changes proposed in this pull request

  • Remove connectorx. Use ADBC instead
  • Use the real polars, instead of polars-lts-cpu
  • Upgrade psycopg2 to psycopg (but retain psycopg2 as a dev dependency for some tests, the ones that need to check different drivers not affecting the source address). Also install its async dependencies
  • Install async extension for SQLAlchemy
  • Make a few small changes to get tests to pass following changes in dependencies

Guidance to review

I agree, the way I'm passing both a SQLAlchemy engine as well as an adbc_connection to sql_to_df is ugly, but it also... works? The thing is that we're using SQLAlchemy for preparing statements, even when we don't use it for running those statements.

There is an alternative to this that I've been toying with: continuing to use polars.read_database_uri, using the ADBC mode instead of the connectorx mode. It seems to be suggested by the ADBC docs as well. However, it negates batching, and I couldn't find proof one would be faster than another. Maybe we should do an empirical test (later?).

Another thing I'm not doing is connection pooling - see here. I am creating a new connection every time I need it, which probably will take about half a second. It feels like it will be comparatively much shorter than the query if calling sql_to_df.

Relevant links

Checklist:

  • My code follows the style guidelines of this project
  • New and existing unit tests pass locally with my changes
  • I've changed or updated relevant documentation

@wpfl-dbt
Copy link
Collaborator

wpfl-dbt commented Apr 4, 2025

Bar a couple of uncontroversial changes (the log name, for example) I would have approved 99% of this -- but I don't like passing two arguments to sql_to_df(). My solution is to detect the dialect and transpile with SQLGlot, which was already in the stack but now updated to be quicker with Rust. sql_to_df() can also now take a query string directly, so if you want to avoid SQLGlot you can just compile in the backend itself, if you prefer.

Also worth noting is that on my Intel machine tests now take 20-30 seconds longer, presumably from the connection pooling change. I think this is pretty significant, and might be palpably felt server-side. I'm raising a ticket to look into it.

@wpfl-dbt
Copy link
Collaborator

wpfl-dbt commented Apr 4, 2025

Added connection pooling. It didn't help with test speed, but we have connection pooling!

@leo-mazzone
Copy link
Collaborator Author

Added a few comments as I'm thinking about these issues as part of my L&D today.

@wpfl-dbt
Copy link
Collaborator

wpfl-dbt commented Apr 4, 2025

Added a few comments as I'm thinking about these issues as part of my L&D today.

@leo-mazzone I propose separating the sql_to_df() concerns and rolling back the crap pooling implementation to just get this done.

Copy link
Collaborator

@wpfl-dbt wpfl-dbt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy

@leo-mazzone leo-mazzone merged commit e46f70a into main Apr 7, 2025
8 checks passed
@leo-mazzone leo-mazzone deleted the feature/async-db-dependencies branch April 7, 2025 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants