Change DB dependencies to allow async #110

leo-mazzone · 2025-04-03T17:38:27Z

Context

Changes proposed in this pull request

Remove connectorx. Use ADBC instead
Use the real polars, instead of polars-lts-cpu
Upgrade psycopg2 to psycopg, and install its async dependencies. psycopg2 is retained as a dev dependency for the tests that check that using a different driver does not affect the source address
Install async extension for SQLAlchemy
Make a few small changes to get tests to pass following changes in dependencies

Guidance to review

I agree that passing both a SQLAlchemy engine and an adbc_connection to sql_to_df is ugly, but it also... works? The reason is that we use SQLAlchemy to prepare statements even when we don't use it to run them.

There is an alternative that I've been toying with: continuing to use polars.read_database_uri, but in ADBC mode instead of connectorx mode. The ADBC docs seem to suggest it as well. However, it rules out batching, and I couldn't find proof that one approach is faster than the other. Maybe we should do an empirical test (later?).
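To make the batching point concrete, here is a minimal sketch of reading a result set in fixed-size batches through a plain DB-API cursor. It is not the code in this PR: `iter_batches` is a hypothetical helper, and the stdlib `sqlite3` module stands in for the ADBC connection so the snippet runs without a database server.

```python
import sqlite3
from collections.abc import Iterator
from contextlib import closing


def iter_batches(connection, query: str, batch_size: int) -> Iterator[list[tuple]]:
    """Yield the rows of `query` in lists of at most `batch_size` rows."""
    with closing(connection.cursor()) as cursor:
        cursor.execute(query)
        while True:
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break
            yield rows


# Stand-in database: an in-memory SQLite table with five rows.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE items (id INTEGER)")
connection.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(5)])

batch_sizes = [
    len(batch)
    for batch in iter_batches(connection, "SELECT id FROM items ORDER BY id", 2)
]
connection.close()
```

Reading everything in one call hands the caller a single result, so there is no point at which a consumer can process and discard part of the data the way it can between the `yield`s above.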

Another thing I'm not doing is connection pooling - see here. I create a new connection every time I need one, which will probably take about half a second. When calling sql_to_df, I expect that to be much shorter than the query itself.
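For comparison, this is roughly what a pool buys us: a connection is opened once and handed out repeatedly instead of being recreated per call. The sketch below is an illustration under assumptions, not a proposal for this PR. `SimplePool` is a hypothetical class, it opens its connections eagerly for simplicity, and `sqlite3` again stands in for the real driver.

```python
import sqlite3
from collections.abc import Callable, Iterator
from contextlib import contextmanager
from queue import Queue


class SimplePool:
    """Hypothetical fixed-size pool that reuses already-open connections."""

    def __init__(self, connect: Callable[[], sqlite3.Connection], size: int) -> None:
        self._idle: Queue = Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self) -> Iterator[sqlite3.Connection]:
        conn = self._idle.get()  # blocks while every connection is checked out
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand the connection back instead of closing it

    def close(self) -> None:
        """Close every idle connection."""
        while not self._idle.empty():
            self._idle.get_nowait().close()


pool = SimplePool(lambda: sqlite3.connect(":memory:"), size=1)

with pool.connection() as first:
    first_id = id(first)
with pool.connection() as second:
    reused = id(second) == first_id  # same object, so no reconnect cost

pool.close()
```

Whether this is worth the extra moving parts depends on how the connection cost compares with the query time, which is the open question above.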

Relevant links

Installation instructions for psycopg
From ADBC docs: "the connection must be closed after usage or memory may leak"
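The quoted warning is why every connection needs a guaranteed close, even when the query raises. A minimal sketch of that pattern, assuming only a DB-API style connection: `run_query` and `connect` are hypothetical names, and `sqlite3` stands in for the ADBC driver so the snippet is runnable as-is.

```python
import sqlite3
from contextlib import closing


def run_query(connect, query: str) -> list[tuple]:
    """Open a connection, run `query`, and always close the connection."""
    with closing(connect()) as connection:
        with closing(connection.cursor()) as cursor:
            cursor.execute(query)
            return cursor.fetchall()


opened = []


def connect() -> sqlite3.Connection:
    # Keep a reference only so we can check afterwards that it was closed.
    conn = sqlite3.connect(":memory:")
    opened.append(conn)
    return conn


rows = run_query(connect, "SELECT 1")

try:
    opened[0].execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    # sqlite3 refuses to operate on a closed connection.
    still_open = False
```

`contextlib.closing` calls `close()` on exit whether or not an exception was raised, so the caller cannot forget it.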

Checklist:

My code follows the style guidelines of this project
New and existing unit tests pass locally with my changes
I've changed or updated relevant documentation

wpfl-dbt · 2025-04-04T07:03:24Z

Bar a couple of uncontroversial changes (the log name, for example), I would have approved 99% of this, but I don't like passing two arguments to sql_to_df(). My solution is to detect the dialect and transpile with SQLGlot, which was already in the stack but is now updated to be quicker with Rust. sql_to_df() can also now take a query string directly, so if you'd rather avoid SQLGlot you can compile in the backend itself.

Also worth noting: on my Intel machine the tests now take 20-30 seconds longer, presumably because of the connection pooling change. I think this is pretty significant, and it might be noticeable server-side. I'm raising a ticket to look into it.

wpfl-dbt · 2025-04-04T07:52:32Z

Added connection pooling. It didn't help with test speed, but we have connection pooling!

leo-mazzone · 2025-04-04T09:05:34Z

Added a few comments as I'm thinking about these issues as part of my L&D today.

wpfl-dbt · 2025-04-04T10:10:25Z

@leo-mazzone I propose separating the sql_to_df() concerns and rolling back the crap pooling implementation to just get this done.

wpfl-dbt

Happy

leo-mazzone added 5 commits April 3, 2025 17:17

Change DB dependencies

055a8ce

Use ADBC for backend query util

c958fd2

In server, pass adbc_connection to sql_to_df everywhere

9cfb72c

Merge branch 'main' into feature/async-db-dependencies

fdb6611

Use ADBC connection with context managers

bc34d87

wpfl-dbt reviewed Apr 4, 2025

pyproject.toml

wpfl-dbt reviewed Apr 4, 2025

src/matchbox/common/db.py

wpfl-dbt reviewed Apr 4, 2025

src/matchbox/server/postgresql/db.py

wpfl-dbt reviewed Apr 4, 2025

src/matchbox/common/db.py

Brought in SQLGlot to transpile when ADBC is used with Select.

ebd5918

Added connection pooling.

8c795c3

Removed eager connection.

8f25122