Commit f007377

feat: Add support for pgvector's vector data type
1 parent 86bb083 commit f007377

10 files changed: +189 −19 lines

.github/workflows/ci_workflow.yml

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ jobs:
         pipx install poetry
     - name: Install dependencies
       run: |
-        poetry install
+        poetry install --all-extras
     - name: Run pytest
       run: |
        poetry run pytest --capture=no

README.md

Lines changed: 23 additions & 2 deletions
@@ -102,7 +102,7 @@ tap-carbon-intensity | target-postgres --config /path/to/target-postgres-config.

 ```bash
 pipx install poetry
-poetry install
+poetry install --all-extras
 pipx install pre-commit
 pre-commit install
 ```
@@ -152,6 +152,8 @@ develop your own Singer taps and targets.

 ## Data Types

+### Mapping
+
 The below table shows how this tap will map between jsonschema datatypes and Postgres datatypes.

 | jsonschema | Postgres |
@@ -202,7 +204,20 @@ The below table shows how this tap will map between jsonschema datatypes and Pos

 Note that while object types are mapped directly to jsonb, array types are mapped to a jsonb array.

-If a column has multiple jsonschema types, the following order is using to order Postgres types, from highest priority to lowest priority.
+When using [pgvector], this type mapping applies, additionally to the table above.
+
+| jsonschema                                      | Postgres |
+|-------------------------------------------------|----------|
+| array (with additional SCHEMA annotations [1])  | vector   |
+
+[1] `"additionalProperties": {"storage": {"type": "vector", "dim": 4}}`
+
+### Resolution Order
+
+If a column has multiple jsonschema types, there is a priority list for
+resolving the best type candidate, from the highest priority to the
+lowest priority.
+
 - ARRAY(JSONB)
 - JSONB
 - TEXT
@@ -215,3 +230,9 @@ If a column has multiple jsonschema types, the following order is using to order
 - INTEGER
 - BOOLEAN
 - NOTYPE
+
+When using [pgvector], the `pgvector.sqlalchemy.Vector` type is added to the bottom
+of the list.
+
+
+[pgvector]: https://github.com/pgvector/pgvector
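
For illustration, here is a minimal sketch of how such an annotated array property resolves to the pgvector column type. It assumes `pgvector` is installed and that the type-picking logic added in `target_postgres/connector.py` (see the diff below) is reachable as `PostgresConnector.pick_individual_type`; the exact call path may differ.

from target_postgres.connector import PostgresConnector

# A plain array property falls through to jsonb-backed ARRAY storage.
plain = PostgresConnector.pick_individual_type(
    {"type": ["array"], "items": {"type": "number"}}
)

# The `additionalProperties.storage` annotation selects pgvector's vector type instead.
annotated = PostgresConnector.pick_individual_type(
    {
        "type": ["array"],
        "items": {"type": "number"},
        "additionalProperties": {"storage": {"type": "vector", "dim": 4}},
    }
)

print(type(plain).__name__)      # ARRAY (of JSONB)
print(type(annotated).__name__)  # Vector
print(annotated.dim)             # 4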

docker-compose.yml

Lines changed: 9 additions & 3 deletions
@@ -3,7 +3,7 @@
 version: "2.1"
 services:
   postgres:
-    image: docker.io/postgres:latest
+    image: ankane/pgvector:latest
     command: postgres -c ssl=on -c ssl_cert_file=/var/lib/postgresql/server.crt -c ssl_key_file=/var/lib/postgresql/server.key -c ssl_ca_file=/var/lib/postgresql/ca.crt -c hba_file=/var/lib/postgresql/pg_hba.conf
     environment:
       POSTGRES_USER: postgres
@@ -13,16 +13,19 @@ services:
       POSTGRES_INITDB_ARGS: --auth-host=cert
     # Not placed in the data directory (/var/lib/postgresql/data) because of https://gist.github.com/mrw34/c97bb03ea1054afb551886ffc8b63c3b?permalink_comment_id=2678568#gistcomment-2678568
     volumes:
+      - ./target_postgres/tests/init.sql:/docker-entrypoint-initdb.d/init.sql
       - ./ssl/server.crt:/var/lib/postgresql/server.crt # Certificate verifying the server's identity to the client.
       - ./ssl/server.key:/var/lib/postgresql/server.key # Private key to verify the server's certificate is legitimate.
       - ./ssl/ca.crt:/var/lib/postgresql/ca.crt # Certificate authority to use when verifying the client's identity to the server.
       - ./ssl/pg_hba.conf:/var/lib/postgresql/pg_hba.conf # Configuration file to allow connection over SSL.
     ports:
       - "5432:5432"
   postgres_no_ssl: # Borrowed from https://github.com/MeltanoLabs/tap-postgres/blob/main/.github/workflows/test.yml#L13-L23
-    image: docker.io/postgres:latest
+    image: ankane/pgvector:latest
     environment:
       POSTGRES_PASSWORD: postgres
+    volumes:
+      - ./target_postgres/tests/init.sql:/docker-entrypoint-initdb.d/init.sql
     ports:
       - 5433:5432
   ssh:
@@ -37,17 +40,20 @@ services:
       - PASSWORD_ACCESS=false
       - USER_NAME=melty
     volumes:
+      - ./target_postgres/tests/init.sql:/docker-entrypoint-initdb.d/init.sql
       - ./ssh_tunnel/ssh-server-config:/config/ssh_host_keys:ro
     ports:
       - "127.0.0.1:2223:2222"
     networks:
       - inner
   postgresdb:
-    image: postgres:13.0
+    image: ankane/pgvector:latest
     environment:
       POSTGRES_USER: postgres
       POSTGRES_PASSWORD: postgres
       POSTGRES_DB: main
+    volumes:
+      - ./target_postgres/tests/init.sql:/docker-entrypoint-initdb.d/init.sql
     networks:
       inner:
         ipv4_address: 10.5.0.5

poetry.lock

Lines changed: 57 additions & 7 deletions
Some generated files are not rendered by default.

pyproject.toml

Lines changed: 8 additions & 1 deletion
@@ -34,6 +34,7 @@ packages = [
 python = "<3.13,>=3.8.1"
 requests = "^2.25.1"
 singer-sdk = ">=0.28,<0.34"
+pgvector = { version="^0.2.4", optional = true }
 psycopg2-binary = "2.9.9"
 sqlalchemy = ">=2.0,<3.0"
 sshtunnel = "0.4.0"
@@ -51,11 +52,17 @@ types-simplejson = "^3.19.0.2"
 types-sqlalchemy = "^1.4.53.38"
 types-jsonschema = "^4.19.0.3"

+[tool.poetry.extras]
+pgvector = ["pgvector"]
+
 [tool.mypy]
 exclude = "tests"

 [[tool.mypy.overrides]]
-module = ["sshtunnel"]
+module = [
+    "pgvector.sqlalchemy",
+    "sshtunnel",
+]
 ignore_missing_imports = true

 [tool.isort]

target_postgres/connector.py

Lines changed: 61 additions & 1 deletion
@@ -115,6 +115,14 @@ def prepare_table( # type: ignore[override]
                 connection=connection,
             )
             return table
+        # To make table reflection work properly with pgvector,
+        # the module needs to be imported beforehand.
+        try:
+            from pgvector.sqlalchemy import Vector  # noqa: F401
+        except ImportError:
+            self.logger.debug(
+                "Unable to handle pgvector's `Vector` type. Please install `pgvector`."
+            )
         meta.reflect(connection, only=[table_name])
         table = meta.tables[
             full_table_name
@@ -280,6 +288,51 @@ def pick_individual_type(jsonschema_type: dict):
         if "object" in jsonschema_type["type"]:
             return JSONB()
         if "array" in jsonschema_type["type"]:
+            # Select between different kinds of `ARRAY` data types.
+            #
+            # This currently leverages an unspecified definition for the Singer SCHEMA,
+            # using the `additionalProperties` attribute to convey additional type
+            # information, agnostic of the target database.
+            #
+            # In this case, it is about telling different kinds of `ARRAY` types apart:
+            # Either it is a vanilla `ARRAY`, to be stored into a `jsonb[]` type, or,
+            # alternatively, it can be a "vector" kind `ARRAY` of floating point
+            # numbers, effectively what pgvector is storing in its `VECTOR` type.
+            #
+            # Still, `type: "vector"` is only a surrogate label here, because other
+            # database systems may use different types for implementing the same thing,
+            # and need to translate accordingly.
+            """
+            Schema override rule in `meltano.yml`:
+
+                type: "array"
+                items:
+                  type: "number"
+                additionalProperties:
+                  storage:
+                    type: "vector"
+                    dim: 4
+
+            Produced schema annotation in `catalog.json`:
+
+                {"type": "array",
+                 "items": {"type": "number"},
+                 "additionalProperties": {"storage": {"type": "vector", "dim": 4}}}
+            """
+            if (
+                "additionalProperties" in jsonschema_type
+                and "storage" in jsonschema_type["additionalProperties"]
+            ):
+                storage_properties = jsonschema_type["additionalProperties"]["storage"]
+                if (
+                    "type" in storage_properties
+                    and storage_properties["type"] == "vector"
+                ):
+                    # On PostgreSQL/pgvector, use the corresponding type definition
+                    # from its SQLAlchemy dialect.
+                    from pgvector.sqlalchemy import Vector
+
+                    return Vector(storage_properties["dim"])
             return ARRAY(JSONB())
         if jsonschema_type.get("format") == "date-time":
             return TIMESTAMP()
@@ -313,6 +366,13 @@ def pick_best_sql_type(sql_type_array: list):
             NOTYPE,
         ]

+        try:
+            from pgvector.sqlalchemy import Vector
+
+            precedence_order.append(Vector)
+        except ImportError:
+            pass
+
         for sql_type in precedence_order:
             for obj in sql_type_array:
                 if isinstance(obj, sql_type):
@@ -519,7 +579,7 @@ def _adapt_column_type( # type: ignore[override]
             return

         # Not the same type, generic type or compatible types
-        # calling merge_sql_types for assistnace
+        # calling merge_sql_types for assistance.
         compatible_sql_type = self.merge_sql_types([current_type, sql_type])

         if str(compatible_sql_type) == str(current_type):
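
As a quick cross-check of what the `Vector(dim)` type returned by `pick_individual_type` renders to in DDL, a small sketch using pgvector's SQLAlchemy integration (assumes `pgvector` is installed; the column name is arbitrary):

import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
from pgvector.sqlalchemy import Vector

# Compile the column type against the PostgreSQL dialect to see the emitted DDL.
column = sa.Column("value", Vector(4))
print(column.type.compile(dialect=postgresql.dialect()))  # expected output: VECTOR(4)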
target_postgres/tests/data_files/array_float_vector.singer

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
+{"type": "SCHEMA", "stream": "array_float_vector", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}, "value": {"type": "array", "items": {"type": "number"}, "additionalProperties": {"storage": {"type": "vector", "dim": 4}}}}}}
+{"type": "RECORD", "stream": "array_float_vector", "record": {"id": 1, "value": [ 1.1, 2.1, 1.1, 1.3 ]}}
+{"type": "RECORD", "stream": "array_float_vector", "record": {"id": 2, "value": [ 1.0, 1.0, 1.0, 2.3 ]}}
+{"type": "RECORD", "stream": "array_float_vector", "record": {"id": 3, "value": [ 2.0, 1.2, 1.0, 0.9 ]}}
+{"type": "STATE", "value": {"array_float_vector": 3}}

target_postgres/tests/init.sql

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+CREATE EXTENSION IF NOT EXISTS vector;
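
To sanity-check the pgvector-enabled images and the `init.sql` bootstrap end to end, a minimal sketch against the `postgres_no_ssl` service from `docker-compose.yml` (assumes that service is up on port 5433 with the default `postgres` user and database, and that `pgvector` is installed locally; the table name is made up for illustration):

import sqlalchemy as sa
from pgvector.sqlalchemy import Vector

engine = sa.create_engine("postgresql+psycopg2://postgres:postgres@localhost:5433/postgres")
metadata = sa.MetaData()
demo = sa.Table(
    "pgvector_demo",  # hypothetical table, not created by the target itself
    metadata,
    sa.Column("id", sa.BigInteger, primary_key=True),
    sa.Column("value", Vector(4)),
)

with engine.begin() as conn:
    # init.sql has already run `CREATE EXTENSION IF NOT EXISTS vector;` at container startup.
    metadata.create_all(conn)
    conn.execute(sa.insert(demo).values(id=1, value=[1.1, 2.1, 1.1, 1.3]))
    stored = conn.execute(sa.select(demo.c.value)).scalar_one()
    print(stored)  # the stored 4-dimensional vector (returned as a numpy array)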

target_postgres/tests/test_target_postgres.py

Lines changed: 20 additions & 0 deletions
@@ -473,6 +473,26 @@ def test_array_boolean(postgres_target, helper):
     )


+def test_array_float_vector(postgres_target, helper):
+    pgvector_sa = pytest.importorskip("pgvector.sqlalchemy")
+
+    file_name = "array_float_vector.singer"
+    singer_file_to_target(file_name, postgres_target)
+    row = {
+        "id": 1,
+        "value": "[1.1,2.1,1.1,1.3]",
+    }
+    helper.verify_data("array_float_vector", 3, "id", row)
+
+    helper.verify_schema(
+        "array_float_vector",
+        check_columns={
+            "id": {"type": BIGINT},
+            "value": {"type": pgvector_sa.Vector},
+        },
+    )
+
+
 def test_array_number(postgres_target, helper):
     file_name = "array_number.singer"
     singer_file_to_target(file_name, postgres_target)
