Commit 4db1742

Merge pull request #6747 from EnterpriseDB/release/2025-05-05a
Release/2025-05-05a
2 parents 3233219 + a415e2c commit 4db1742


61 files changed: +2208 -1042 lines

advocacy_docs/community/contributing/styleguide.mdx

+3 -4
```diff
@@ -583,12 +583,11 @@ Term | Description | Example
 
 EDB docs uses two types of lists:
 
-* **Numbered** (ordered) — Use to list information that must appear in order, like tutorial steps.
+1. **Numbered** (ordered) — Use to list information that must appear in order, like tutorial steps.
 
-**Bulleted** (unordered) — Use to list related information in an easy-to-read way.
-
-Introduce lists with a sentence and a colon. Use periods at the end of list items that are sentences or complete a sentence.
+- **Bulleted** (unordered) — Use to list related information in an easy-to-read way.
 
+Introduce lists with a sentence and a colon. Use periods at the end of list items that are sentences or complete a sentence.
 
 For each item of a **numbered** list, use `1.` followed by a period and a space, for example:
```
@@ -0,0 +1,150 @@
---
title: "Pipeline Auto-Processing"
navTitle: "Auto-Processing"
description: "Pipeline Auto-Processing"
---

## Overview

Pipeline Auto-Processing is designed to keep source data and pipeline output in sync. Without this capability, users would have to
trigger processing manually or provide external scripts, schedulers, or triggers:

- **Full sync:** Inserts, updates, and deletes are all handled automatically. No lost updates, missing data, or stale records.
- **Change detection:** Only new or changed records are processed. No unnecessary reprocessing of known records.
- **Batch processing:** Records are grouped into batches and processed concurrently, reducing overhead and improving performance, for example, with GPU-based AI model inference tasks.
- **Background processing:** When enabled, the pipeline runs in a background worker process so that it doesn't block or delay other database operations. Ideal for processing huge datasets.
- **Live processing for Postgres tables:** When the data source is a Postgres table, live trigger-based auto-processing can be enabled so that pipeline results are always guaranteed to be up to date.
- **Quick turnaround:** Once a batch has finished processing, the results are immediately available. No full listing of the source data is needed to start processing. This is important for large external volumes where a full listing can take a long time.

### Example for knowledge base pipeline

A knowledge base is created for a Postgres table containing products with product descriptions.
The user configures background auto-processing to always keep embeddings in sync without blocking or delaying any operations on the products table.

The pipeline processes any pre-existing product records in the background; the user can query the statistics table to see the progress.

The background process runs whenever new data is inserted or existing data is modified or deleted.

Queries on the knowledge base (that is, retrieval operations) always return accurate results within a small background processing delay.
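A minimal sketch of this setup with `aidb.create_table_knowledge_base` could look like the following. The table, column, model, and argument names are illustrative assumptions; see the reference page for the exact signature:

```sql
-- Hypothetical sketch: embeddings for a products table, kept in sync by a
-- background worker. All names and argument spellings here are illustrative.
SELECT aidb.create_table_knowledge_base(
    name => 'products_kb',
    model_name => 'my_embedding_model',   -- a previously created model
    source_table => 'products',
    source_data_column => 'description',
    auto_processing => 'Background'       -- 'Live', 'Background', or 'Disabled'
);
```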

### Supported pipelines and modes

#### Knowledge base pipeline

| Source Type | Destination Type | Live | Background | Disabled (Manual) |
|-------------|------------------|------|------------|-------------------|
| Table       | Table            | ✅   | ✅         | ✅                |
| Volume      | Table            | ❌   | ✅         | ✅                |

_Outputting to volumes is not supported on knowledge base pipelines. A database with vector capabilities is necessary._

#### Preparer pipeline

| Source Type | Destination Type | Live | Background | Disabled (Manual) |
|-------------|------------------|------|------------|-------------------|
| Table       | Table            | ✅   | ❌         | ✅                |
| Table       | Volume           | ❌   | ❌         | ✅                |
| Volume      | Table            | ❌   | ❌         | ✅                |
| Volume      | Volume           | ❌   | ❌         | ✅                |

_The preparer pipeline does not yet support batch processing or background auto-processing._

## Auto-Processing modes

The following auto-processing modes are available to suit different requirements and use cases.

### Live

AIDB sets up Postgres triggers on the source table to immediately process any changes. Processing happens within the trigger function.
This means it happens within the same transaction that modifies the data, guaranteeing up-to-date results.

#### Pros & Cons

- Transactional guarantee / immediate results. Pipeline results are always up to date with the source data.
- Blocks / delays operations on the source data. Modifying transactions on the source data are delayed until processing is complete.

### Background

AIDB starts a Postgres background worker for each pipeline that has background auto-processing configured.
Processing happens asynchronously based on a configurable `background_sync_interval`. See [change detection below](#change-detection) for details on how the pipelines are processed.

!!! Note
Make sure Postgres allows running enough background workers for the number of pipelines where you wish to use this processing mode. This is controlled by the Postgres setting `max_worker_processes`.
!!!
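For example, you can inspect and raise the limit with plain SQL (`max_worker_processes` only takes effect after a server restart):

```sql
-- Check how many background workers Postgres currently allows.
SHOW max_worker_processes;

-- Raise the limit if you run many background pipelines.
-- The new value takes effect only after a server restart.
ALTER SYSTEM SET max_worker_processes = 32;
```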

#### Pros & Cons

- Asynchronous execution means queries on the source don't have to be delayed while the changes are processed.
- Results are delayed and might become backlogged.
- Ideal for huge datasets; processing occurs continuously in the background and isn't tied to any user session or SQL function call.

### Disabled

Auto-processing is disabled. Users can manually call [`aidb.bulk_embedding()`](../reference/knowledge_bases#aidbbulk_embedding) to process the pipelines.

_Note: On table knowledge bases, change detection is also disabled (since it requires active triggers on the source table). This means manual processing (via `aidb.bulk_embedding()`) has to process all the records in the source._
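In this mode, a manual run could look like this (the knowledge base name is illustrative):

```sql
-- Hypothetical example: process all source records of one pipeline manually.
SELECT aidb.bulk_embedding('products_kb');
```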

## Observability

We provide detailed status and progress output for all auto-processing modes.

A good place to get an overview is the statistics table.
Look up the view [`aidb.knowledge_base_stats`](../reference/knowledge_bases#aidbknowledge_base_stats) or use its short alias `aidb.kbstat`. The view shows all configured knowledge base pipelines,
which processing mode is set, and statistics about the processed records:

```sql
SELECT * FROM aidb.kbstat;
__OUTPUT__
     knowledge base     | auto processing | table: unprocessed rows | volume: scans completed | count(source records) | count(embeddings)
------------------------+-----------------+-------------------------+-------------------------+-----------------------+-------------------
 kb_table_text_bg       | Background      |                       0 |                         |                    15 |                15
 kb_table_text_manual   | Disabled        |                       0 |                         |                    15 |                15
 kb_table_image_manual  | Disabled        |                       0 |                         |                     3 |                 3
 kb_table_text_live     | Live            |                       0 |                         |                    15 |                15
 kb_table_image_bg      | Background      |                       0 |                         |                     3 |                 3
 kb_volume_text_bg      | Background      |                         |                       6 |                     7 |                 7
 kb_volume_text_manual  | Disabled        |                         |                       0 |                     0 |                 0
 kb_volume_image_bg     | Background      |                         |                       4 |                   177 |                 6
 kb_volume_image_manual | Disabled        |                         |                       1 |                   177 |                 6
(9 rows)
```
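Since `aidb.kbstat` is a view, you can also filter it like any other relation, using the column names shown in the output above:

```sql
-- Show background pipelines that still have a backlog of unprocessed rows.
SELECT "knowledge base", "table: unprocessed rows"
FROM aidb.kbstat
WHERE "auto processing" = 'Background';
```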
The [change detection](#change-detection) mechanism is central to how auto-processing works. It differs between volume and table sources.
For this reason, the stats table has different columns for these two source types.

* `table: unprocessed rows`: How many unique rows are listed in the backlog of change events.
  * If auto-processing is disabled, no (new) change events are captured.
* `volume: scans completed`: How many full listings of the source have been completed so far.
* `count(source records)`: How many records exist in the source for this pipeline.
  * For table sources, this number is always accurate.
  * For volume sources, this number can only be updated after a full scan has completed.
* `count(embeddings)`: How many embeddings exist in the vector destination table for this pipeline.

## Configuration

Auto-processing can be configured at creation time:

- With [`aidb.create_table_knowledge_base`](../reference/knowledge_bases#aidbcreate_table_knowledge_base)
- With [`aidb.create_volume_knowledge_base`](../reference/knowledge_bases#aidbcreate_volume_knowledge_base)

It can also be changed on existing pipelines:

- With [`aidb.set_auto_knowledge_base`](../reference/knowledge_bases#aidbset_auto_knowledge_base)
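For example, switching modes on an existing pipeline might look like this (the pipeline name is illustrative; see the reference for the exact signature):

```sql
-- Hypothetical example: move an existing pipeline to background processing.
SELECT aidb.set_auto_knowledge_base('products_kb', 'Background');

-- Or turn auto-processing off entirely.
SELECT aidb.set_auto_knowledge_base('products_kb', 'Disabled');
```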

## Batch processing

In Background and Disabled modes, (auto) processing happens in batches of configurable size. Within each batch, records are processed concurrently.

## Change detection

AIDB auto-processing is designed around change detection mechanisms for table and volume data sources. This allows it to
process data only when necessary.

### Table sources

When background auto-processing is enabled, Postgres triggers are set up on the source table to detect changes. These triggers are very lightweight.
They only record change events and insert them into a "change events" table. No actual processing happens in the trigger function.

The background worker then processes these events asynchronously.

### Volume sources

This source type provides a `last_modified` timestamp for each source record. The system keeps track of those timestamps in a "state" table.
In each pipeline execution, the system lists the contents of the volume and compares it to the timestamps to see whether any records have changed or were added.

This mechanism works in both disabled and background auto-processing modes.

The system detects deleted objects only after a full listing is complete. Only then can it be certain that a previously processed record is no longer present in the source.

Unfortunately, object stores (and other external storage locations supported by our volumes) have limited query capabilities. This means:

!!! Note
Change detection for volumes is based on polling, that is, repeated listing. This might be an expensive operation when using cloud object stores like AWS S3.
You can use a long `background_sync_interval` (such as once per day) on pipelines with volume sources to control this cost.
!!!
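As the note suggests, a pipeline with a volume source can be given a long interval. This sketch assumes `background_sync_interval` can be passed to `aidb.set_auto_knowledge_base`; the pipeline name and argument spelling are illustrative, so check the reference for the exact signature:

```sql
-- Hypothetical example: poll the object store only once per day.
SELECT aidb.set_auto_knowledge_base(
    'docs_volume_kb',
    'Background',
    background_sync_interval => '1 day'
);
```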

advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities.mdx renamed to advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/index.mdx

+13 -14
```diff
@@ -16,7 +16,7 @@ Data for processing can be stored in the database in a table or in an external s
 If you want to use an external storage location to access data, you must create a storage location.
 This storage location can be an S3 bucket or a local file system.
 
-The storage locations can be used by AI Accelerator to create a volume. This volume can then be used by a retriever to access its data.
+The storage locations can be used by AI Accelerator to create a volume. This volume can then be used by a knowledge base to access its data.
 
 ### Create a preparer (optional)
```

```diff
@@ -44,49 +44,48 @@ When a preparer is created, by default it assumes column identifiers of "id" for
 
 ### Create a model
 
-Create a [model](models) with AI Accelerator Pipelines. This model can be a machine learning model, a deep learning model, or any other type of model that can be used for AI tasks.
+Create a [model](../models) with AI Accelerator Pipelines. This model can be a machine learning model, a deep learning model, or any other type of model that can be used for AI tasks.
 
-### Create a retriever
+### Create a knowledge base
 
-Create a retriever with AI Accelerator Pipelines. A retriever is a function that retrieves data from a table or volume and returns it in a format that the model can use.
+Create a knowledge base with AI Accelerator Pipelines. A knowledge base is a function that retrieves data from a table or volume and returns it in a format that the model can use.
 
-By default, a retriever only needs:
+By default, a knowledge base only needs:
 
 * A name
 * The name of a model to use
 
-If the retriever is for a table, it also needs:
+If the knowledge base is for a table, it also needs:
 
 * The name of the source table
 * The name of the column in the source table that contains the data
 * The data type of the column
 
-If the retriever is for a volume, it needs:
+If the knowledge base is for a volume, it needs:
 
 * The name of the volume
 * The name of the column in the volume that contains the data
 
-When you create a retriever, by default a vector table is created to store the embeddings of the data that's retrieved.
+When you create a knowledge base, by default a vector table is created to store the embeddings of the data that's retrieved.
 This table has a column to store the embeddings and a column to store the key of the data.
 
-When you create the retriever, you can specify the name of the vector table and the name of the vector column and the key column. This ability is useful if you're migrating to aidb and want to use an existing vector table.
+When you create the knowledge base, you can specify the name of the vector table and the name of the vector column and the key column. This ability is useful if you're migrating to aidb and want to use an existing vector table.
 
 ### Create embeddings
 
 Embedding sees the data being retrieved from the source table or volume and encoded into a vector datatype. That vector data is then stored in the vector table.
 
-If the source table already has data/rows when the retriever is created, then you need to make a manual *bulk embedding* call. This call generates the embeddings for all the existing data in the source table.
+See [auto-processing](auto-processing) to understand how embedding computation can be configured and run.
 
-You can then activate auto-embedding to keep the embeddings in sync going forward. Auto-embedding uses Postgres triggers to detect insertions and updates to the source table and generates embeddings for the new data.
 
 ### Query data
 
-You can query the embedded data using the retriever. The retriever can return the key to the data or the data itself, depending on the query. You can query the data using a text query or an image query, depending on the type of data that's being retrieved.
+You can query the embedded data using the knowledge base. The knowledge base can return the key to the data or the data itself, depending on the query. You can query the data using a text query or an image query, depending on the type of data that's being retrieved.
 
 ### Next steps
 
-While auto-embedding is enabled, the embeddings are always up to date, and applications can use the retriever to query the data as needed.
+While auto-processing is enabled, the embeddings are always up to date, and applications can use the knowledge base to query the data as needed.
 
 ### Cleanup
 
-If the embeddings are no longer required, you can delete the retriever, drop the vector table, and delete the model.
+If the embeddings are no longer required, you can delete the knowledge base, drop the vector table, and delete the model.
```

advocacy_docs/edb-postgres-ai/ai-accelerator/compatibility.mdx

+20 -21
```diff
@@ -8,41 +8,40 @@ description: Compatibility information for the EDB Postgres AI - AI Accelerator
 
 ### Supported platforms
 
-* Ubuntu 22.04LTS and 24.04LTS on X86/64
-* Debian 12 (Bookworm) on X86/64
-* Redhat/RHEL 9/8 on X86/64
+* Ubuntu 22.04LTS and 24.04LTS on X86/64.
+* Debian 12 (Bookworm) on X86/64 and ARM64.
+* Redhat/RHEL 9/8 on X86/64.
+* Redhat/RHEL 9 on ARM64.
 
 ### Not currently supported
 
-* ARM architectures
-* SLES
-* Debian before the current version 12
-* Non-Linux platforms
+* SLES.
+* Debian before the current version 12.
+* Non-Linux platforms.
 
 ### Supported PostgreSQL versions
 
-* EDB Postgres Advanced Server Version 14, 15, 16, and 17
-* EBD Postgres Extended Version 14, 15, 16, and 17
-* PostgreSQL 14, 15, 16, and 17
+* EDB Postgres Advanced Server Version 14, 15, 16, and 17.
+* EDB Postgres Extended Version 14, 15, 16, and 17.
+* PostgreSQL 14, 15, 16, and 17.
 
 ## pgfs
 
 ### Supported platforms
 
-* Ubuntu 22.04LTS and 24.04LTS on X86/64
-* Debian 12 (Bookworm) on X86/64
-* Debian 12 (Bookworm) on X86/64 and ARM64
-* Redhat/RHEL 9/8 on X86/64
-* Redhat/RHEL 9 on ARM64
+* Ubuntu 22.04LTS and 24.04LTS on X86/64.
+* Debian 12 (Bookworm) on X86/64 and ARM64.
+* Redhat/RHEL 9/8 on X86/64.
+* Redhat/RHEL 9 on ARM64.
 
 ### Not currently supported
 
-* SLES
-* Debian before the current version 12
-* Non-Linux platforms
+* SLES.
+* Debian before the current version 12.
+* Non-Linux platforms.
 
 ### Supported PostgreSQL versions
 
-* EDB Postgres Advanced Server Version 14, 15, 16, and 17
-* EBD Postgres Extended Version 14, 15, 16, and 17
-* PostgreSQL 14, 15, 16, and 17
+* EDB Postgres Advanced Server Version 14, 15, 16, and 17.
+* EDB Postgres Extended Version 14, 15, 16, and 17.
+* PostgreSQL 14, 15, 16, and 17.
```
