|
| 1 | +--- |
| 2 | +title: "Pipeline Auto-Processing" |
| 3 | +navTitle: "Auto-Processing" |
| 4 | +description: "Pipeline Auto-Processing" |
| 5 | +--- |
| 6 | + |
| 7 | +## Overview |
| 8 | +Pipeline Auto-Processing is designed to keep source data and pipeline output in sync. Without this capability, users would have to |
| 9 | +manually trigger processing or provide external scripts, schedulers or triggers. Auto-processing provides: |
| 10 | +- **Full sync:** Insert/delete/update are all handled automatically. No lost updates, missing data or stale records. |
| 11 | +- **Change detection:** Only new or changed records are processed. No unnecessary re-processing of known records. |
| 12 | +- **Batch processing:** Records are grouped into batches that are processed concurrently. This reduces overhead and improves performance, for example with GPU-based AI model inference tasks. |
| 13 | +- **Background processing:** When enabled, the pipeline runs in a background worker process so that it doesn't block or delay other DB operations. Ideal for processing huge datasets. |
| 14 | +- **Live processing for Postgres tables:** When the data source is a Postgres table, live trigger-based auto-processing can be enabled so that pipeline results are guaranteed to be up to date. |
| 15 | +- **Quick turnaround:** Once a batch has finished processing, the results are immediately available. No full listing of the source data is needed to start processing. This is important for large external volumes where a full listing can take a long time. |
| 16 | + |
| 17 | +### Example for knowledge base pipeline |
| 18 | +A knowledge base is created for a Postgres table containing products with product descriptions. |
| 19 | +The user configures background auto-processing to always keep embeddings in sync without blocking or delaying any operations on the products table. |
| 20 | + |
| 21 | +The pipeline processes any pre-existing product records in the background. The user can query the statistics table to see the progress. |
| 22 | + |
| 23 | +The background process also runs whenever data is inserted, modified, or deleted. |
| 24 | + |
| 25 | +Queries on the knowledge base (i.e. retrieval operations) return results that reflect the source data, subject to a small background processing delay. |
| 26 | + |
| 27 | + |
| 28 | +### Supported pipelines and modes |
| 29 | +#### Knowledge base pipeline |
| 30 | + |
| 31 | +| Source Type | Destination Type | Live | Background | Disabled (Manual) | |
| 32 | +|-------------|------------------|------|------------|-------------------| |
| 33 | +| Table | Table | ✅ | ✅ | ✅ | |
| 34 | +| Volume | Table | ❌ | ✅ | ✅ | |
| 35 | +_Outputting to volumes is not supported on knowledge base pipelines. A database with vector capabilities is necessary._ |
| 36 | + |
| 37 | + |
| 38 | +#### Preparer pipeline |
| 39 | +| Source Type | Destination Type | Live | Background | Disabled (Manual) | |
| 40 | +|-------------|------------------|------|------------|-------------------| |
| 41 | +| Table | Table | ✅ | ❌ | ✅ | |
| 42 | +| Table | Volume | ❌ | ❌ | ✅ | |
| 43 | +| Volume | Table | ❌ | ❌ | ✅ | |
| 44 | +| Volume | Volume | ❌ | ❌ | ✅ | |
| 45 | + |
| 46 | +_The preparer pipeline does not yet support batch processing or background auto-processing._ |
| 47 | + |
| 48 | +## Auto-Processing modes |
| 49 | +The following Auto-Processing modes are available to suit different requirements and use-cases. |
| 50 | + |
| 51 | + |
| 52 | +### Live |
| 53 | +AIDB sets up Postgres triggers on the source table to process any changes immediately. Processing happens within the trigger function. |
| 54 | +This means it happens within the same transaction that modifies the data, guaranteeing up-to-date results. |
| 55 | + |
| 56 | +#### Pros & Cons |
| 57 | +- **Pro:** Transactional guarantee and immediate results. Pipeline results are always up to date with the source data. |
| 58 | +- **Con:** Blocks operations on the source data. Transactions that modify the source data are delayed until processing is complete. |
| 59 | + |
| 60 | +### Background |
| 61 | +AIDB starts a Postgres background worker for each pipeline that has background auto-processing configured. |
| 62 | +Processing happens asynchronously based on a configurable `background_sync_interval`. See [change detection below](#change-detection) for details on how the pipelines are processed. |
| 63 | + |
| 64 | +!!! Note |
| 65 | +Make sure Postgres allows running enough background workers for the number of pipelines where you wish to use this processing mode. This is controlled by the Postgres setting `max_worker_processes`. |
| 66 | +!!! |
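As a quick check for the note above, the current worker limit can be inspected and raised with standard Postgres commands. This is only a sketch: the value `16` is an illustrative assumption, and the right number depends on how many background pipelines and other extensions share the worker pool.

```sql
-- Show the current limit on background worker processes.
SHOW max_worker_processes;

-- Raise the limit (16 is an illustrative value, not a recommendation).
ALTER SYSTEM SET max_worker_processes = 16;

-- max_worker_processes is only read at server start,
-- so the new value takes effect after a Postgres restart.
```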
| 67 | + |
| 68 | +#### Pros & Cons |
| 69 | +- **Pro:** Asynchronous execution means queries on the source aren't delayed while the changes are processed. |
| 70 | +- **Con:** Results are delayed and might become backlogged. |
| 71 | +- **Pro:** Ideal for huge datasets. Processing occurs continuously in the background and is not tied to any user session or SQL function call. |
| 72 | + |
| 73 | +### Disabled |
| 74 | +Auto-processing is disabled. Users can manually call [`aidb.bulk_embedding()`](../reference/knowledge_bases#aidbbulk_embedding) to process the pipelines. |
| 75 | + |
| 76 | +_Note: On table knowledge bases, change detection is also disabled (since it requires active triggers on the source table). This means manual processing (via `aidb.bulk_embedding()`) has to process all the records in the source._ |
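A manual run might look like the sketch below. It assumes `aidb.bulk_embedding()` takes the knowledge base name as its only argument, and it reuses the name `kb_table_text_manual` purely as an example. Check the linked reference for the exact signature.

```sql
-- Manually process a pipeline whose auto-processing is disabled.
-- Assumption: the function takes the knowledge base name as its argument.
SELECT aidb.bulk_embedding('kb_table_text_manual');
```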
| 77 | + |
| 78 | + |
| 79 | + |
| 80 | + |
| 81 | +## Observability |
| 82 | +We provide detailed status and progress output for all auto-processing modes. |
| 83 | + |
| 84 | +A good place to get an overview is the statistics table. |
| 85 | +Look up the view [`aidb.knowledge_base_stats`](../reference/knowledge_bases#aidbknowledge_base_stats) or use its short alias `aidb.kbstat`. The view shows all configured knowledge base pipelines, |
| 86 | +which processing mode is set, and statistics about the processed records: |
| 87 | +```sql |
| 88 | +SELECT * FROM aidb.kbstat; |
| 89 | +__OUTPUT__ |
| 90 | + knowledge base | auto processing | table: unprocessed rows | volume: scans completed | count(source records) | count(embeddings) |
| 91 | +------------------------+-----------------+-------------------------+-------------------------+-----------------------+------------------- |
| 92 | + kb_table_text_bg | Background | 0 | | 15 | 15 |
| 93 | + kb_table_text_manual | Disabled | 0 | | 15 | 15 |
| 94 | + kb_table_image_manual | Disabled | 0 | | 3 | 3 |
| 95 | + kb_table_text_live | Live | 0 | | 15 | 15 |
| 96 | + kb_table_image_bg | Background | 0 | | 3 | 3 |
| 97 | + kb_volume_text_bg | Background | | 6 | 7 | 7 |
| 98 | + kb_volume_text_manual | Disabled | | 0 | 0 | 0 |
| 99 | + kb_volume_image_bg | Background | | 4 | 177 | 6 |
| 100 | + kb_volume_image_manual | Disabled | | 1 | 177 | 6 |
| 101 | +(9 rows) |
| 102 | +``` |
| 103 | + |
| 104 | +The [change detection](#change-detection) mechanism is central to how auto-processing works. It is different for volume and table sources. |
| 105 | +For this reason, the stats table has different columns for these two source types. |
| 106 | + |
| 107 | +* `table: unprocessed rows`: How many unique rows are listed in the backlog of change events. |
| 108 | + * If auto-processing is disabled, no (new) change events are captured. |
| 109 | +* `volume: scans completed`: How many full listings of the source have been completed so far. |
| 110 | +* `count(source records)`: How many records exist in the source for this pipeline. |
| 111 | +  * For table sources, this number is always accurate. |
| 112 | +  * For volume sources, this number can only be updated after a full scan has completed. |
| 113 | +* `count(embeddings)`: How many embeddings exist in the vector destination table for this pipeline. |
| 114 | + |
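To watch only the pipelines that still have a backlog, the view can be filtered like any other relation. The sketch below assumes that the column names are exactly the header labels shown in the output above. Because they contain spaces and punctuation, they have to be written as double-quoted identifiers.

```sql
-- List table-source pipelines that still have unprocessed change events.
-- Assumption: the column names match the header labels of aidb.kbstat.
SELECT "knowledge base",
       "auto processing",
       "table: unprocessed rows"
FROM aidb.kbstat
WHERE "table: unprocessed rows" > 0
ORDER BY "table: unprocessed rows" DESC;
```

Rows for volume sources have no value in `table: unprocessed rows`, so the comparison filters them out.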
| 115 | + |
| 116 | + |
| 117 | +## Configuration |
| 118 | +Auto-processing can be configured at creation time: |
| 119 | +- With [`aidb.create_table_knowledge_base`](../reference/knowledge_bases#aidbcreate_table_knowledge_base) |
| 120 | +- With [`aidb.create_volume_knowledge_base`](../reference/knowledge_bases#aidbcreate_volume_knowledge_base) |
| 121 | + |
| 122 | +As well as for existing pipelines: |
| 123 | +- With [`aidb.set_auto_knowledge_base`](../reference/knowledge_bases#aidbset_auto_knowledge_base) |
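Switching the mode of an existing pipeline could look like the following sketch. It assumes that `aidb.set_auto_knowledge_base` takes the knowledge base name followed by the mode, and that the mode values are spelled as in the `auto processing` column of the statistics view (`Live`, `Background`, `Disabled`). The knowledge base name is only an example. See the linked reference for the exact signature.

```sql
-- Enable background auto-processing on an existing knowledge base.
-- Assumption: arguments are (knowledge base name, mode).
SELECT aidb.set_auto_knowledge_base('kb_table_text_manual', 'Background');

-- Confirm the change in the statistics view.
SELECT * FROM aidb.kbstat;
```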
| 124 | + |
| 125 | +## Batch processing |
| 126 | +In Background and Disabled modes, (auto) processing happens in batches of configurable size. Within each batch, records are processed concurrently. |
| 127 | + |
| 128 | +## Change detection |
| 129 | +AIDB auto-processing is designed around change detection mechanisms for table and volume data sources. This allows it to only |
| 130 | +process data when necessary. |
| 131 | + |
| 132 | +### Table sources |
| 133 | +When background auto-processing is enabled, Postgres triggers are set up on the source table to detect changes. These triggers are very lightweight. |
| 134 | +They only record change events and insert them into a "change events" table. No actual processing happens in the trigger function. |
| 135 | + |
| 136 | +The background worker will then process these events asynchronously. |
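The pattern described above can be sketched in plain PostgreSQL. This is a generic illustration and not AIDB's actual implementation: the tables `products` and `product_change_events`, the function, and the trigger are all hypothetical names.

```sql
-- Hypothetical source table.
CREATE TABLE products (
    id          bigint PRIMARY KEY,
    description text
);

-- Hypothetical "change events" table: one row per captured change.
CREATE TABLE product_change_events (
    event_id   bigserial PRIMARY KEY,
    source_id  bigint NOT NULL,
    operation  text NOT NULL,               -- 'INSERT', 'UPDATE' or 'DELETE'
    changed_at timestamptz NOT NULL DEFAULT now()
);

-- Lightweight trigger function: it only records the event, no processing.
CREATE FUNCTION record_product_change() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO product_change_events (source_id, operation)
        VALUES (OLD.id, TG_OP);
    ELSE
        INSERT INTO product_change_events (source_id, operation)
        VALUES (NEW.id, TG_OP);
    END IF;
    RETURN NULL;  -- the return value is ignored for AFTER row triggers
END;
$$;

CREATE TRIGGER products_record_change
AFTER INSERT OR UPDATE OR DELETE ON products
FOR EACH ROW EXECUTE FUNCTION record_product_change();

-- A worker would later read the backlog; the number of unique rows in it:
SELECT count(DISTINCT source_id) AS unprocessed_rows
FROM product_change_events;
```

Keeping the trigger to a single insert is what makes the write path cheap: the expensive work (for example computing embeddings) is deferred to the asynchronous worker.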
| 137 | + |
| 138 | +### Volume sources |
| 139 | +This source type provides a `last_modified` timestamp for each source record. The system keeps track of those timestamps in a "state" table. |
| 140 | +In each pipeline execution, the system lists the contents of the volume and compares it to the timestamps to see whether any records have changed or were added. |
| 141 | + |
| 142 | +This mechanism works in both Disabled and Background modes. |
| 143 | + |
| 144 | +The system detects deleted objects after a full listing is complete. Only then can it be certain that a previously processed record is no longer present in the source. |
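The comparison described above can be illustrated with two small tables. This is a sketch under assumptions, not AIDB's internal schema: `volume_state` stands in for the "state" table and `volume_listing` for the result of one complete listing. Both names and all sample data are hypothetical.

```sql
-- Timestamps recorded when each object was last processed (hypothetical).
CREATE TEMP TABLE volume_state (
    object_key    text PRIMARY KEY,
    last_modified timestamptz NOT NULL
);

-- Result of the current, complete listing of the volume (hypothetical).
CREATE TEMP TABLE volume_listing (
    object_key    text PRIMARY KEY,
    last_modified timestamptz NOT NULL
);

INSERT INTO volume_state VALUES
    ('a.txt', '2024-01-01 00:00:00+00'),
    ('b.txt', '2024-01-01 00:00:00+00'),
    ('c.txt', '2024-01-01 00:00:00+00');

INSERT INTO volume_listing VALUES
    ('a.txt', '2024-01-01 00:00:00+00'),   -- unchanged
    ('b.txt', '2024-02-01 00:00:00+00'),   -- modified since last run
    ('d.txt', '2024-02-01 00:00:00+00');   -- new object; c.txt is gone

-- Classify every object. 'deleted' is only trustworthy because the
-- listing is complete; a partial listing could simply not have reached c.txt.
SELECT COALESCE(l.object_key, s.object_key) AS object_key,
       CASE
           WHEN s.object_key IS NULL THEN 'new'
           WHEN l.object_key IS NULL THEN 'deleted'
           WHEN l.last_modified > s.last_modified THEN 'changed'
           ELSE 'unchanged'
       END AS status
FROM volume_listing AS l
FULL OUTER JOIN volume_state AS s
    ON l.object_key = s.object_key
ORDER BY object_key;
-- a.txt unchanged, b.txt changed, c.txt deleted, d.txt new
```

New and changed objects can be detected as soon as they appear in a listing, which is why processing can start before the listing finishes. Deletions need the full outer join over a complete listing.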
| 145 | + |
| 146 | +Unfortunately, object stores (and other external storage locations supported by our volumes) have limited query capabilities. This means: |
| 147 | +!!! Note |
| 148 | +Change detection for volumes is based on polling, i.e. repeated listing. This can be an expensive operation when using cloud object stores like AWS S3. |
| 149 | +You can use a long `background_sync_interval` (such as once per day) on pipelines with volume sources to control this cost. |
| 150 | +!!! |