EnterpriseDB
diff --git a/‎advocacy_docs/community/contributing/styleguide.mdx
+3-4 b/‎advocacy_docs/community/contributing/styleguide.mdx
diff --git a/‎advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/auto-processing.mdx
+150 b/‎advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/auto-processing.mdx
diff --git a/‎advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities.mdx renamed to ‎advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/index.mdx
+13-14 b/‎advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities.mdx renamed to ‎advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/index.mdx
diff --git a/‎advocacy_docs/edb-postgres-ai/ai-accelerator/compatibility.mdx
+20-21 b/‎advocacy_docs/edb-postgres-ai/ai-accelerator/compatibility.mdx
@@ -583,12 +583,11 @@ Term | Description | Example
 
 EDB docs uses two types of lists:
 
-* **Numbered** (ordered) &mdash; Use to list information that must appear in order, like tutorial steps.
+1. **Numbered** (ordered) &mdash; Use to list information that must appear in order, like tutorial steps.
     
-    **Bulleted** (unordered) &mdash; Use to list related information in an easy-to-read way.
-    
-    Introduce lists with a sentence and a colon. Use periods at the end of list items that are sentences or complete a sentence.
+- **Bulleted** (unordered) &mdash; Use to list related information in an easy-to-read way.
     
+Introduce lists with a sentence and a colon. Use periods at the end of list items that are sentences or complete a sentence.
 
 For each item of a **numbered** list, use `1.` followed by a period and a space, for example:
 
 
@@ -0,0 +1,150 @@
+---
+title: "Pipeline Auto-Processing"
+navTitle: "Auto-Processing"
+description: "Pipeline Auto-Processing"
+---
+
+## Overview
+Pipeline Auto-Processing is designed to keep source data and pipeline output in sync. Without this capability, users would have to
+manually trigger processing or provide external scripts, schedulers, or triggers. It provides the following features:
+- **Full sync:** Insert/delete/update are all handled automatically. No lost updates, missing data or stale records.
+- **Change detection:** Only new or changed records are processed. No unnecessary re-processing of known records.
+- **Batch processing:** Records are grouped into batches to be processed concurrently. This reduces overhead and improves performance, for example with GPU-based AI model inference tasks.
+- **Background processing:** When enabled, the pipeline runs in a background worker process so that it doesn't block or delay other DB operations. Ideal for processing huge datasets.
+- **Live processing for Postgres tables:** When the data source is a Postgres table, live trigger-based auto-processing can be enabled so that pipeline results are guaranteed to be up to date.
+- **Quick turnaround:** Once a batch has finished processing, the results are immediately available. No full listing of the source data is needed to start processing. This is important for large external volumes where a full listing can take a long time.
+
+### Example for knowledge base pipeline
+A knowledge base is created for a Postgres table containing products with product descriptions.
+The user configures background auto-processing to always keep embeddings in sync without blocking or delaying any operations on the products table.
+
+The pipeline processes any pre-existing product records in the background. The user can query the statistics table to see the progress.
+
+The background process also runs when new data is inserted or when existing data is modified or deleted.
+
+Queries on the knowledge base (i.e. retrieval operations) return accurate results, subject to a small background processing delay.
+
+
+### Supported pipelines and modes
+#### Knowledge base pipeline
+
+| Source Type | Destination Type | Live | Background | Disabled (Manual) |
+|-------------|------------------|------|------------|-------------------|
+| Table       | Table            | ✅    | ✅          | ✅                 |
+| Volume      | Table            | ❌    | ✅          | ✅                 |
+
+_Outputting to volumes is not supported on knowledge base pipelines. A database with vector capabilities is necessary._
+
+
+#### Preparer pipeline
+| Source Type | Destination Type | Live | Background | Disabled (Manual) |
+|-------------|------------------|------|------------|-------------------|
+| Table       | Table            | ✅    | ❌          | ✅                 |
+| Table       | Volume           | ❌    | ❌          | ✅                 |
+| Volume      | Table            | ❌    | ❌          | ✅                 |
+| Volume      | Volume           | ❌    | ❌          | ✅                 |
+
+_The preparer pipeline does not yet support batch processing or background auto-processing._
+
+## Auto-Processing modes
+The following Auto-Processing modes are available to suit different requirements and use-cases.
+
+
+### Live
+AIDB sets up Postgres Triggers on the source table to immediately process any changes. Processing happens within the trigger function.
+This means it happens within the same transaction that modifies the data, guaranteeing up-to-date results.
+
+#### Pros & Cons
+- **Pro:** Transactional guarantee and immediate results. Pipeline results are always up to date with the source data.
+- **Con:** Blocks or delays operations on the source data. Transactions that modify the source data are delayed until processing is complete.
+
+### Background
+AIDB starts a Postgres background worker for each pipeline that has background auto-processing configured.
+Processing happens asynchronously based on a configurable `background_sync_interval`. See [change detection below](#change-detection) for details on how the pipelines are processed.
+
+!!! Note
+Make sure Postgres allows running enough background workers for the number of pipelines where you wish to use this processing mode. This is controlled by the Postgres setting `max_worker_processes`.
+!!!
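+
+As a minimal sketch of the check described in the note, the standard Postgres commands below show the current limit and raise it. The value `16` is only an illustrative assumption. Choose a value that covers your background pipelines plus any other workers the server needs. A change to `max_worker_processes` takes effect only after a server restart.
+
+```sql
+-- Show the current limit on background worker processes.
+SHOW max_worker_processes;
+
+-- Raise the limit (16 is an example value, not a recommendation).
+-- This setting is applied at server start, so restart Postgres afterwards.
+ALTER SYSTEM SET max_worker_processes = 16;
+```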
+
+#### Pros & Cons
+- **Pro:** Asynchronous execution means queries on the source don't have to be delayed while the changes are processed.
+- **Con:** Results are delayed and might become backlogged.
+- **Pro:** Ideal for huge datasets. Processing occurs continuously in the background and is not tied to any user session or SQL function call.
+
+### Disabled
+Auto-processing is disabled. Users can manually call [`aidb.bulk_embedding()`](../reference/knowledge_bases#aidbbulk_embedding) to process the pipelines.
+
+_Note: On table knowledge bases, change detection is also disabled (since it requires active triggers on the source table). This means manual processing (via `aidb.bulk_embedding()`) has to process all the records in the source._
+
+
+
+
+## Observability
+We provide detailed status and progress output for all auto-processing modes.
+
+A good place to get an overview is the statistics table.
+Look up the view [`aidb.knowledge_base_stats`](../reference/knowledge_bases#aidbknowledge_base_stats) or use its short alias `aidb.kbstat`. The view shows all configured knowledge base pipelines,
+which processing mode is set, and statistics about the processed records:
+```sql
+SELECT * FROM aidb.kbstat;
+__OUTPUT__
+     knowledge base     | auto processing | table: unprocessed rows | volume: scans completed | count(source records) | count(embeddings)
+------------------------+-----------------+-------------------------+-------------------------+-----------------------+-------------------
+ kb_table_text_bg       | Background      |                       0 |                         |                    15 |                15
+ kb_table_text_manual   | Disabled        |                       0 |                         |                    15 |                15
+ kb_table_image_manual  | Disabled        |                       0 |                         |                     3 |                 3
+ kb_table_text_live     | Live            |                       0 |                         |                    15 |                15
+ kb_table_image_bg      | Background      |                       0 |                         |                     3 |                 3
+ kb_volume_text_bg      | Background      |                         |                       6 |                     7 |                 7
+ kb_volume_text_manual  | Disabled        |                         |                       0 |                     0 |                 0
+ kb_volume_image_bg     | Background      |                         |                       4 |                   177 |                 6
+ kb_volume_image_manual | Disabled        |                         |                       1 |                   177 |                 6
+(9 rows)
+```
+
+The [change detection](#change-detection) mechanism is central to how auto-processing works. It is different for volume and table sources.
+For this reason, the stats table has different columns for these two source types.
+
+* `table: unprocessed rows`: How many unique rows are listed in the backlog of change events.
+  * If auto-processing is disabled, no (new) change events are captured.
+* `volume: scans completed`: How many full listings of the source have been completed so far.
+* `count(source records)`: How many records exist in the source for this pipeline.
+  * For table sources, this number is always accurate.
+  * For volume sources, we can only update this number after a full scan has completed.
+* `count(embeddings)`: How many embeddings exist in the vector destination table for this pipeline.
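+
+To focus on pipelines that currently have a backlog, you can filter the view. The following is a sketch that assumes the view's column names match the headers in the output above exactly. Because they contain spaces and a colon, they must be double-quoted.
+
+```sql
+-- List background pipelines on table sources that still have unprocessed change events.
+-- Assumption: column names are identical to the headers shown in the sample output.
+SELECT "knowledge base", "auto processing", "table: unprocessed rows"
+FROM aidb.kbstat
+WHERE "auto processing" = 'Background'
+  AND "table: unprocessed rows" > 0;
+```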
+
+
+
+## Configuration
+Auto-processing can be configured at creation time:
+- With [`aidb.create_table_knowledge_base`](../reference/knowledge_bases#aidbcreate_table_knowledge_base)
+- With [`aidb.create_volume_knowledge_base`](../reference/knowledge_bases#aidbcreate_volume_knowledge_base)
+
+It can also be changed for existing pipelines:
+- With [`aidb.set_auto_knowledge_base`](../reference/knowledge_bases#aidbset_auto_knowledge_base)
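+
+As a hedged sketch, switching an existing knowledge base to background auto-processing might look like the call below. The knowledge base name is taken from the statistics example on this page. The argument order (knowledge base name, then mode) and the accepted mode values are assumptions here, so check the linked reference for the exact signature and optional parameters before using it.
+
+```sql
+-- Assumed call shape: (knowledge base name, auto-processing mode).
+-- Verify against the aidb.set_auto_knowledge_base reference.
+SELECT aidb.set_auto_knowledge_base('kb_table_text_bg', 'Background');
+```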
+
+## Batch processing
+In Background and Disabled modes, (auto) processing happens in batches of configurable size. Within each batch,
+
+## Change detection
+AIDB auto-processing is designed around change detection mechanisms for table and volume data sources. This allows it to only
+process data when necessary.
+
+### Table sources
+When background auto-processing is enabled, Postgres triggers are set up on the source table to detect changes. These triggers are very lightweight.
+They only record change events and insert them into a "change events" table. No actual processing happens in the trigger function.
+
+The background worker will then process these events asynchronously.
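+
+The general pattern of a lightweight, event-recording trigger can be sketched in plain Postgres as follows. This is an illustration only and not AIDB's internal schema. The table, function, and trigger names are all hypothetical.
+
+```sql
+-- Hypothetical source table and "change events" table.
+CREATE TABLE products (id bigint PRIMARY KEY, description text);
+CREATE TABLE product_change_events (
+    event_id   bigserial PRIMARY KEY,
+    source_key bigint NOT NULL,
+    operation  text NOT NULL,              -- 'INSERT', 'UPDATE', or 'DELETE'
+    changed_at timestamptz NOT NULL DEFAULT now()
+);
+
+-- The trigger function only records the event; no processing happens here.
+CREATE FUNCTION record_product_change() RETURNS trigger
+LANGUAGE plpgsql AS $$
+BEGIN
+    IF TG_OP = 'DELETE' THEN
+        INSERT INTO product_change_events (source_key, operation)
+        VALUES (OLD.id, TG_OP);
+        RETURN OLD;
+    END IF;
+    INSERT INTO product_change_events (source_key, operation)
+    VALUES (NEW.id, TG_OP);
+    RETURN NEW;
+END;
+$$;
+
+CREATE TRIGGER products_change_capture
+AFTER INSERT OR UPDATE OR DELETE ON products
+FOR EACH ROW EXECUTE FUNCTION record_product_change();
+```
+
+A worker can then read `product_change_events` in batches and process each distinct `source_key`, which keeps the cost added to the modifying transaction down to a single small insert.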
+
+### Volume sources
+This source type provides a `last_modified` timestamp for each source record. The system keeps track of those timestamps in a "state" table.
+In each pipeline execution, the system lists the contents of the volume and compares it to the timestamps to see whether any records have changed or were added.
+
+This mechanism works in both Disabled and Background auto-processing modes.
+
+The system detects deleted objects after a full listing is complete. Only then can it be certain that a previously processed record is no longer present in the source.
+
+Unfortunately, object stores (and other external storage locations supported by our volumes) have limited query capabilities. This means:
+!!! Note
+Change detection for volumes is based on polling, i.e., repeated listing. This might be an expensive operation when using cloud object stores like AWS S3.
+You can use a long `background_sync_interval` (like once per day) on pipelines with volume sources to control this cost.
+!!!
@@ -16,7 +16,7 @@ Data for processing can be stored in the database in a table or in an external s
 If you want to use an external storage location to access data, you must create a storage location.
 This storage location can be an S3 bucket or a local file system.
 
-The storage locations can be used by AI Accelerator to create a volume. This volume can then be used by a retriever to access its data.
+The storage locations can be used by AI Accelerator to create a volume. This volume can then be used by a knowledge base to access its data.
 
 ### Create a preparer (optional)
 
@@ -44,49 +44,48 @@ When a preparer is created, by default it assumes column identifiers of "id" for
 
 ### Create a model
 
-Create a [model](models) with AI Accelerator Pipelines. This model can be a machine learning model, a deep learning model, or any other type of model that can be used for AI tasks.
+Create a [model](../models) with AI Accelerator Pipelines. This model can be a machine learning model, a deep learning model, or any other type of model that can be used for AI tasks.
 
-### Create a retriever
+### Create a knowledge base
 
-Create a retriever with AI Accelerator Pipelines. A retriever is a function that retrieves data from a table or volume and returns it in a format that the model can use.
+Create a knowledge base with AI Accelerator Pipelines. A knowledge base is a function that retrieves data from a table or volume and returns it in a format that the model can use.
 
-By default, a retriever only needs:
+By default, a knowledge base only needs:
 
 * A name
 * The name of a model to use
 
-If the retriever is for a table, it also needs:
+If the knowledge base is for a table, it also needs:
 
 * The name of the source table
 * The name of the column in the source table that contains the data
 * The data type of the column
 
-If the retriever is for a volume, it needs:
+If the knowledge base is for a volume, it needs:
 
 * The name of the volume
 * The name of the column in the volume that contains the data
 
-When you create a retriever, by default a vector table is created to store the embeddings of the data that's retrieved.
+When you create a knowledge base, by default a vector table is created to store the embeddings of the data that's retrieved.
 This table has a column to store the embeddings and a column to store the key of the data.
 
-When you create the retriever, you can specify the name of the vector table and the name of the vector column and the key column. This ability is useful if you're migrating to aidb and want to use an existing vector table.
+When you create the knowledge base, you can specify the name of the vector table and the name of the vector column and the key column. This ability is useful if you're migrating to aidb and want to use an existing vector table.
 
 ### Create embeddings
 
 Embedding sees the data being retrieved from the source table or volume and encoded into a vector datatype. That vector data is then stored in the vector table.
 
-If the source table already has data/rows when the retriever is created, then you need to make a manual *bulk embedding* call. This call generates the embeddings for all the existing data in the source table.
+See [auto-processing](auto-processing) to understand how embedding computation can be configured and run.
 
-You can then activate auto-embedding to keep the embeddings in sync going forward. Auto-embedding uses Postgres triggers to detect insertions and updates to the source table and generates embeddings for the new data.
 
 ### Query data
 
-You can query the embedded data using the retriever. The retriever can return the key to the data or the data itself, depending on the query. You can query the data using a text query or an image query, depending on the type of data that's being retrieved.
+You can query the embedded data using the knowledge base. The knowledge base can return the key to the data or the data itself, depending on the query. You can query the data using a text query or an image query, depending on the type of data that's being retrieved.
 
 ### Next steps
 
-While auto-embedding is enabled, the embeddings are always up to date, and applications can use the retriever to query the data as needed.
+While auto-processing is enabled, the embeddings are always up to date, and applications can use the knowledge base to query the data as needed.
 
 ### Cleanup
 
-If the embeddings are no longer required, you can delete the retriever, drop the vector table, and delete the model.
+If the embeddings are no longer required, you can delete the knowledge base, drop the vector table, and delete the model.
@@ -8,41 +8,40 @@ description: Compatibility information for the EDB Postgres AI - AI Accelerator
 
 ### Supported platforms
 
-* Ubuntu 22.04LTS and 24.04LTS on X86/64
-* Debian 12 (Bookworm) on X86/64
-* Redhat/RHEL 9/8 on X86/64
+* Ubuntu 22.04LTS and 24.04LTS on X86/64.
+* Debian 12 (Bookworm) on X86/64 and ARM64.
+* Redhat/RHEL 9/8 on X86/64.
+* Redhat/RHEL 9 on ARM64.
 
 ### Not currently supported
 
-* ARM architectures
-* SLES
-* Debian before the current version 12
-* Non-Linux platforms
+* SLES.
+* Debian before the current version 12.
+* Non-Linux platforms.
 
 ### Supported PostgreSQL versions
 
-* EDB Postgres Advanced Server Version 14, 15, 16, and 17
-* EBD Postgres Extended Version 14, 15, 16, and 17
-* PostgreSQL 14, 15, 16, and 17
+* EDB Postgres Advanced Server Version 14, 15, 16, and 17.
+* EDB Postgres Extended Version 14, 15, 16, and 17.
+* PostgreSQL 14, 15, 16, and 17.
 
 ## pgfs
 
 ### Supported platforms
 
-* Ubuntu 22.04LTS and 24.04LTS on X86/64
-* Debian 12 (Bookworm) on X86/64
-* Debian 12 (Bookworm) on X86/64 and ARM64
-* Redhat/RHEL 9/8 on X86/64
-* Redhat/RHEL 9 on ARM64
+* Ubuntu 22.04LTS and 24.04LTS on X86/64.
+* Debian 12 (Bookworm) on X86/64 and ARM64.
+* Redhat/RHEL 9/8 on X86/64.
+* Redhat/RHEL 9 on ARM64.
 
 ### Not currently supported
 
-* SLES
-* Debian before the current version 12
-* Non-Linux platforms
+* SLES.
+* Debian before the current version 12.
+* Non-Linux platforms.
 
 ### Supported PostgreSQL versions
 
-* EDB Postgres Advanced Server Version 14, 15, 16, and 17
-* EBD Postgres Extended Version 14, 15, 16, and 17
-* PostgreSQL 14, 15, 16, and 17
+* EDB Postgres Advanced Server Version 14, 15, 16, and 17.
+* EDB Postgres Extended Version 14, 15, 16, and 17.
+* PostgreSQL 14, 15, 16, and 17.