Pipelines knowledge bases reference
This reference documentation for Pipelines knowledge bases includes information on the functions and views available in the aidb extension related to knowledge bases.
Views
aidb.knowledge_bases
Also referenceable as aidb.kbs
, the aidb.knowledge_bases
view provides a list of all knowledge bases in the database. It includes information about the knowledge base name, the model used, and the source data type.
Column | Type | Description |
---|---|---|
id | integer | |
name | text | Name of the knowledge base. |
vector_schema | text | Schema for vector_table. |
vector_table | text | Name of the table where the embeddings are stored. Gets newly created if it doesn’t exist. Managed by aidb. |
vector_key_column | text | Column to use to store the key that references the key in source data when computing embeddings. We recommend using the default and letting aidb manage this table. |
vector_data_column | text | Column to store embeddings in. We recommend using the default and letting aidb manage this table. |
model_name | text | Name of the model to use for embedding computation and retrievals. |
topk | integer | Default number of results to return during a retrieve. Similar to LIMIT in SQL. |
distance_operator | aidb.DistanceOperator | During retrieval, the vector operation to use to compare the vectors. |
options | jsonb | Unused. |
auto_processing | aidb.PipelineAutoProcessingMode | Auto-processing mode. |
owner_role | text | The Postgres role who created this pipeline. Background auto-processing will run the pipeline as this role. |
batch_size | integer | How many records to process concurrently when running this pipeline. |
background_sync_interval | interval | Used for background auto-processing. This is the interval between pipeline executions. |
source_type | text | Indicates whether this pipeline uses a table or volume as data source. |
source_schema | text | Schema for source_table. |
source_table | text | Name of the table used as input for the pipeline. Unused if the knowledge base uses a volume as source. |
source_data_column | text | Column name in the source table that Pipelines computes embeddings for. This is also the column that's returned in retrieve operations. |
source_data_format | aidb.PipelineDataFormat | Format of the data the knowledge base is working with. Uses type aidb.PipelineDataFormat . |
source_key_column | text | Column to use as key for storing the embedding in the vector table. This provides a reference from the embedding to the source data. |
source_volume_name | text | Name of the volume to use as a data source. Only applicable to knowledge bases configured with aidb.create_volume_knowledge_base(). |
aidb.knowledge_base_stats
Also referenceable as aidb.kbstat
, the aidb.knowledge_base_stats
view provides current statistics about auto processing for knowledge base pipelines.
Column | Type | Description |
---|---|---|
knowledge base | text | Name of the knowledge base. |
auto processing | aidb.PipelineAutoProcessingMode | Auto-processing mode. |
table: unprocessed rows | bigint | For table knowledge bases: The number of unprocessed rows that are new or might have changed since the last execution. |
volume: scans completed | bigint | For volume knowledge bases: The number of full listings of the source volume that were performed so far. |
count(source records) | bigint | The number of records in the source of the knowledge base. For volume knowledge bases, this is number of records seen during the last full scan. |
count(embeddings) | bigint | The number of embeddings stored in the destination table. |
Types
aidb.DistanceOperator
The aidb.DistanceOperator
type is an enum that represents the distance operators that can be used during retrieval.
Value | Description |
---|---|
L2 | Euclidean distance |
Inner | Inner product |
Cosine | Cosine similarity |
L1 | L1 distance |
Hamming | Hamming distance |
Jaccard | Jaccard distance |
SQL definition:
CREATE TYPE DistanceOperator AS ENUM ( 'L2', 'InnerProduct', 'Cosine', 'L1', 'Hamming', 'Jaccard' );
aidb.PipelineDataFormat
The aidb.PipelineDataFormat
type is an enum that represents the data formats that can be used as source data.
Value | Description |
---|---|
Text | Text data |
Image | Image data |
PDF data |
SQL definition:
CREATE TYPE PipelineDataFormat AS ENUM ( 'Text', 'Image', 'Pdf' );
aidb.PipelineAutoProcessingMode
The aidb.PipelineAutoProcessingMode
type is an enum used to define how auto processing for a Pipeline shoud behave (e.g. a knowledge base or a Preparer).
Value | Description |
---|---|
Live | New data is processed immediately while being added (using Postgres Triggers) |
Background | Continuous processing in the background (using Postgres background workers) |
Disabled | No automated processing |
CREATE TYPE PipelineAutoProcessingMode AS ENUM ( 'Live', 'Background', 'Disabled' );
Functions
aidb.create_table_knowledge_base
Creates a knowledge base for a given table.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
name | TEXT | Required | Name of the knowledge base. |
model_name | TEXT | Required | Name of the model to use. |
source_table | regclass | Required | Name of the table to use as source. |
source_data_column | TEXT | Required | Column name in source table to use. |
source_data_format | aidb.PipelineDataFormat | Required | Format of data in that column ("Text", "Image", "PDF"). |
source_key_column | TEXT | 'id' | Column to use as key to reference the rows. |
vector_table | TEXT | NULL | |
vector_data_column | TEXT | 'embeddings' | |
vector_key_column | TEXT | 'id' | |
topk | INTEGER | 1 | |
distance_operator | aidb.distanceoperator | 'L2' | |
options | JSONB | '{}'::JSONB | Unused |
auto_processing | aidb.PipelineAutoProcessingMode | 'Disabled' | Configure auto-processing for this pipeline. |
index_type | TEXT | 'vector' | Type of index to use for the vector table. |
batch_size | int | 100 | How many records to process concurrently. |
background_sync_interval | interval | '30 seconds' | Interval between pipeline executions if background auto-processing is configured. |
Index_types
If
index_type
is set tovector
, the system will automatically create a hnsw index on the vector table based on the distance operator used in the knowledge base. This is the default index type. Thevector
index type is only able to support 2000 dimensions or less. If more dimensions are needed, the index type should be set todisabled
.distance_operator index_type L2 vector_l2_ops
InnerProduct vector_ip_ops
Cosine vector_cosine_ops
L1 vector_l1_ops
If
index_type
is set toivfflat
, the system will create a IVFFlat index on the vector table.If
index_type
is set todisabled
, no index will be created.
Example
SELECT aidb.create_table_knowledge_base( name => 'test_knowledge_base', model_name => 'bert', source_table => 'test_source_table', source_data_column => 'content', source_data_type => 'Text' );
aidb.create_volume_knowledge_base
Creates a knowledge base for a given PGFS volume.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
name | TEXT | Required | Name of the knowledge base |
model_name | TEXT | Required | Name of the model |
source_volume_name | TEXT | Required | Name of the volume |
vector_table | TEXT | NULL | Name of the vector table |
vector_data_column | TEXT | 'embeddings' | Name of the vector column |
vector_key_column | TEXT | 'id' | Name of the key column |
topk | INTEGER | 1 | Number of results to return |
distance_operator | aidb.distanceoperator | 'L2' | Distance operator |
options | JSONB | '{}'::JSONB | Options |
auto_processing | aidb.PipelineAutoProcessingMode | 'Disabled' | Configure auto-processing for this pipeline. |
index_type | TEXT | 'vector' | Type of index to use for the vector table. |
batch_size | int | 100 | How many records to process concurrently. |
background_sync_interval | interval | '30 seconds' | Interval between pipeline executions if background auto-processing is configured. |
Index_types
If
index_type
is set tovector
, the system will automatically create a hnsw index on the vector table based on the distance operator used in the knowledge base. This is the default index type. Thevector
index type is only able to support 2000 dimensions or less. If more dimensions are needed, the index type should be set todisabled
.distance_operator index_type L2 vector_l2_ops
InnerProduct vector_ip_ops
Cosine vector_cosine_ops
L1 vector_l1_ops
If
index_type
is set toivfflat
, the system will create a IVFFlat index on the vector table.If
index_type
is set todisabled
, no index will be created.
Example
SELECT aidb.create_volume_knowledge_base( name => 'demo_vol_knowledge_base', model_name => 'simple_model', source_volume_name => 'demo_bucket_vol' );
aidb.set_auto_knowledge_base
Sets the auto-processing mode for this knowledge base.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
knowledge_base_name | TEXT | Name of knowledge base for which to enable auto-processing | |
mode | aidb.PipelineAutoProcessingMode | Desired auto-processing mode | |
batch_size | INTEGER DEFAULT NULL | How many records to process in one batch in Disabled (aka. Manual) or Background processing | |
background_sync_interval | INTERVAL DEFAULT NULL | Desired sync interval for background processing |
Example
SELECT aidb.set_auto_knowledge_base('test_knowledge_base', 'Live'); SELECT aidb.set_auto_knowledge_base('test_knowledge_base', 'Disabled', batch_size => 100); SELECT aidb.set_auto_knowledge_base('test_knowledge_base', 'Background'); SELECT aidb.set_auto_knowledge_base('test_knowledge_base', 'Background', batch_size => 100, background_sync_interval => '10 seconds');
aidb.bulk_embedding
Compute the embeddings for the source records in this pipeline and store them in vector destination.
The behavior of this function depends on the configured auto_processing mode:
- Live and Disabled auto-processing: Directly process all source records.
- Background auto-processing: Mark all source records for processing but don't perform the operation. The background worker will perform it.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
knowledge_base_name | TEXT | Name of knowledge base for which to generate embeddings. | |
silent | BOOL | false | Disable printing status and progress logs. |
Example
aidb=# SELECT aidb.bulk_embedding('kb_volume_image_manual');
INFO: kb_volume_image_manual: (re)setting state table to process all data... INFO: kb_volume_image_manual: Starting... Batch size 10, count(known source records): 0, scans completed: 0, count(embeddings): 0 INFO: kb_volume_image_manual: Batch iteration finished, count(known source records): 10, scans completed: 0, count(embeddings): 10 [...] INFO: kb_volume_image_manual: Batch iteration finished, count(known source records): 177, scans completed: 0, count(embeddings): 177 INFO: kb_volume_image_manual: finished, count(known source records): 177, scans completed: 1, count(embeddings): 177
aidb.retrieve_key
Retrieves a key from matching embeddings without looking up the source data.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
knowledge_base_name | TEXT | Name of knowledge base to use for retrieval | |
query_string | TEXT | Query string to use for retrieval | |
number_of_results | INTEGER | 0 | Number of results to return |
Example
SELECT * FROM aidb.retrieve_key('test_knowledge_base', 'shoes', 2);
key | distance -------+-------------------- 43941 | 0.2938963414490189 19337 | 0.3023805122617119 (2 rows)
aidb.retrieve_text
Retrieves the source text data from matching embeddings by joining the embeddings with the source table.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
knowledge_base_name | TEXT | Name of knowledge base to use for retrieval | |
query_string | TEXT | Query string to use for retrieval | |
number_of_results | INTEGER | 0 | Number of results to return |
Returns
Column | Type | Description |
---|---|---|
key | text | Key of the retrieved data |
value | text | Value of the retrieved data |
distance | double precision | Distance of the retrieved data from the query |
Example
SELECT * FROM aidb.retrieve_text('test_knowledge_base', 'jacket', 2);
key | value | distance -------+----------------------------------------------------+-------------------- 19337 | United Colors of Benetton Men Stripes Black Jacket | 0.2994317672742334 55018 | Lakme 3 in 1 Orchid Aqua Shine Lip Color | 0.3804609668507203 (2 rows)
aidb.delete_knowledge_base
Deletes only the knowledge base's configuration from the database.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
knowledge_base_name | TEXT | Name of knowledge base to delete |
Example
select aidb.delete_knowledge_base('test_knowledge_base');
delete_knowledge_base ------------------ (1 row)
aidb.create_volume
Creates a volume from a PGFS storage location for use as a data source in knowledge bases.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
name | TEXT | Name of the volume to create | |
server_name | TEXT | Name of the storage location to use for the volume | |
path | TEXT | Path to the volume in the storage location | |
mime_type | TEXT | Type of data in the volume (Text or Image) |
Example
select aidb.create_volume('demo_bucket_vol', 'demo_bucket', 'demo_bucket/demo_folder', 'Text');
create_volume --------------- (1 row)
aidb.list_volumes
Lists all the volumes that were created in the database.
Example
select * from aidb.list_volumes();
aidb.delete_volume
Deletes a volume from the database.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
volume_name | TEXT | Name of the volume to delete |
Example
select aidb.delete_volume('demo_bucket_vol');
delete_volume ------------- (1 row)
Could this page be better? Report a problem or suggest an addition!