Choosing the auto-processing mode

Also refer to auto-processing capabilities for more background.

You can choose between Live, Background, and Disabled auto-processing. You configure the mode when creating a knowledge base with the aidb.create_table_knowledge_base or aidb.create_volume_knowledge_base call.

You can also change the mode on an existing knowledge base, together with processing settings such as batch size or sync interval, using the aidb.set_auto_knowledge_base call.

Prepare the input data

The following examples all use this common source table:

CREATE TABLE test_source_table
(
    id               INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    content          TEXT NOT NULL
);
INSERT INTO test_source_table
VALUES (1, 'Catwalk Women Brown Heels'),
       (2, 'Lakme 3 in 1 Orchid  Aqua Shine Lip Color'),
       (3, 'United Colors of Benetton Men Stripes Black Jacket');

Using a knowledge base with live auto processing

This processing mode is supported only for table sources because it relies on Postgres triggers. It also doesn't use batch processing: each record is processed individually, as soon as it is created or modified.

Create the knowledge base and configure live auto-processing:

SELECT aidb.create_table_knowledge_base(
    name => 'test_kb_live',
    model_name => 'bert',  -- this is a pre-defined locally running model
    source_table => 'test_source_table',
    source_data_column => 'content',
    source_data_format => 'Text',
    auto_processing => 'Live'
);
Output
INFO:  using vector table: public.test_kb_live_vector
NOTICE:  auto-processing is set to "Live". AIDB will process all new/updates rows going forward. To process existing data, manually run "SELECT aidb.bulk_embedding('test_kb_live');"
 create_table_knowledge_base
-----------------------------
 test_kb_live
(1 row)

Or manually set an existing knowledge base to live auto-processing:

SELECT aidb.set_auto_knowledge_base('test_kb_live', 'Live');

Live auto-processing is now in effect for any future changes to the source data. However, the data that already existed in the table has not been processed yet.

Test this by adding a new record:

INSERT INTO test_source_table VALUES (4, 'This is also a Jacket');
Output
INFO:  Running live embedding for knowledge_base test_kb_live. key: "4" content: "This is also a Jacket"
INSERT 0 1

A query will now return the new record, which was auto-processed, but not the pre-existing records:

SELECT * FROM aidb.retrieve_text('test_kb_live', 'Jackets', topk=>4);
Output
 key |         value         |      distance
-----+-----------------------+--------------------
 4   | This is also a Jacket | 0.9696326387986498
(1 row)

Now run the bulk_embedding call suggested in the NOTICE above:

SELECT aidb.bulk_embedding('test_kb_live');
Output
INFO:  test_kb_live: (re)setting state table to process all data...
INFO:  test_kb_live: Starting... Batch size 100, unprocessed rows: 4, count(source records): 4, count(embeddings): 1
INFO:  test_kb_live: Batch iteration finished, unprocessed rows: 0, count(source records): 4, count(embeddings): 4
INFO:  test_kb_live: finished, unprocessed rows: 0, count(source records): 4, count(embeddings): 4
 bulk_embedding
----------------

(1 row)

Running the retrieval again now returns all rows:

SELECT * FROM aidb.retrieve_text('test_kb_live', 'Jackets', topk=>4);
Output
 key |                       value                        |      distance
-----+----------------------------------------------------+--------------------
 4   | This is also a Jacket                              | 1.0353689874817347
 1   | Catwalk Women Brown Heels                          | 1.0404924577497399
 3   | United Colors of Benetton Men Stripes Black Jacket |  1.324741586641547
 2   | Lakme 3 in 1 Orchid  Aqua Shine Lip Color          | 1.4018234945521992
(4 rows)

You can query the statistics view to see how many records are present and how many have been processed so far:

SELECT * FROM aidb.kbstat;
Output
 knowledge base | auto processing | table: unprocessed rows | volume: scans completed | count(source records) | count(embeddings)
----------------+-----------------+-------------------------+-------------------------+-----------------------+-------------------
 test_kb_live   | Live            |                       0 |                         |                     4 |                 4
(1 row)
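
Because live mode reacts to modified records as well as new ones, an UPDATE on the source table should re-embed just the affected row. The following is a minimal sketch under that assumption, reusing test_kb_live from above. The replacement text is made up for illustration, and the last statement restores the original content so the remaining examples still match.

```sql
-- Modify an existing row; in Live mode the trigger should re-embed it immediately.
UPDATE test_source_table
SET content = 'This is also a Raincoat'
WHERE id = 4;

-- The value returned for key 4 should now show the updated text.
SELECT * FROM aidb.retrieve_text('test_kb_live', 'Raincoat', topk=>4);

-- Restore the original text so later examples are unaffected.
UPDATE test_source_table
SET content = 'This is also a Jacket'
WHERE id = 4;
```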

Using a knowledge base with background auto processing

Background auto-processing is supported on all source types and data types.

Note

Background workers for AIDB must be configured in the Postgres config file; otherwise they will not run. Refer to our installation instructions for more details.

Create the knowledge base and configure background auto-processing. The batch size and sync interval settings are optional.

  • Short sync intervals reduce the delay before embeddings are back in sync with the source. However, short intervals also cause more processing overhead, especially with volume sources, which require (potentially costly) requests to the storage system.
  • Larger batch sizes can increase throughput up to a point. The optimum depends heavily on the infrastructure, data source, type of model, and even the data itself, so we recommend testing different batch sizes.
SELECT aidb.create_table_knowledge_base(
    name => 'test_kb_background',
    model_name => 'bert',  -- this is a pre-defined locally running model
    source_table => 'test_source_table',
    source_data_column => 'content',
    source_data_format => 'Text',
    auto_processing => 'Background',
    batch_size => 50,
    background_sync_interval => '10 seconds'
);
Output
INFO:  using vector table: public.test_kb_background_vector
INFO:  test_kb_background: (Re)setting state table to process all data...
NOTICE:  auto-processing is set to "Background". Check progress with "SELECT * FROM aidb.kbstat;"
 create_table_knowledge_base
-----------------------------
 test_kb_background
(1 row)

Or manually set an existing knowledge base to background auto-processing:

SELECT aidb.set_auto_knowledge_base('test_kb_background', 'Background');
Note

When enabling background auto-processing on a pre-existing table knowledge base, only new change events will be processed. If the source table contains unprocessed source data, you can use the aidb.bulk_embedding() function to trigger re-processing of all records.
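
As a sketch of the sequence this note describes, the two calls below first switch an existing knowledge base to background processing and then trigger a full pass over the existing rows. Both functions appear elsewhere on this page. Whether the full pass is actually needed depends on whether the source table holds unprocessed data.

```sql
-- Enable background auto-processing; only new change events are picked up from here on.
SELECT aidb.set_auto_knowledge_base('test_kb_background', 'Background');

-- Process everything that was already in the source table.
SELECT aidb.bulk_embedding('test_kb_background');
```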

To change the batch size, the sync interval, or both:

SELECT aidb.set_auto_knowledge_base('test_kb_background', 'Background', batch_size=>10);
SELECT aidb.set_auto_knowledge_base('test_kb_background', 'Background', background_sync_interval=>'10 minutes');
SELECT aidb.set_auto_knowledge_base('test_kb_background', 'Background', batch_size=>10, background_sync_interval=>'10 minutes');

We can now query the statistics like so:

SELECT * FROM aidb.kbstat;
Output
   knowledge base   | auto processing | table: unprocessed rows | volume: scans completed | count(source records) | count(embeddings)
--------------------+-----------------+-------------------------+-------------------------+-----------------------+-------------------
 test_kb_background | Background      |                       0 |                         |                     4 |                 4
(1 row)

When new data is inserted into the source table, AIDB table triggers create a change event. Pending change events appear in the statistics output above as table: unprocessed rows. When the current background_sync_interval expires, the background worker processes all of these changed rows.

We can show this by inserting a new record:

INSERT INTO test_source_table VALUES (5, 'This is also another Jacket');

and then checking the statistics again:

SELECT * FROM aidb.kbstat;
Output
   knowledge base   | auto processing | table: unprocessed rows | volume: scans completed | count(source records) | count(embeddings)
--------------------+-----------------+-------------------------+-------------------------+-----------------------+-------------------
 test_kb_background | Background      |                       1 |                         |                     5 |                 4
(1 row)
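
To confirm that the background worker picked up the change, one option is to wait for at least one sync interval and query again. This sketch assumes the 10-second interval from the creation example is still in effect (lengthen the wait if you changed it). It also assumes that the view's column names match the headers printed above, which is why they are double-quoted, and that the count columns are numeric. pg_sleep is a standard Postgres function.

```sql
-- Wait a little longer than the configured background_sync_interval.
SELECT pg_sleep(15);

-- Show only this knowledge base and how many source records still lack embeddings.
SELECT "knowledge base",
       "table: unprocessed rows",
       "count(source records)" - "count(embeddings)" AS missing_embeddings
FROM aidb.kbstat
WHERE "knowledge base" = 'test_kb_background';

-- Once the row has been processed, the new record should be retrievable.
SELECT * FROM aidb.retrieve_text('test_kb_background', 'Jackets', topk=>5);
```

If missing_embeddings is still greater than zero, the background worker may not be configured, as described in the note at the start of this section.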

Using a knowledge base with disabled auto processing

In this mode, no auto-processing happens. Users can manually call aidb.bulk_embedding() or enable one of the two auto-processing modes with the aidb.set_auto_knowledge_base() call.
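
A minimal sketch of this workflow follows. It assumes that 'Disabled' is accepted as the mode value in the same way as 'Live' and 'Background', and the inserted row is made up for illustration.

```sql
-- Switch off auto-processing ('Disabled' as the mode string is an assumption).
SELECT aidb.set_auto_knowledge_base('test_kb_background', 'Disabled');

-- Changes made now are not embedded automatically for this knowledge base.
INSERT INTO test_source_table VALUES (6, 'Yet another Jacket');

-- Process the data manually when convenient.
SELECT aidb.bulk_embedding('test_kb_background');
```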
