Knowledge bases v1.4.0 (LTS)

A knowledge base is a vector store created by a pipeline whose last step is KnowledgeBase. It holds embeddings generated from your source data and supports semantic similarity search for retrieval-augmented generation (RAG) and other AI workloads.

Knowledge base list

Navigate to Sovereign AI > Knowledge Bases to see all knowledge bases in your project. Each entry shows:

  • Name: The knowledge base identifier, derived from the pipeline name.

  • Origin: The cluster and database where the knowledge base resides.

  • Model: The embedding model used to generate vectors.

  • Source Pipelines: The pipelines feeding data into this knowledge base. A knowledge base can have one or more source pipelines. When more than three pipelines are associated, the list shows the first three with a +N more link to the knowledge base detail page.

  • Source Records: Total rows across source tables.

  • Result Records: Number of vector embeddings stored.

  • Unprocessed Records: Source rows that haven't yet been processed.

Knowledge base detail

Select a knowledge base to view its detail page.

Record counts

Metric                 Description
Source records         Total rows in the source table
Embeddings             Number of vector embeddings stored in the knowledge base
Unprocessed records    Source rows that haven't yet been processed

When the pipeline is operating as expected, the unprocessed count is zero (for Background and Live modes) or equal to the source record count minus the embedding count (for On Demand mode).
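As a rough illustration of the rule above, the following sketch compares an observed unprocessed count with the value this page describes for each processing mode. The function names and the mode strings are assumptions made for the example; they are not part of any Pipeline Designer API, and the counts would be read manually from the detail page.

```python
def expected_unprocessed(mode: str, source_records: int, embeddings: int) -> int:
    """Return the unprocessed count this page describes for a healthy pipeline."""
    if mode in ("Background", "Live"):
        # Automatic modes should catch up completely.
        return 0
    if mode == "On Demand":
        # Nothing is processed until a run is triggered manually.
        return source_records - embeddings
    raise ValueError(f"unknown processing mode: {mode!r}")


def looks_healthy(mode: str, source_records: int, embeddings: int, unprocessed: int) -> bool:
    """True when the observed unprocessed count matches the expected one."""
    return unprocessed == expected_unprocessed(mode, source_records, embeddings)


if __name__ == "__main__":
    print(looks_healthy("Live", 1000, 1000, 0))
    print(looks_healthy("On Demand", 1000, 400, 600))
    print(looks_healthy("Background", 1000, 900, 100))
```

A False result for Background or Live mode only signals that the pipeline has not caught up yet; it does not by itself indicate an error.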

Source pipelines

The detail page lists all pipelines that feed data into this knowledge base. A knowledge base can receive data from multiple pipelines (N:1 relationship), enabling you to aggregate data from different sources and processing chains into a single searchable knowledge base.

Multi-pipeline knowledge bases

Pipeline Designer supports multiple pipelines writing to the same knowledge base. This lets you aggregate data from different source tables or apply different processing chains (for example, one pipeline for PDF documents and another for HTML content) into a single knowledge base for unified semantic search.

To add a pipeline to an existing knowledge base:

  1. Create a new pipeline following the pipeline creation wizard.

  2. Add a KnowledgeBase step as the final step.

  3. In the Knowledge Base Table Name field, search for and select the existing knowledge base you want to append to.

  4. Select the Data Format (Text or Image). The model, distance operator, and index type are inherited from the existing knowledge base.

  5. Deploy the pipeline.

The new pipeline's data is appended to the existing knowledge base's vector table. Both the knowledge base list and detail pages show all source pipelines associated with each knowledge base.

Testing retrieval

The knowledge base detail page includes a built-in query tool for testing semantic search without writing SQL.

Running a test query

  1. On the knowledge base detail page, select Run Test Query.

  2. In the Perform Test Query Retrieval dialog, enter your search text in the Search Query field.

  3. Optionally adjust the Number of Results parameter.

  4. Click Run Query.

The query tool encodes your text using the knowledge base's embedding model, performs a vector similarity search, and returns the top matching records with their distance scores. Lower distance scores indicate closer semantic matches. The results appear in the same dialog beneath the query fields, with each matching record shown as a row containing the source key and its distance score.
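To make the ranking step concrete, here is a minimal brute-force sketch of a top-N search by L2 distance. It is not how the query tool is implemented: the real tool encodes text with the knowledge base's embedding model and searches the stored vectors, whereas this example uses hand-written two-dimensional vectors and hypothetical source keys.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vector, stored, k=3):
    """Return the k nearest (source_key, distance) pairs, closest first.

    `stored` maps a source key to its vector.
    """
    scored = [(key, l2_distance(query_vector, vec)) for key, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1])  # lower distance = closer match
    return scored[:k]


if __name__ == "__main__":
    # Toy vectors standing in for stored embeddings.
    stored = {"doc-1": [0.0, 0.0], "doc-2": [3.0, 4.0], "doc-3": [1.0, 0.0]}
    for key, distance in top_k([0.0, 0.0], stored, k=2):
        print(key, distance)
```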

Interpreting results

Each result includes:

  • Source key: The key column value from the original source row, allowing you to trace results back to source data.

  • Distance: The similarity distance between the query vector and the result vector, using the knowledge base's configured distance operator (L2 by default).

Test queries are useful for validating that your pipeline is producing meaningful embeddings and that the chunking strategy produces segments of appropriate granularity. If results seem poor, consider adjusting the ChunkText step's desired length or trying a different embedding model.

Knowledge base lifecycle

Knowledge bases are created automatically when you deploy a pipeline with a KnowledgeBase terminal step. They can't be created independently through Pipeline Designer.

When you delete a pipeline that has a KnowledgeBase step, the outcome depends on whether other pipelines still feed the knowledge base. If other pipelines still reference it, only the deleted pipeline's association is removed and the knowledge base is kept. If the deleted pipeline was the last one associated with the knowledge base, the knowledge base and its vector table are deleted as well, and all stored embeddings are permanently removed.
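The deletion rule behaves like reference counting. The sketch below models it with an in-memory dictionary; the function, the data structure, and the pipeline and knowledge base names are all hypothetical and only illustrate the described behavior, not how Pipeline Designer stores these associations.

```python
def delete_pipeline(kb_sources, pipeline):
    """Remove `pipeline` from every knowledge base it feeds.

    `kb_sources` maps a knowledge base name to the set of pipelines feeding it.
    Returns the names of knowledge bases dropped because they lost their
    last source pipeline.
    """
    dropped = []
    for kb_name in list(kb_sources):  # copy keys: the dict is mutated below
        pipelines = kb_sources[kb_name]
        if pipeline not in pipelines:
            continue
        pipelines.discard(pipeline)   # remove only this association
        if not pipelines:             # last pipeline gone: drop the knowledge base
            del kb_sources[kb_name]
            dropped.append(kb_name)
    return dropped


if __name__ == "__main__":
    sources = {"docs_kb": {"pdf_pipeline", "html_pipeline"}}
    print(delete_pipeline(sources, "pdf_pipeline"), sources)
    print(delete_pipeline(sources, "html_pipeline"), sources)
```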

Warning

Deleting a knowledge base is irreversible. All vector data is permanently removed. If the knowledge base is used by Langflow flows or other consumers, those integrations will break.

Using knowledge bases with Langflow

Knowledge bases created through Pipeline Designer are accessible from the Langflow-based Gen AI builder. The Langflow KB component connects to AIDB knowledge bases using the same aidb.retrieve_key() and aidb.retrieve_text() functions that power the Pipeline Designer query tool.

To use a Pipeline Designer knowledge base in Langflow:

  1. Create and populate a knowledge base through Pipeline Designer as described above.

  2. In Langflow, add a KB component to your flow.

  3. Configure the component with the database connection details and the knowledge base name.

  4. The component queries the knowledge base for semantically similar content based on the flow's input.
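For consumers that query the knowledge base directly instead of through the KB component, the same retrieval functions can be called over a regular database connection. The sketch below only builds a parameterized statement. It assumes that both functions take the knowledge base name, the query text, and the number of results, in that order; confirm the signatures against the AIDB reference before relying on them. The helper name and the example knowledge base name are hypothetical.

```python
def build_retrieval_query(kb_name, query_text, top_k=5, with_text=True):
    """Return (sql, params) for a DB-API driver that uses %s placeholders.

    Assumed argument order: knowledge base name, query text, number of results.
    """
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    # retrieve_text also returns the stored text; retrieve_key returns keys and distances.
    function = "aidb.retrieve_text" if with_text else "aidb.retrieve_key"
    sql = f"SELECT * FROM {function}(%s, %s, %s)"
    return sql, (kb_name, query_text, top_k)


if __name__ == "__main__":
    sql, params = build_retrieval_query("docs_kb", "how do I rotate credentials?", top_k=3)
    print(sql)
    print(params)
```

With a driver such as psycopg, the pair can be passed to cursor.execute(sql, params). Keeping the values as parameters avoids interpolating user text into the SQL string.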

Langflow database identity

Langflow components that connect to Postgres use the database credentials configured within the flow, not a system-managed role. The Postgres role used by a Langflow flow determines which knowledge bases and tables it can access through standard Postgres privilege rules. If the configured role does not have SELECT access to the knowledge base's underlying vector table, queries fail with a permission error. Granting the Langflow role membership in visual_pipeline_user (for example, GRANT visual_pipeline_user TO langflow_role) gives it access to all objects owned by visual_pipeline_user.
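The sketch below shows one way to generate the grant and a follow-up privilege check from a script. The helper names and the vector table name are hypothetical; has_table_privilege is a built-in Postgres function, and langflow_role is the example role name used above.

```python
def quote_ident(name):
    """Quote a Postgres identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def grant_membership_sql(member_role, group_role="visual_pipeline_user"):
    """Build the GRANT statement that makes member_role a member of group_role."""
    return f"GRANT {quote_ident(group_role)} TO {quote_ident(member_role)};"


def select_privilege_check(role, table):
    """Return (sql, params) that checks whether `role` can SELECT from `table`."""
    return "SELECT has_table_privilege(%s, %s, 'SELECT')", (role, table)


if __name__ == "__main__":
    print(grant_membership_sql("langflow_role"))
    # "public.docs_kb_vector" is a placeholder, not a real table name.
    print(select_privilege_check("langflow_role", "public.docs_kb_vector"))
```

Because membership grants access to everything visual_pipeline_user owns, granting SELECT on a single vector table may be preferable where the Langflow role should see only one knowledge base.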

Troubleshooting

No embeddings after pipeline runs

If the source record count is positive but the embedding count remains zero:

  1. Check the pipeline status for errors. "Has Errors" or "Failed" status indicates processing problems.

  2. Verify that the embedding model is reachable from the pipeline's cluster. Hybrid Manager (HM)-hosted and HM-proxied models are only reachable from primary-location clusters. If the pipeline runs on a secondary-location or self-managed cluster, calls to those models will fail. See Executing pipelines: How models reach pipeline steps.

  3. If using an HM-hosted model on the primary location, confirm the KServe InferenceService is healthy.

  4. Check that the source data column contains valid content for the configured pipeline steps.

  5. Review the AIDB error log for detailed error messages. Connect to the database and run the following query, replacing your_pipeline_name with the name of your pipeline: SELECT * FROM aidb.get_error_logs('your_pipeline_name');
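The checklist above can be read as a small decision procedure. The following sketch encodes it for illustration only: the function name, the argument values such as "hm-hosted" and "primary", and the returned messages are assumptions made for the example, and the status strings are the ones quoted in step 1.

```python
# Parameterized form of the error log query from step 5.
ERROR_LOG_QUERY = "SELECT * FROM aidb.get_error_logs(%s)"


def triage_no_embeddings(status, cluster_location, model_hosting):
    """Return the checks from the list above that apply, in order."""
    checks = []
    if status in ("Has Errors", "Failed"):
        checks.append("inspect the pipeline errors")
    if model_hosting in ("hm-hosted", "hm-proxied"):
        if cluster_location != "primary":
            # HM models are only reachable from primary-location clusters.
            checks.append("model unreachable from this cluster")
        elif model_hosting == "hm-hosted":
            checks.append("confirm the KServe InferenceService is healthy")
    checks.append("validate the source column content")
    checks.append("review the AIDB error log")
    return checks


if __name__ == "__main__":
    for check in triage_no_embeddings("Failed", "secondary", "hm-proxied"):
        print("-", check)
```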

Unprocessed count not decreasing

If the unprocessed count remains static:

  1. Verify the processing mode isn't On Demand. If On Demand, trigger a manual run or switch to Background or Live.

  2. For Background mode, confirm the sync interval has elapsed since the last data change.

  3. Check for pipeline-blocking errors that halt processing entirely.
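A minimal sketch of the same reasoning, assuming you know the processing mode, the time of the last data change, and the configured sync interval. The function name and the returned messages are hypothetical.

```python
from datetime import datetime, timedelta


def diagnose_static_unprocessed(mode, last_change, sync_interval, now):
    """Suggest the next step when the unprocessed count is not decreasing."""
    if mode == "On Demand":
        return "trigger a manual run or switch to Background or Live"
    elapsed = now - last_change
    if mode == "Background" and elapsed < sync_interval:
        remaining = sync_interval - elapsed
        return f"wait for the next sync (about {int(remaining.total_seconds())}s)"
    # Live mode, or Background mode with the interval already elapsed.
    return "check for pipeline-blocking errors"


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 12, 0, 0)
    print(diagnose_static_unprocessed(
        "Background", now - timedelta(seconds=30), timedelta(seconds=60), now))
```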

Poor retrieval quality

If test queries return irrelevant results:

  1. Review your chunking strategy. Chunks that are too large or too small produce suboptimal embeddings.

  2. Consider a different embedding model. Some models perform better on specific content types.

  3. Verify that the source data is clean and relevant. HTML tags, boilerplate text, or corrupt content degrade embedding quality. Consider adding a ParseHTML step before chunking.
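To see why markup hurts embedding quality, it can help to compare raw HTML with the text that remains after tags are removed. The sketch below uses Python's standard html.parser module as a rough stand-in for that kind of cleaning. It is not the ParseHTML step and makes no claim about how that step works; the class and function names are invented for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the contents of script and style elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    """Return the visible text of `markup` with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Joining with spaces keeps adjacent block elements apart, at the cost of
    # splitting a word that is interrupted by an inline tag.
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    print(html_to_text("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
```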