Quickstart — Use an Inference Endpoint with Python
Use this quickstart to chat with a model deployed behind your private InferenceService (KServe). You supply the endpoint URL, an access key, and the model name. The example uses a production-ready, single-file Python script.
Time to complete: 5–10 minutes
Goals
- Call a private, governed model endpoint (OpenAI-compatible) over HTTP(S)
- Understand required headers and payload shape for chat completions
- Run a minimal demo script you can reuse in apps and CI/CD
Prerequisites
- An InferenceService that is deployed and ready.
- The external access path and an access key, or an internal cluster-local path.
- Python 3.9+ with `httpx` installed: `pip install httpx`
Environment
Set these environment variables before running the script:
```shell
export EDB_BASE_URL="https://<portal>/inferenceservices/<inferenceservice-id>"
export EDB_API_KEY="<hm-user-access-key>"
export MODEL_NAME="meta/llama-3.1-8b-instruct"   # or the model you serve
```
Notes:
- For internal callers, set `EDB_BASE_URL` to your cluster-local path (proxy or direct KServe URL).
- The chat endpoint is `/v1/chat/completions` appended to `EDB_BASE_URL`.
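The following sketch shows one way to read these variables in Python and derive the chat URL before making any calls. The `load_settings` helper is illustrative only and is not part of the quickstart script.

```python
# Minimal sketch: read the quickstart's connection settings and fail fast
# if any are missing. Variable names follow this page; adjust as needed.
import os

def load_settings() -> dict:
    settings = {}
    for name in ("EDB_BASE_URL", "EDB_API_KEY", "MODEL_NAME"):
        value = os.environ.get(name)
        if not value:
            raise SystemExit(f"Missing required environment variable: {name}")
        settings[name] = value
    # The chat endpoint is the base URL plus the operation-specific path.
    settings["CHAT_URL"] = settings["EDB_BASE_URL"].rstrip("/") + "/v1/chat/completions"
    return settings

if __name__ == "__main__":
    print(load_settings()["CHAT_URL"])
```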
Run the demo script
Download and run the production-ready, single-file script:
- Script: hm_kserve_quickstart.py
Example (custom prompt):
python hm_kserve_quickstart.py chat --prompt "Write a haiku about Postgres and GPUs."
Optional (use Hybrid Manager to list clusters and summarize):
```shell
pip install httpx typer
export EDB_API_URL="https://<hm-host>"
export EDB_API_TOKEN="<hm-access-token>"
export PROJECT_ID="<uuid>"
python hm_kserve_quickstart.py summarize-clusters --project-id "$PROJECT_ID"
```
Request and headers (reference)
- Endpoint: `POST ${EDB_BASE_URL}/v1/chat/completions`
- Headers: `Authorization: Bearer ${EDB_API_KEY}`, `Accept: application/json`, `Content-Type: application/json`
- Body (simplified):

```json
{
  "model": "${MODEL_NAME}",
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "max_tokens": 256
}
```
Embeddings and rerank models use different paths:
- Embeddings: `${EDB_BASE_URL}/v1/embeddings`
- Rerank: `${EDB_BASE_URL}/v1/ranking`
If you see HTTP 404 Not Found, verify the operation‑specific path.
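A minimal embeddings sketch, assuming the served model exposes the OpenAI-compatible `/v1/embeddings` operation and accepts the standard `model`/`input` payload; the model name placeholder is an example:

```python
# Minimal sketch of an embeddings request against the operation-specific path.
import os
import httpx

base_url = os.environ["EDB_BASE_URL"].rstrip("/")
headers = {
    "Authorization": f"Bearer {os.environ['EDB_API_KEY']}",
    "Content-Type": "application/json",
}
payload = {
    "model": os.environ.get("MODEL_NAME", "<embedding-model-name>"),
    "input": ["EDB Postgres AI quickstart"],
}

response = httpx.post(f"{base_url}/v1/embeddings", headers=headers, json=payload, timeout=60.0)
if response.status_code == 404:
    # A 404 usually means the operation suffix is missing or wrong for this model.
    raise SystemExit("Check that the URL ends with the operation-specific path, e.g. /v1/embeddings")
response.raise_for_status()
print(len(response.json()["data"][0]["embedding"]), "dimensions")
```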
Best practices
- Keep inference sovereign: prefer internal paths when the caller runs in‑cluster.
- Rotate access keys regularly; never commit them to source control.
- Enforce TLS and limit egress from clients.
- Monitor latency and error rates; scale resources or tune concurrency as needed.
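One way to act on the last two points is to set explicit timeouts and connection limits on the client. The sketch below is a starting point, not a recommendation; the numbers are placeholders to tune against your own latency and error-rate measurements.

```python
import httpx

# Placeholders: tune against observed latency and the concurrency your
# InferenceService can sustain.
timeout = httpx.Timeout(connect=5.0, read=60.0, write=10.0, pool=5.0)
limits = httpx.Limits(max_connections=20, max_keepalive_connections=10)

# verify=True (the default) keeps TLS certificate checks enforced.
with httpx.Client(timeout=timeout, limits=limits, verify=True) as client:
    ...  # reuse this client for all requests so connections are pooled
```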
Troubleshooting
- 401/403: Verify `EDB_API_KEY` and the user's permissions on the InferenceService.
- 404: Confirm the InferenceService ID and that the service is ready.
- Timeouts/5xx: Check KServe pod status, health probes, and logs; validate endpoint path.
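The sketch below maps those status codes to the checks above; `explain_failure` is a hypothetical helper you might wrap around any of the calls in this quickstart.

```python
import httpx

def explain_failure(response: httpx.Response) -> str:
    """Translate a failed response into the troubleshooting step to try first."""
    if response.status_code in (401, 403):
        return "Check EDB_API_KEY and the user's permissions on the InferenceService."
    if response.status_code == 404:
        return "Check the InferenceService ID, readiness, and the operation-specific path."
    if response.status_code >= 500:
        return "Check KServe pod status, health probes, and logs."
    return "Unexpected status; inspect the response body."
```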
Next steps
- Build an app using the same headers and payloads; see the Python client quickstart.
- Integrate with Gen AI Assistants and Knowledge Bases for RAG; see Gen AI and Pipelines.
- Add observability and SLOs; see Model observability and Update GPU resources.
Known issues in 1.3
- Some internal model URLs shown in listings may omit the operation suffix (for example, embeddings require `/v1/embeddings`). If a call returns 404, append the appropriate suffix. This will be addressed in a future release.
- External access requires a valid Hybrid Manager user access key with the right role (for example, Gen AI Builder User). A malformed key or insufficient permissions return HTTP 401.
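As a stopgap for the first item, a small helper can append the missing operation suffix before retrying; `with_operation_suffix` is illustrative only.

```python
def with_operation_suffix(url: str, operation: str = "/v1/embeddings") -> str:
    """Append the operation-specific path if a listed URL omits it."""
    url = url.rstrip("/")
    return url if url.endswith(operation) else url + operation
```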