Quickstart — Use an Inference Endpoint with Python
Use this quickstart to chat with a model deployed behind your private InferenceService (KServe). You supply the endpoint URL, an access key, and the model name. The example uses a production-ready, single-file Python script.
Time to complete: 5–10 minutes
Goals
- Call a private, governed model endpoint (OpenAI-compatible) over HTTP(S)
- Understand required headers and payload shape for chat completions
- Run a minimal demo script you can reuse in apps and CI/CD
Prerequisites
- An InferenceService that is deployed and ready.
- The external access path and an access key, or an internal cluster-local path.
- Python 3.9+ with `httpx` installed: `pip install httpx`
Environment
Set these environment variables before running the script:
```shell
export EDB_BASE_URL="https://<portal>/inferenceservices/<inferenceservice-id>"
export EDB_API_KEY="<hm-user-access-key>"
export MODEL_NAME="meta/llama-3.1-8b-instruct"   # or the model you serve
```
Notes:
- For internal callers, set `EDB_BASE_URL` to your cluster-local path (proxy or direct KServe URL).
- The chat endpoint is `/v1/chat/completions` appended to `EDB_BASE_URL`.
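The following sketch shows one way to read these variables in Python and derive the chat URL before making any calls. The `load_settings` helper is illustrative only and is not part of the quickstart script.

```python
# Minimal sketch: read the quickstart's connection settings and fail fast
# if any are missing. Variable names follow this page; adjust as needed.
import os

def load_settings() -> dict:
    settings = {}
    for name in ("EDB_BASE_URL", "EDB_API_KEY", "MODEL_NAME"):
        value = os.environ.get(name)
        if not value:
            raise SystemExit(f"Missing required environment variable: {name}")
        settings[name] = value
    # The chat endpoint is the base URL plus the operation-specific path.
    settings["CHAT_URL"] = settings["EDB_BASE_URL"].rstrip("/") + "/v1/chat/completions"
    return settings

if __name__ == "__main__":
    print(load_settings()["CHAT_URL"])
```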
Run the demo script
Download and run the production-ready, single-file script:
- Script: hm_kserve_quickstart.py
Example (custom prompt):
python hm_kserve_quickstart.py chat --prompt "Write a haiku about Postgres and GPUs."
Optional (use Hybrid Manager to list clusters and summarize):
```shell
pip install httpx typer
export EDB_API_URL="https://<hm-host>"
export EDB_API_TOKEN="<hm-access-token>"
export PROJECT_ID="<uuid>"
python hm_kserve_quickstart.py summarize-clusters --project-id "$PROJECT_ID"
```
Request and headers (reference)
- Endpoint: `POST ${EDB_BASE_URL}/v1/chat/completions`
- Headers: `Authorization: Bearer ${EDB_API_KEY}`, `Accept: application/json`, `Content-Type: application/json`
- Body (simplified):

```json
{
  "model": "${MODEL_NAME}",
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "max_tokens": 256
}
```
Embeddings and rerank models use different paths:
- Embeddings: `${EDB_BASE_URL}/v1/embeddings`
- Rerank: `${EDB_BASE_URL}/v1/ranking`
If you see HTTP 404 Not Found, verify the operation‑specific path.
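A minimal embeddings sketch, assuming the served model exposes the OpenAI-compatible `/v1/embeddings` operation and accepts the standard `model`/`input` payload; the model name placeholder is an example:

```python
# Minimal sketch of an embeddings request against the operation-specific path.
import os
import httpx

base_url = os.environ["EDB_BASE_URL"].rstrip("/")
headers = {
    "Authorization": f"Bearer {os.environ['EDB_API_KEY']}",
    "Content-Type": "application/json",
}
payload = {
    "model": os.environ.get("MODEL_NAME", "<embedding-model-name>"),
    "input": ["EDB Postgres AI quickstart"],
}

response = httpx.post(f"{base_url}/v1/embeddings", headers=headers, json=payload, timeout=60.0)
if response.status_code == 404:
    # A 404 usually means the operation suffix is missing or wrong for this model.
    raise SystemExit("Check that the URL ends with the operation-specific path, e.g. /v1/embeddings")
response.raise_for_status()
print(len(response.json()["data"][0]["embedding"]), "dimensions")
```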
Best practices
- Keep inference sovereign: prefer internal paths when the caller runs in‑cluster.
- Rotate access keys regularly; never commit them to source control.
- Enforce TLS and limit egress from clients.
- Monitor latency and error rates; scale resources or tune concurrency as needed.
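One way to act on the last two points is to set explicit timeouts and connection limits on the client. The sketch below is a starting point, not a recommendation; the numbers are placeholders to tune against your own latency and error-rate measurements.

```python
import httpx

# Placeholders: tune against observed latency and the concurrency your
# InferenceService can sustain.
timeout = httpx.Timeout(connect=5.0, read=60.0, write=10.0, pool=5.0)
limits = httpx.Limits(max_connections=20, max_keepalive_connections=10)

# verify=True (the default) keeps TLS certificate checks enforced.
with httpx.Client(timeout=timeout, limits=limits, verify=True) as client:
    ...  # reuse this client for all requests so connections are pooled
```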
Troubleshooting
- 401/403: Verify `EDB_API_KEY` and the user's permissions on the InferenceService.
- 404: Confirm the InferenceService ID and that the service is ready.
- Timeouts/5xx: Check KServe pod status, health probes, and logs; validate endpoint path.
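The sketch below maps those status codes to the checks above; `explain_failure` is a hypothetical helper you might wrap around any of the calls in this quickstart.

```python
import httpx

def explain_failure(response: httpx.Response) -> str:
    """Translate a failed response into the troubleshooting step to try first."""
    if response.status_code in (401, 403):
        return "Check EDB_API_KEY and the user's permissions on the InferenceService."
    if response.status_code == 404:
        return "Check the InferenceService ID, readiness, and the operation-specific path."
    if response.status_code >= 500:
        return "Check KServe pod status, health probes, and logs."
    return "Unexpected status; inspect the response body."
```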
Next steps
- Build an app using the same headers and payloads; see the Python client quickstart.
- Integrate with Gen AI Assistants and Knowledge Bases for RAG; see Gen AI and Pipelines.
- Add observability and SLOs; see Model observability and Update GPU resources.
Known issues in 1.3
- Some internal model URLs shown in listings may omit the operation suffix (for example, embeddings require `/v1/embeddings`). If a call returns 404, append the appropriate suffix. This will be addressed in a future release.
- External access requires a valid Hybrid Manager user access key with the right role (for example, Gen AI Builder User). A malformed key or insufficient permissions return HTTP 401.
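As a stopgap for the first item, a small helper can append the missing operation suffix before retrying; `with_operation_suffix` is illustrative only.

```python
def with_operation_suffix(url: str, operation: str = "/v1/embeddings") -> str:
    """Append the operation-specific path if a listed URL omits it."""
    url = url.rstrip("/")
    return url if url.endswith(operation) else url + operation
```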