Databricks Model Serving
Create and manage scalable REST API endpoints for LLMs and ML models.
Installation
- Make sure Claude is on your device and in your terminal.
Skills load from
~/.claude/skills/when Claude Code starts up — so you need it on your machine first. If you don't have it yet, install it once with the command below, then runclaudein any terminal to verify.One-time setupnpm i -g @anthropic-ai/claude-codeAlready have it? Skip ahead.
- Paste into Claude Code or into your terminal.
This copies the whole skill folder into
~/.claude/skills/databricks-model-serving/— the SKILL.md plus any scripts, reference docs, or templates the skill ships with. Safe default: works for every skill.Faster alternative (instruction-only skills)
Skips the clone and grabs only the SKILL.md file. Don't use this if the skill ships Python scripts, reference markdowns, or asset templates — they won't be downloaded and the skill will fail when it tries to load them.
Quick install (SKILL.md only)Sign up to copy - Restart Claude Code.
Quit and reopen Claude Code (or any other agent that loads from
~/.claude/skills/). New skills are picked up on startup. - Just ask Claude.
Skills auto-activate when your request matches the skill's description — no slash command needed. Trigger phrases live in the skill's own frontmatter; you can read them in the “What this skill does” section above.
Prefer to read the source first? Open on GitHub.
When Claude uses it
Manage Databricks Model Serving endpoints via CLI. Use when asked to create, configure, query, or manage model serving endpoints for LLM inference, custom models, or external models.
What this skill does
Model Serving Endpoints
FIRST: Use the parent databricks-core skill for CLI basics, authentication, and profile selection.
Model Serving provides managed endpoints for serving LLMs, custom ML models, and external models as scalable REST APIs. Endpoints are identified by name (unique per workspace).
Endpoint Types
| Type | When to Use | Key Detail |
|---|---|---|
| Pay-per-token | Foundation Model APIs (Llama, DBRX, etc.) | Uses system.ai.* catalog models, simplest setup |
| Provisioned throughput | Dedicated GPU capacity | Guaranteed throughput, higher cost |
| Custom model | Your own MLflow models or containers | Deploy any model with an MLflow signature |
Endpoint Structure
Serving Endpoint (top-level, identified by NAME)
├── Config
│ ├── Served Entities (model references + scaling config)
│ └── Traffic Config (routing percentages across entities)
├── AI Gateway (rate limits, usage tracking)
└── State (READY / NOT_READY, config_update status)
- Served Entities: Each entity references a model (from Unity Catalog or MLflow) with scaling parameters. Get the entity name from
served_entities[].namein thegetoutput — needed forbuild-logsandlogscommands. - Traffic Config: Routes requests across served entities by percentage (for A/B testing, canary deployments).
- State: Endpoints transition
NOT_READY→READYafter creation or config update. Poll viagetto checkstate.ready.
CLI Discovery — ALWAYS Do This First
Do NOT guess command syntax. Discover available commands and their usage dynamically:
# List all serving-endpoints subcommands
databricks serving-endpoints -h
# Get detailed usage for any subcommand (flags, args, JSON fields)
databricks serving-endpoints <subcommand> -h
Run databricks serving-endpoints -h before constructing any command. Run databricks serving-endpoints <subcommand> -h to discover exact flags, positional arguments, and JSON spec fields for that subcommand.
Create an Endpoint
Do NOT list endpoints before creating.
databricks serving-endpoints create <ENDPOINT_NAME> \
--json '{
"served_entities": [{
"entity_name": "<MODEL_CATALOG_PATH>",
"entity_version": "<VERSION>",
"min_provisioned_throughput": 0,
"max_provisioned_throughput": 0,
"workload_size": "Small",
"scale_to_zero_enabled": true
}],
"traffic_config": {
"routes": [{
"served_entity_name": "<ENTITY_NAME>",
"traffic_percentage": 100
}]
}
}' --profile <PROFILE>
- Discover available Foundation Models: check the
system.aicatalog in Unity Catalog, or usedatabricks serving-endpoints list --profile <PROFILE>to see available endpoints. Usedatabricks serving-endpoints get-open-api <ENDPOINT_NAME> --profile <PROFILE>to inspect the endpoint's API schema. - Long-running operation; the CLI waits for completion by default. Use
--no-waitto return immediately, then poll:databricks serving-endpoints get <ENDPOINT_NAME> --profile <PROFILE> # Check: state.ready == "READY" - For provisioned throughput or custom model endpoints, run
databricks serving-endpoints create -hto discover the required JSON fields for your endpoint type.
Endpoint Readiness
After create or update-config, the endpoint provisions compute and loads the model. Do not query the endpoint until it is ready.
Poll for readiness:
databricks serving-endpoints get <ENDPOINT_NAME> --profile <PROFILE> -o json
# Ready when: state.ready == "READY" AND state.config_update == "NOT_UPDATING"
Provisioning may take several minutes. Provisioned throughput endpoints take the longest (GPU allocation). Queries to endpoints that are not yet READY return 404 or 503 errors.
Query an Endpoint
databricks serving-endpoints query <ENDPOINT_NAME> \
--json '{"messages": [{"role": "user", "content": "Hello, how are you?"}]}' \
--profile <PROFILE>
- Use
--streamfor streaming responses. - For non-chat endpoints (embeddings, custom models): use
get-open-api <ENDPOINT_NAME>first to discover the request/response schema, then construct the appropriate JSON payload.
Get Endpoint Schema (OpenAPI)
Returns the OpenAPI 3.1 JSON schema describing what each served model accepts and returns. Use this to understand an endpoint's input/output format before querying it.
databricks serving-endpoints get-open-api <ENDPOINT_NAME> --profile <PROFILE>
The schema shows paths per served model (e.g., /served-models/<model-name>/invocations) with full request/response definitions including parameter types, enums, and nullable fields.
Other Commands
Run databricks serving-endpoints <subcommand> -h for usage details.
| Task | Command | Notes |
|---|---|---|
| List all endpoints | list | |
| Get endpoint details | get <NAME> | Shows state, config, served entities |
| Delete endpoint | delete <NAME> | |
| Update served entities or traffic | update-config <NAME> --json '...' | Zero-downtime: old config serves until new is ready |
| Rate limits & usage tracking | put-ai-gateway <NAME> --json '...' | |
| Update tags | patch <NAME> --json '...' | |
| Build logs | build-logs <NAME> <SERVED_MODEL> | Get SERVED_MODEL from get output: served_entities[].name |
| Runtime logs | logs <NAME> <SERVED_MODEL> | |
| Metrics (Prometheus format) | export-metrics <NAME> | |
| Permissions | get-permissions <ENDPOINT_ID> | ⚠️ Uses endpoint ID (hex string), not name. Find ID via get. |
What's Next
Integrate with a Databricks App
After creating a serving endpoint, wire it into a Databricks App.
Step 1 — Check if the serving plugin is available in the AppKit template:
databricks apps manifest --profile <PROFILE>
If the output includes a serving plugin, scaffold with:
databricks apps init --name <APP_NAME> \
--features serving \
--set "serving.serving-endpoint.name=<ENDPOINT_NAME>" \
--run none --profile <PROFILE>
Step 2 — If no serving plugin, add the endpoint resource manually to an existing app's databricks.yml:
resources:
apps:
my_app:
resources:
- name: my-model-endpoint
serving_endpoint:
name: <ENDPOINT_NAME>
permission: CAN_QUERY
And inject the endpoint name as an environment variable in app.yaml:
env:
- name: SERVING_ENDPOINT
valueFrom: serving-endpoint
Then add a tRPC route to call it from your app. For the full app integration pattern, use the databricks-apps skill and read the Model Serving Guide.
Troubleshooting
| Error | Solution |
|---|---|
cannot configure default credentials | Use --profile flag or authenticate first |
PERMISSION_DENIED | Check workspace permissions; for apps, ensure serving_endpoint resource declared with CAN_QUERY |
Endpoint stuck in NOT_READY | Wait up to 30 min for provisioned throughput. Check build logs: build-logs <NAME> <ENTITY_NAME> (get entity name from get output → served_entities[].name) |
RESOURCE_DOES_NOT_EXIST | Verify endpoint name with list |
| Query returns 404 | Endpoint may still be provisioning; check state.ready via get |
RATE_LIMIT_EXCEEDED (429) | AI Gateway rate limit; check put-ai-gateway config or retry after backoff |
Related skills
Databricks Core
databricks
Authenticate, configure, and explore data with Databricks CLI commands.
Databricks DABs Manager
databricks
Create, configure, and deploy Databricks Declarative Automation Bundles for dashboards, jobs, and pipelines.
Databricks Jobs
databricks
Create and deploy data engineering jobs on Databricks using notebooks, Python, SQL, or pipelines.
Databricks Pipelines
databricks
Build batch or streaming data pipelines on Databricks with Python or SQL.