OfficialDatabricks Data & Analytics SQL

Databricks AI Functions

Name: Databricks AI Functions
Author: databricks

By databricks· databricks/databricks-agent-skills· 0

Add AI tasks like classification, extraction, and summarization directly to SQL and PySpark pipelines.

Installation

1
Make sure Claude is on your device and in your terminal.
Skills load from ~/.claude/skills/ when Claude Code starts up — so you need it on your machine first. If you don't have it yet, install it once with the command below, then run claude in any terminal to verify.
One-time setup
```
npm i -g @anthropic-ai/claude-code
```
Already have it? Skip ahead.

Paste into Claude Code or into your terminal.

Install

git clone ht••••••••••••••••••••••••••••••••••••••••••••••••••••••• •••••••••••••••••••••••••••••••••••••••• •• ••••• •• •••••••••••••••••••••••••••••••••••••••• •• •• •• ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• •••••••••••••••••••••••••••••••••••••••••

This copies the whole skill folder into ~/.claude/skills/databricks-ai-functions/ — the SKILL.md plus any scripts, reference docs, or templates the skill ships with. Safe default: works for every skill.

Faster alternative (instruction-only skills)

Skips the clone and grabs only the SKILL.md file. Don't use this if the skill ships Python scripts, reference markdowns, or asset templates — they won't be downloaded and the skill will fail when it tries to load them.

Quick install (SKILL.md only)

mkdir -p ~/.••••••••••••••••••••••••••••••••••••• •• •••• ••••• ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• •• •••••••••••••••••••••••••••••••••••••••••••••••••

Restart Claude Code.
Quit and reopen Claude Code (or any other agent that loads from ~/.claude/skills/). New skills are picked up on startup.
Just ask Claude.
Skills auto-activate when your request matches the skill's description — no slash command needed. Trigger phrases live in the skill's own frontmatter; you can read them in the “What this skill does” section above.

Prefer to read the source first? Open on GitHub.

When Claude uses it

Use Databricks built-in AI Functions (ai_classify, ai_extract, ai_summarize, ai_mask, ai_translate, ai_fix_grammar, ai_gen, ai_analyze_sentiment, ai_similarity, ai_parse_document, ai_query, ai_forecast) to add AI capabilities directly to SQL and PySpark pipelines without managing model endpoints. Also covers document parsing and building custom RAG pipelines (parse → chunk → index → query).

What this skill does

Databricks AI Functions

Official Docs: https://docs.databricks.com/large-language-models/ai-functions Individual function reference: https://docs.databricks.com/sql/language-manual/functions/

Overview

Databricks AI Functions are built-in SQL and PySpark functions that call Foundation Model APIs directly from your data pipelines — no model endpoint setup, no API keys, no boilerplate. They operate on table columns as naturally as UPPER() or LENGTH(), and are optimized for batch inference at scale.

There are three categories:

Category	Functions	Use when
Task-specific	`ai_analyze_sentiment`, `ai_classify`, `ai_extract`, `ai_fix_grammar`, `ai_gen`, `ai_mask`, `ai_similarity`, `ai_summarize`, `ai_translate`, `ai_parse_document`	The task is well-defined — prefer these always
General-purpose	`ai_query`	Complex nested JSON, custom endpoints, multimodal — last resort only
Table-valued	`ai_forecast`	Time series forecasting

Function selection rule — always prefer a task-specific function over ai_query:

Task	Use this	Fall back to `ai_query` when...
Sentiment scoring	`ai_analyze_sentiment`	Never
Fixed-label routing	`ai_classify` (2–500 labels; add descriptions for accuracy)	Never
Entity / field extraction	`ai_extract`	Never
Summarization	`ai_summarize`	Never — use `max_words=0` for uncapped
Grammar correction	`ai_fix_grammar`	Never
Translation	`ai_translate`	Target language not in the supported list
PII redaction	`ai_mask`	Never
Free-form generation	`ai_gen`	Need structured JSON output
Semantic similarity	`ai_similarity`	Never
PDF / document parsing	`ai_parse_document`	Need image-level reasoning
Complex JSON / reasoning	—	This is the intended use case for `ai_query`

Prerequisites

Databricks SQL warehouse (not Classic) or cluster with DBR 15.1+
DBR 15.4 ML LTS recommended for batch workloads
DBR 17.1+ required for ai_parse_document
ai_forecast requires a Pro or Serverless SQL warehouse
Workspace in a supported AWS/Azure region for batch AI inference
Models run under Apache 2.0 or LLAMA 3.3 Community License — customers are responsible for compliance

Quick Start

Classify, extract, and score sentiment from a text column in a single query:

SELECT
    ticket_id,
    ticket_text,
    ai_classify(ticket_text, ARRAY('urgent', 'not urgent', 'spam')) AS priority,
    ai_extract(ticket_text, ARRAY('product', 'error_code', 'date'))  AS entities,
    ai_analyze_sentiment(ticket_text)                                 AS sentiment
FROM support_tickets;

from pyspark.sql.functions import expr

df = spark.table("support_tickets")
df = (
    df.withColumn("priority",  expr("ai_classify(ticket_text, array('urgent', 'not urgent', 'spam'))"))
      .withColumn("entities",  expr("ai_extract(ticket_text, array('product', 'error_code', 'date'))"))
      .withColumn("sentiment", expr("ai_analyze_sentiment(ticket_text)"))
)
# Access nested STRUCT fields from ai_extract
df.select("ticket_id", "priority", "sentiment",
          "entities.product", "entities.error_code", "entities.date").display()

Common Patterns

Pattern 1: Text Analysis Pipeline

Chain multiple task-specific functions to enrich a text column in one pass:

SELECT
    id,
    content,
    ai_analyze_sentiment(content)               AS sentiment,
    ai_summarize(content, 30)                   AS summary,
    ai_classify(content,
        ARRAY('technical', 'billing', 'other')) AS category,
    ai_fix_grammar(content)                     AS content_clean
FROM raw_feedback;

Pattern 2: PII Redaction Before Storage

from pyspark.sql.functions import expr

df_clean = (
    spark.table("raw_messages")
    .withColumn(
        "message_safe",
        expr("ai_mask(message, array('person', 'email', 'phone', 'address'))")
    )
)
df_clean.write.format("delta").mode("append").saveAsTable("catalog.schema.messages_safe")

Pattern 3: Document Ingestion from a Unity Catalog Volume

Parse PDFs/Office docs, then enrich with task-specific functions:

from pyspark.sql.functions import expr

df = (
    spark.read.format("binaryFile")
    .load("/Volumes/catalog/schema/landing/documents/")
    .withColumn("parsed", expr("ai_parse_document(content)"))
    .selectExpr("path",
                "parsed:pages[*].elements[*].content AS text_blocks",
                "parsed:error AS parse_error")
    .filter("parse_error IS NULL")
    .withColumn("summary",  expr("ai_summarize(text_blocks, 50)"))
    .withColumn("entities", expr("ai_extract(text_blocks, array('date', 'amount', 'vendor'))"))
)

Pattern 4: Semantic Matching / Deduplication

-- Find near-duplicate company names
SELECT a.id, b.id, ai_similarity(a.name, b.name) AS score
FROM companies a
JOIN companies b ON a.id < b.id
WHERE ai_similarity(a.name, b.name) > 0.85;

Pattern 5: Complex JSON Extraction with `ai_query` (last resort)

Use only when the output schema has nested arrays or requires multi-step reasoning that no task-specific function handles:

from pyspark.sql.functions import expr, from_json, col

df = (
    spark.table("parsed_documents")
    .withColumn("ai_response", expr("""
        ai_query(
            'databricks-claude-sonnet-4',
            concat('Extract invoice as JSON with nested itens array: ', text_blocks),
            responseFormat => '{"type":"json_object"}',
            failOnError     => false
        )
    """))
    .withColumn("invoice", from_json(
        col("ai_response.response"),
        "STRUCT<numero:STRING, total:DOUBLE, "
        "itens:ARRAY<STRUCT<codigo:STRING, descricao:STRING, qtde:DOUBLE, vlrUnit:DOUBLE>>>"
    ))
)

Pattern 6: Time Series Forecasting

SELECT *
FROM ai_forecast(
    observed  => TABLE(SELECT date, sales FROM daily_sales),
    horizon   => '2026-12-31',
    time_col  => 'date',
    value_col => 'sales'
);
-- Returns: date, sales_forecast, sales_upper, sales_lower

Reference Files

references/1-task-functions.md — Full syntax, parameters, SQL + PySpark examples for all 9 task-specific functions (ai_analyze_sentiment, ai_classify, ai_extract, ai_fix_grammar, ai_gen, ai_mask, ai_similarity, ai_summarize, ai_translate) and ai_parse_document
references/2-ai-query.md — ai_query complete reference: all parameters, structured output with responseFormat, multimodal files =>, UDF patterns, and error handling
references/3-ai-forecast.md — ai_forecast parameters, single-metric, multi-group, multi-metric, and confidence interval patterns
references/4-document-processing-pipeline.md — End-to-end batch document processing pipeline using AI Functions in a Lakeflow Declarative Pipeline; includes config.yml centralization, function selection logic, custom RAG pipeline (parse → chunk → Vector Search), and DSPy/LangChain guidance for near-real-time variants

Common Issues

Issue	Solution
`ai_parse_document` not found	Requires DBR 17.1+. Check cluster runtime.
`ai_forecast` fails	Requires Pro or Serverless SQL warehouse — not available on Classic or Starter.
All functions return NULL	Input column is NULL. Filter with `WHERE col IS NOT NULL` before calling.
`ai_translate` fails for a language	Supported: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai. Use `ai_query` with a multilingual model for others.
`ai_classify` returns unexpected labels	Use clear, mutually exclusive label names. Fewer labels (2–5) produces more reliable results.
`ai_query` raises on some rows in a batch job	Add `failOnError => false` — returns a STRUCT with `.response` and `.error` instead of raising.
Batch job runs slowly	Use DBR 15.4 ML LTS cluster (not serverless or interactive) for optimized batch inference throughput.
Want to swap models without editing pipeline code	Store all model names and prompts in `config.yml` — see references/4-document-processing-pipeline.md for the pattern.

Related skills

Databricks Core

databricks

Authenticate, configure, and explore data with Databricks CLI commands.

Official

Databricks DABs Manager

databricks

Create, configure, and deploy Databricks Declarative Automation Bundles for dashboards, jobs, and pipelines.

Official

Databricks Jobs

databricks

Create and deploy data engineering jobs on Databricks using notebooks, Python, SQL, or pipelines.

Official

Databricks Pipelines

databricks

Build batch or streaming data pipelines on Databricks with Python or SQL.

Official

Installation

When Claude uses it

What this skill does

Databricks AI Functions

Overview

Prerequisites

Quick Start

Common Patterns

Pattern 1: Text Analysis Pipeline

Pattern 2: PII Redaction Before Storage

Pattern 3: Document Ingestion from a Unity Catalog Volume

Pattern 4: Semantic Matching / Deduplication

Pattern 5: Complex JSON Extraction with ai_query (last resort)

Pattern 6: Time Series Forecasting

Reference Files

Common Issues

Related skills

Databricks Core

Databricks DABs Manager

Databricks Jobs

Databricks Pipelines

Pattern 5: Complex JSON Extraction with `ai_query` (last resort)