
MLflow GenAI Evaluation

Set up and run AI evaluation pipelines with MLflow scorers and automated prompt optimization.

Installation

  1. Make sure Claude Code is installed and available in your terminal.

    Skills load from ~/.claude/skills/ when Claude Code starts up — so you need it on your machine first. If you don't have it yet, install it once with the command below, then run claude in any terminal to verify.

    One-time setup
    npm i -g @anthropic-ai/claude-code

    Already have it? Skip ahead.

  2. Paste into Claude Code or into your terminal.

    This copies the whole skill folder into ~/.claude/skills/databricks-mlflow-evaluation/ — the SKILL.md plus any scripts, reference docs, or templates the skill ships with. Safe default: works for every skill.

    Faster alternative (instruction-only skills)

    Skips the clone and grabs only the SKILL.md file. Don't use this if the skill ships Python scripts, reference markdowns, or asset templates — they won't be downloaded and the skill will fail when it tries to load them.

    Quick install (SKILL.md only)
  3. Restart Claude Code.

    Quit and reopen Claude Code (or any other agent that loads from ~/.claude/skills/). New skills are picked up on startup.

  4. Just ask Claude.

    Skills auto-activate when your request matches the skill's description, so no slash command is needed. Trigger phrases live in the skill's own frontmatter; you can read them in the “What this skill does” section below.

Prefer to read the source first? Open on GitHub.

When Claude uses it

MLflow 3 GenAI agent evaluation. Use when writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign from domain expert feedback, or running optimize_prompts() with GEPA for automated prompt improvement.

What this skill does

MLflow 3 GenAI Evaluation

Scope vs upstream mlflow/skills

The OSS mlflow/skills repo ships agent-evaluation and related skills (instrumenting-with-mlflow-tracing, analyze-mlflow-trace, retrieving-mlflow-traces, querying-mlflow-metrics) that cover the generic MLflow GenAI evaluation workflow — mlflow.genai.evaluate(), scorers/judges, datasets, tracing setup, and the 5-step evaluation loop.

This skill layers Databricks-specific patterns on top of that workflow rather than restating it. Use this skill when you need any of:

  • Unity Catalog trace ingestion — production traces written into UC tables, log-based monitoring (patterns-trace-ingestion.md).
  • MemAlign judge alignment via UC SME labeling sessions — aligning custom judges against domain-expert feedback collected in Databricks (patterns-judge-alignment.md).
  • optimize_prompts() GEPA loop — Databricks' automated prompt-optimization driver running on a UC dataset (patterns-prompt-optimization.md).
  • Databricks-flavored scorer/dataset patterns — UC-table-backed datasets, tagging traces in the Databricks UI for inclusion (patterns-datasets.md, patterns-scorers.md).

For everything else — generic mlflow.genai.evaluate() calls, scorer authoring patterns, dataset creation outside Databricks, MLflow tracing setup that isn't UC-table-bound — the upstream mlflow/skills/agent-evaluation skill is the canonical source and is kept current by the MLflow team.

Before Writing Any Code

  1. Read GOTCHAS.md - 15+ common mistakes that cause failures
  2. Read CRITICAL-interfaces.md - Exact API signatures and data schemas

End-to-End Workflows

Follow these workflows based on your goal. Each step indicates which reference files to read.

Workflow 1: First-Time Evaluation Setup

For users new to MLflow GenAI evaluation or setting up evaluation for a new agent.

| Step | Action | Reference Files |
| --- | --- | --- |
| 1 | Understand what to evaluate | user-journeys.md (Journey 0: Strategy) |
| 2 | Learn API patterns | GOTCHAS.md + CRITICAL-interfaces.md |
| 3 | Build initial dataset | patterns-datasets.md (Patterns 1-4) |
| 4 | Choose/create scorers | patterns-scorers.md + CRITICAL-interfaces.md (built-in list) |
| 5 | Run evaluation | patterns-evaluation.md (Patterns 1-3) |

Workflow 2: Production Trace -> Evaluation Dataset

For building evaluation datasets from production traces.

| Step | Action | Reference Files |
| --- | --- | --- |
| 1 | Search and filter traces | patterns-trace-analysis.md (MCP tools section) |
| 2 | Analyze trace quality | patterns-trace-analysis.md (Patterns 1-7) |
| 3 | Tag traces for inclusion | patterns-datasets.md (Patterns 16-17) |
| 4 | Build dataset from traces | patterns-datasets.md (Patterns 6-7) |
| 5 | Add expectations/ground truth | patterns-datasets.md (Pattern 2) |
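
Steps 3 to 5 can be sketched in plain Python. This is a minimal illustration, not the skill's own pattern: it assumes traces have already been exported as dictionaries with hypothetical `inputs` and `tags` keys, and the tag name `include_in_eval` and both helper functions are invented for the example. Real MLflow trace objects expose different attributes; see patterns-datasets.md for the actual patterns.

```python
def build_dataset(traces, tag_key="include_in_eval"):
    """Keep tagged traces and reshape them into evaluation records."""
    records = []
    for trace in traces:
        if trace.get("tags", {}).get(tag_key) != "true":
            continue  # Skip traces that were not tagged for inclusion.
        # Evaluation records use the nested {"inputs": {...}} structure.
        records.append({"inputs": dict(trace["inputs"])})
    return records


def add_expectations(records, expectations_by_query):
    """Attach ground truth to records whose query has a known expectation."""
    for record in records:
        query = record["inputs"].get("query")
        if query in expectations_by_query:
            record["expectations"] = expectations_by_query[query]
    return records


# Hypothetical exported traces: only the first one is tagged.
example_traces = [
    {"inputs": {"query": "What is Unity Catalog?"}, "tags": {"include_in_eval": "true"}},
    {"inputs": {"query": "test ping"}, "tags": {}},
]

dataset = add_expectations(
    build_dataset(example_traces),
    {"What is Unity Catalog?": {"expected_facts": ["placeholder fact"]}},
)
```

Untagged traces are dropped before expectations are attached, so ground truth only needs to be written for records that will actually be evaluated.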

Workflow 3: Performance Optimization

For debugging slow or expensive agent execution.

| Step | Action | Reference Files |
| --- | --- | --- |
| 1 | Profile latency by span | patterns-trace-analysis.md (Patterns 4-6) |
| 2 | Analyze token usage | patterns-trace-analysis.md (Pattern 9) |
| 3 | Detect context issues | patterns-context-optimization.md (Section 5) |
| 4 | Apply optimizations | patterns-context-optimization.md (Sections 1-4, 6) |
| 5 | Re-evaluate to measure impact | patterns-evaluation.md (Patterns 6-7) |
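
The idea behind step 1, profiling latency by span, can be shown without any MLflow dependency. The sketch below assumes spans have already been exported as plain dictionaries with hypothetical `name`, `start_ms`, and `end_ms` keys. Those field names and the `latency_by_span` helper are placeholders for illustration, not MLflow's span schema.

```python
from collections import defaultdict


def latency_by_span(spans):
    """Sum duration per span name; return (name, total_ms) pairs, slowest first."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["name"]] += span["end_ms"] - span["start_ms"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical spans from a single agent invocation.
example_spans = [
    {"name": "retriever", "start_ms": 0, "end_ms": 120},
    {"name": "llm_call", "start_ms": 120, "end_ms": 900},
    {"name": "retriever", "start_ms": 900, "end_ms": 980},
]

profile = latency_by_span(example_spans)
```

Grouping by span name rather than by individual span makes repeated calls (here, two retriever calls) show up as one line item, which is usually what matters when deciding where to optimize.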

Workflow 4: Regression Detection

For comparing agent versions and finding regressions.

| Step | Action | Reference Files |
| --- | --- | --- |
| 1 | Establish baseline | patterns-evaluation.md (Pattern 4: named runs) |
| 2 | Run current version | patterns-evaluation.md (Pattern 1) |
| 3 | Compare metrics | patterns-evaluation.md (Patterns 6-7) |
| 4 | Analyze failing traces | patterns-trace-analysis.md (Pattern 7) |
| 5 | Debug specific failures | patterns-trace-analysis.md (Patterns 8-9) |
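
Step 3, comparing metrics, reduces to a diff between two sets of aggregate scores. The following is a rough sketch under stated assumptions: the metric names are hypothetical, higher is assumed to be better for every metric, and `find_regressions` is an illustrative helper rather than an MLflow function. How the aggregate metrics are pulled from the two evaluation runs is left out.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return {metric: (baseline, current)} for metrics that dropped by more than tolerance."""
    regressions = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue  # Metric not produced by the current run; nothing to compare.
        if base_value - current[name] > tolerance:
            regressions[name] = (base_value, current[name])
    return regressions


# Hypothetical aggregate scores from a baseline run and a current run.
baseline_metrics = {"correctness/mean": 0.91, "safety/mean": 1.0}
current_metrics = {"correctness/mean": 0.84, "safety/mean": 1.0}

regressions = find_regressions(baseline_metrics, current_metrics)
```

The tolerance keeps small run-to-run fluctuations in judge scores from being flagged; the 0.02 default is arbitrary and should be tuned to the noise observed on a given dataset.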

Workflow 5: Custom Scorer Development

For creating project-specific evaluation metrics.

| Step | Action | Reference Files |
| --- | --- | --- |
| 1 | Understand scorer interface | CRITICAL-interfaces.md (Scorer section) |
| 2 | Choose scorer pattern | patterns-scorers.md (Patterns 4-11) |
| 3 | For multi-agent scorers | patterns-scorers.md (Patterns 13-16) |
| 4 | Test with evaluation | patterns-evaluation.md (Pattern 1) |
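
As a minimal sketch of what a custom scorer can look like, the example below keeps the scoring logic in a plain function so it can be unit-tested on its own, then wraps it with MLflow's `scorer` decorator when MLflow 3 is available. The 50-word limit and the name `concise_response` are arbitrary choices for illustration; CRITICAL-interfaces.md remains the reference for the exact scorer interface.

```python
def concise_response(outputs) -> bool:
    """Pass when the response is at most 50 words (an arbitrary illustrative limit)."""
    return len(str(outputs).split()) <= 50


try:
    # The decorator lives in mlflow.genai.scorers in MLflow 3.
    # Wrapping is skipped when MLflow is not installed.
    from mlflow.genai.scorers import scorer

    concise_response_scorer = scorer(concise_response)
except ImportError:
    concise_response_scorer = None
```

When the wrapped object is available, it could be passed in the `scorers` list of `mlflow.genai.evaluate()` alongside built-in scorers.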

Workflow 6: Unity Catalog Trace Ingestion & Production Monitoring

For storing traces in Unity Catalog, instrumenting applications, and enabling continuous production monitoring.

| Step | Action | Reference Files |
| --- | --- | --- |
| 1 | Link UC schema to experiment | patterns-trace-ingestion.md (Patterns 1-2) |
| 2 | Set trace destination | patterns-trace-ingestion.md (Patterns 3-4) |
| 3 | Instrument your application | patterns-trace-ingestion.md (Patterns 5-8) |
| 4 | Configure trace sources (Apps/Serving/OTEL) | patterns-trace-ingestion.md (Patterns 9-11) |
| 5 | Enable production monitoring | patterns-trace-ingestion.md (Patterns 12-13) |
| 6 | Query and analyze UC traces | patterns-trace-ingestion.md (Pattern 14) |

Workflow 7: Judge Alignment with MemAlign

For aligning an LLM judge to match domain expert preferences. A well-aligned judge improves every downstream use: evaluation accuracy, production monitoring signal, and prompt optimization quality. This workflow is valuable on its own, independent of prompt optimization.

| Step | Action | Reference Files |
| --- | --- | --- |
| 1 | Design base judge with make_judge (any feedback type) | patterns-judge-alignment.md (Pattern 1) |
| 2 | Run evaluate(), tag successful traces | patterns-judge-alignment.md (Pattern 2) |
| 3 | Build UC dataset + create SME labeling session | patterns-judge-alignment.md (Pattern 3) |
| 4 | Align judge with MemAlign after labeling completes | patterns-judge-alignment.md (Pattern 4) |
| 5 | Register aligned judge to experiment | patterns-judge-alignment.md (Pattern 5) |
| 6 | Re-evaluate with aligned judge (baseline) | patterns-judge-alignment.md (Pattern 6) |

Workflow 8: Automated Prompt Optimization with GEPA

For automatically improving a registered system prompt using optimize_prompts(). It works with any scorer, but pairing it with an aligned judge (Workflow 7) gives the most domain-accurate signal. For the full end-to-end loop combining alignment and optimization, see user-journeys.md Journey 10.

| Step | Action | Reference Files |
| --- | --- | --- |
| 1 | Build optimization dataset (inputs + expectations) | patterns-prompt-optimization.md (Pattern 1) |
| 2 | Run optimize_prompts() with GEPA + scorer | patterns-prompt-optimization.md (Pattern 2) |
| 3 | Register new version, promote conditionally | patterns-prompt-optimization.md (Pattern 3) |
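
Two checks bracket the optimization run: the dataset must carry both inputs and expectations (step 1), and the new prompt version should only be promoted if it actually scores better (step 3). The sketch below illustrates both in plain Python. The helper names, the `min_gain` threshold, and the example scores are assumptions for illustration; the `optimize_prompts()` call itself is omitted, and patterns-prompt-optimization.md covers it.

```python
def missing_expectations(records):
    """Return indexes of records lacking a non-empty 'inputs' or 'expectations' entry."""
    bad = []
    for index, record in enumerate(records):
        if not record.get("inputs") or not record.get("expectations"):
            bad.append(index)
    return bad


def should_promote(baseline_score, candidate_score, min_gain=0.01):
    """Promote only when the optimized prompt beats the baseline by at least min_gain."""
    return candidate_score - baseline_score >= min_gain


# Hypothetical optimization dataset: the second record has no expectations.
optimization_records = [
    {"inputs": {"query": "Summarize the ticket"}, "expectations": {"expected_response": "..."}},
    {"inputs": {"query": "Classify the ticket"}},
]

invalid_indexes = missing_expectations(optimization_records)
promote = should_promote(baseline_score=0.80, candidate_score=0.86)
```

Running the dataset check before optimization surfaces incomplete records early, rather than partway through a long optimization job.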

Reference Files Quick Lookup

| Reference | Purpose | When to Read |
| --- | --- | --- |
| GOTCHAS.md | Common mistakes | Always read first before writing code |
| CRITICAL-interfaces.md | API signatures, schemas | When writing any evaluation code |
| patterns-evaluation.md | Running evals, comparing | When executing evaluations |
| patterns-scorers.md | Custom scorer creation | When built-in scorers aren't enough |
| patterns-datasets.md | Dataset building | When preparing evaluation data |
| patterns-trace-analysis.md | Trace debugging | When analyzing agent behavior |
| patterns-context-optimization.md | Token/latency fixes | When agent is slow or expensive |
| patterns-trace-ingestion.md | UC trace setup, monitoring | When setting up trace storage or production monitoring |
| patterns-judge-alignment.md | MemAlign judge alignment, labeling sessions, SME feedback | When aligning judges to domain expert preferences |
| patterns-prompt-optimization.md | GEPA optimization: build dataset, optimize_prompts(), promote | When running automated prompt improvement |
| user-journeys.md | High-level workflows, full domain-expert optimization loop | When starting a new evaluation project or running the full align + optimize cycle |

Critical API Facts

  • Use: mlflow.genai.evaluate() (NOT mlflow.evaluate())
  • Data format: {"inputs": {"query": "..."}} (nested structure required)
  • predict_fn: Receives **unpacked kwargs (not a dict)
  • MemAlign: Scorer-agnostic (works with any feedback_value_type -- float, bool, categorical); token-heavy on the embedding model so set embedding_model explicitly
  • Label schema name matching: The label schema name in the labeling session MUST match the judge name used in evaluate() for align() to pair scores
  • Aligned judge scores: May be lower than unaligned judge scores -- this is expected and means the judge is now more accurate, not that the agent regressed
  • GEPA optimization dataset: Must have both inputs AND expectations per record (different from eval dataset)
  • Episodic memory: Lazily loaded -- get_scorer() results won't show episodic memory on print until the judge is first used
  • optimize_prompts: Requires MLflow >= 3.5.0
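
The first three facts fit together: each record nests its arguments under `inputs`, and those arguments are unpacked into `predict_fn` as keyword arguments. The sketch below shows that relationship with a stand-in agent. The `predict_fn` body and the example queries are placeholders, and `run_evaluation` is a hypothetical wrapper that is only defined here, not called, because it needs MLflow 3 installed and a judge model configured.

```python
eval_data = [
    {"inputs": {"query": "What is MLflow?"}},
    {"inputs": {"query": "How do I log a trace?"}},
]


def predict_fn(query: str) -> str:
    """Stand-in for an agent; the parameter name matches the key under 'inputs'."""
    return f"Echo: {query}"


# What the evaluation harness effectively does for each record:
# unpack the nested inputs dict into keyword arguments.
local_outputs = [predict_fn(**record["inputs"]) for record in eval_data]


def run_evaluation():
    """Run the same data through MLflow (requires MLflow 3 and a configured judge model)."""
    import mlflow
    from mlflow.genai.scorers import Safety

    # Note: mlflow.genai.evaluate(), not mlflow.evaluate().
    return mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=predict_fn,
        scorers=[Safety()],
    )
```

A `predict_fn` declared as `def predict_fn(inputs: dict)` would fail with this data, since the harness passes `query=...` rather than a single dictionary.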

See GOTCHAS.md for complete list.
