Databricks Reference Architecture
Design production-ready Databricks projects using best practices and the medallion pattern.
Installation
- Make sure Claude is on your device and in your terminal.
Skills load from
~/.claude/skills/when Claude Code starts up — so you need it on your machine first. If you don't have it yet, install it once with the command below, then runclaudein any terminal to verify.One-time setupnpm i -g @anthropic-ai/claude-codeAlready have it? Skip ahead.
- Paste into Claude Code or into your terminal.
This copies the whole skill folder into
~/.claude/skills/databricks-reference-architecture-jeremylongshore/— the SKILL.md plus any scripts, reference docs, or templates the skill ships with. Safe default: works for every skill.Faster alternative (instruction-only skills)
Skips the clone and grabs only the SKILL.md file. Don't use this if the skill ships Python scripts, reference markdowns, or asset templates — they won't be downloaded and the skill will fail when it tries to load them.
Quick install (SKILL.md only)Sign up to copy - Restart Claude Code.
Quit and reopen Claude Code (or any other agent that loads from
~/.claude/skills/). New skills are picked up on startup. - Just ask Claude.
Skills auto-activate when your request matches the skill's description — no slash command needed. Trigger phrases live in the skill's own frontmatter; you can read them in the “What this skill does” section above.
Prefer to read the source first? Open on GitHub.
When Claude uses it
Implement Databricks reference architecture with best-practice project layout. Use when designing new Databricks projects, reviewing architecture, or establishing standards for Databricks applications. Trigger with phrases like "databricks architecture", "databricks best practices", "databricks project structure", "how to organize databricks", "databricks layout".
What this skill does
Databricks Reference Architecture
Overview
Production-ready lakehouse architecture with Unity Catalog, Delta Lake, and the medallion pattern. Covers workspace organization, three-level namespace governance, compute strategy, CI/CD with Asset Bundles, and project structure for team collaboration.
Prerequisites
- Databricks workspace with Unity Catalog enabled
- Understanding of medallion architecture (bronze/silver/gold)
- Databricks CLI configured
- Terraform or Asset Bundles for infrastructure
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ UNITY CATALOG │
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌───────────┐ │
│ │ Bronze │ │ Silver │ │ Gold │ │ ML Models │ │
│ │ Catalog │─▶│ Catalog │─▶│ Catalog │ │ (MLflow) │ │
│ │ (raw) │ │ (clean) │ │ (curated) │ │ │ │
│ └────────────┘ └────────────┘ └────────────┘ └───────────┘ │
│ ▲ │ │
│ ┌────────────┐ ┌────────────────┐ │
│ │ Auto Loader│ │ Model Serving │ │
│ │ Ingestion │ │ Endpoints │ │
│ └────────────┘ └────────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ Compute: Job Clusters │ SQL Warehouses │ Instance Pools │
├─────────────────────────────────────────────────────────────────┤
│ Security: Row Filters │ Column Masks │ Secret Scopes │ SCIM │
├─────────────────────────────────────────────────────────────────┤
│ CI/CD: Asset Bundles │ GitHub Actions │ dev/staging/prod │
└─────────────────────────────────────────────────────────────────┘
Project Structure
databricks-platform/
├── src/
│ ├── ingestion/
│ │ ├── bronze_raw_events.py # Auto Loader streaming
│ │ ├── bronze_api_data.py # REST API batch ingestion
│ │ └── bronze_file_uploads.py # Manual file uploads
│ ├── transformation/
│ │ ├── silver_clean_events.py # Cleansing + dedup
│ │ ├── silver_schema_enforce.py # Schema validation
│ │ └── silver_scd2.py # Slowly changing dimensions
│ ├── aggregation/
│ │ ├── gold_daily_metrics.py # Business KPIs
│ │ ├── gold_user_features.py # ML feature engineering
│ │ └── gold_reporting.py # BI-ready views
│ └── ml/
│ ├── training/
│ │ └── train_churn_model.py
│ └── inference/
│ └── batch_scoring.py
├── tests/
│ ├── conftest.py # Spark fixtures
│ ├── unit/ # Local Spark tests
│ └── integration/ # Databricks Connect tests
├── resources/
│ ├── etl_jobs.yml # ETL job definitions
│ ├── ml_jobs.yml # ML pipeline definitions
│ └── maintenance.yml # OPTIMIZE/VACUUM schedules
├── databricks.yml # Asset Bundle root config
├── pyproject.toml
└── requirements.txt
Instructions
Step 1: Unity Catalog Hierarchy
-- One catalog per environment (or shared with schema isolation)
CREATE CATALOG IF NOT EXISTS dev_catalog;
CREATE CATALOG IF NOT EXISTS prod_catalog;
-- Medallion schemas per catalog
CREATE SCHEMA IF NOT EXISTS prod_catalog.bronze;
CREATE SCHEMA IF NOT EXISTS prod_catalog.silver;
CREATE SCHEMA IF NOT EXISTS prod_catalog.gold;
CREATE SCHEMA IF NOT EXISTS prod_catalog.ml_features;
CREATE SCHEMA IF NOT EXISTS prod_catalog.ml_models;
-- Permissions: engineers write bronze/silver, analysts read gold
GRANT USAGE ON CATALOG prod_catalog TO `data-engineers`;
GRANT CREATE, MODIFY, SELECT ON SCHEMA prod_catalog.bronze TO `data-engineers`;
GRANT CREATE, MODIFY, SELECT ON SCHEMA prod_catalog.silver TO `data-engineers`;
GRANT SELECT ON SCHEMA prod_catalog.gold TO `data-engineers`;
GRANT USAGE ON CATALOG prod_catalog TO `data-analysts`;
GRANT SELECT ON SCHEMA prod_catalog.gold TO `data-analysts`;
Step 2: Asset Bundle Configuration
# databricks.yml
bundle:
name: data-platform
workspace:
host: ${DATABRICKS_HOST}
include:
- resources/*.yml
variables:
catalog:
default: dev_catalog
alert_email:
default: dev@company.com
targets:
dev:
default: true
mode: development
workspace:
root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/dev
staging:
variables:
catalog: staging_catalog
prod:
mode: production
variables:
catalog: prod_catalog
alert_email: oncall@company.com
workspace:
root_path: /Shared/.bundle/${bundle.name}/prod
Step 3: Compute Strategy
# resources/etl_jobs.yml
resources:
jobs:
daily_etl:
name: "daily-etl-${bundle.target}"
schedule:
quartz_cron_expression: "0 0 6 * * ?"
timezone_id: "UTC"
max_concurrent_runs: 1
tasks:
- task_key: bronze
notebook_task:
notebook_path: src/ingestion/bronze_raw_events.py
job_cluster_key: etl
- task_key: silver
depends_on: [{task_key: bronze}]
notebook_task:
notebook_path: src/transformation/silver_clean_events.py
job_cluster_key: etl
- task_key: gold
depends_on: [{task_key: silver}]
notebook_task:
notebook_path: src/aggregation/gold_daily_metrics.py
job_cluster_key: etl
job_clusters:
- job_cluster_key: etl
new_cluster:
spark_version: "14.3.x-scala2.12"
node_type_id: "i3.xlarge"
autoscale:
min_workers: 1
max_workers: 4
aws_attributes:
availability: SPOT_WITH_FALLBACK
first_on_demand: 1
spark_conf:
spark.databricks.delta.optimizeWrite.enabled: "true"
spark.databricks.delta.autoCompact.enabled: "true"
Step 4: Medallion Pipeline Pattern
# src/ingestion/bronze_raw_events.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, input_file_name
spark = SparkSession.builder.getOrCreate()
# Bronze: Auto Loader for incremental file ingestion
raw = (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/checkpoints/bronze/events/schema")
.option("cloudFiles.inferColumnTypes", "true")
.load("s3://data-lake/raw/events/")
.withColumn("_ingested_at", current_timestamp())
.withColumn("_source_file", input_file_name())
)
(raw.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/checkpoints/bronze/events/data")
.toTable("prod_catalog.bronze.raw_events"))
Step 5: Table Maintenance Schedule
# resources/maintenance.yml
resources:
jobs:
weekly_optimize:
name: "maintenance-optimize-${bundle.target}"
schedule:
quartz_cron_expression: "0 0 2 ? * SUN"
timezone_id: "UTC"
tasks:
- task_key: optimize_tables
notebook_task:
notebook_path: src/maintenance/optimize_tables.py
new_cluster:
spark_version: "14.3.x-scala2.12"
node_type_id: "m5.xlarge"
num_workers: 1
# src/maintenance/optimize_tables.py
tables_to_optimize = [
("prod_catalog.silver.orders", ["order_date", "region"]),
("prod_catalog.silver.events", ["event_date"]),
("prod_catalog.gold.daily_metrics", []),
]
for table, z_cols in tables_to_optimize:
if z_cols:
spark.sql(f"OPTIMIZE {table} ZORDER BY ({', '.join(z_cols)})")
else:
spark.sql(f"OPTIMIZE {table}")
spark.sql(f"VACUUM {table} RETAIN 168 HOURS")
print(f"Maintained: {table}")
Output
- Unity Catalog hierarchy with env-isolated catalogs and medallion schemas
- Asset Bundle with dev/staging/prod targets and variable overrides
- Medallion pipeline (Auto Loader > MERGE > aggregations)
- RBAC grants separating engineer write from analyst read-only
- Table maintenance schedule (weekly OPTIMIZE + VACUUM)
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| Schema evolution failure | New source columns | Auto Loader handles with schemaEvolutionMode |
| Permission denied on schema | Missing USAGE on parent catalog | GRANT USAGE ON CATALOG first |
| Concurrent write conflict | Multiple jobs writing same table | max_concurrent_runs: 1 in job config |
| Cluster timeout | Long-running tasks | Set timeout_seconds per task |
Examples
Validate Data Flow
SELECT 'bronze' AS layer, COUNT(*) AS rows FROM prod_catalog.bronze.raw_events
UNION ALL SELECT 'silver', COUNT(*) FROM prod_catalog.silver.events
UNION ALL SELECT 'gold', COUNT(*) FROM prod_catalog.gold.daily_metrics;
Resources
Related skills
Databricks Core
databricks
Authenticate, configure, and explore data with Databricks CLI commands.
Databricks DABs Manager
databricks
Create, configure, and deploy Databricks Declarative Automation Bundles for dashboards, jobs, and pipelines.
Databricks Jobs
databricks
Create and deploy data engineering jobs on Databricks using notebooks, Python, SQL, or pipelines.
Databricks Pipelines
databricks
Build batch or streaming data pipelines on Databricks with Python or SQL.