Spark Structured Streaming
Build production streaming pipelines with Kafka, joins, and real-time processing.
Installation
- Make sure Claude is on your device and in your terminal.
Skills load from
~/.claude/skills/when Claude Code starts up — so you need it on your machine first. If you don't have it yet, install it once with the command below, then runclaudein any terminal to verify.One-time setupnpm i -g @anthropic-ai/claude-codeAlready have it? Skip ahead.
- Paste into Claude Code or into your terminal.
This copies the whole skill folder into
~/.claude/skills/databricks-spark-structured-streaming/— the SKILL.md plus any scripts, reference docs, or templates the skill ships with. Safe default: works for every skill.Faster alternative (instruction-only skills)
Skips the clone and grabs only the SKILL.md file. Don't use this if the skill ships Python scripts, reference markdowns, or asset templates — they won't be downloaded and the skill will fail when it tries to load them.
Quick install (SKILL.md only)Sign up to copy - Restart Claude Code.
Quit and reopen Claude Code (or any other agent that loads from
~/.claude/skills/). New skills are picked up on startup. - Just ask Claude.
Skills auto-activate when your request matches the skill's description — no slash command needed. Trigger phrases live in the skill's own frontmatter; you can read them in the “What this skill does” section above.
Prefer to read the source first? Open on GitHub.
When Claude uses it
Comprehensive guide to Spark Structured Streaming for production workloads. Use when building streaming pipelines, working with Kafka ingestion, implementing Real-Time Mode (RTM), configuring triggers (processingTime, availableNow), handling stateful operations with watermarks, optimizing checkpoints, performing stream-stream or stream-static joins, writing to multiple sinks, or tuning streaming cost and performance.
What this skill does
Spark Structured Streaming
Production-ready streaming pipelines with Spark Structured Streaming. This skill provides navigation to detailed patterns and best practices.
Quick Start
from pyspark.sql.functions import col, from_json
# Basic Kafka to Delta streaming
df = (spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "broker:9092")
.option("subscribe", "topic")
.load()
.select(from_json(col("value").cast("string"), schema).alias("data"))
.select("data.*")
)
df.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/Volumes/catalog/checkpoints/stream") \
.trigger(processingTime="30 seconds") \
.start("/delta/target_table")
Core Patterns
| Pattern | Description | Reference |
|---|---|---|
| Kafka Streaming | Kafka to Delta, Kafka to Kafka, Real-Time Mode | See references/kafka-streaming.md |
| Stream Joins | Stream-stream joins, stream-static joins | See references/stream-stream-joins.md, references/stream-static-joins.md |
| Multi-Sink Writes | Write to multiple tables, parallel merges | See references/multi-sink-writes.md |
| Merge Operations | MERGE performance, parallel merges, optimizations | See references/merge-operations.md |
Configuration
| Topic | Description | Reference |
|---|---|---|
| Checkpoints | Checkpoint management and best practices | See references/checkpoint-best-practices.md |
| Stateful Operations | Watermarks, state stores, RocksDB configuration | See references/stateful-operations.md |
| Trigger & Cost | Trigger selection, cost optimization, RTM | See references/trigger-and-cost-optimization.md |
Best Practices
| Topic | Description | Reference |
|---|---|---|
| Production Checklist | Comprehensive best practices | See references/streaming-best-practices.md |
Production Checklist
- Checkpoint location is persistent (UC volumes, not DBFS)
- Unique checkpoint per stream
- Fixed-size cluster (no autoscaling for streaming)
- Monitoring configured (input rate, lag, batch duration)
- Exactly-once verified (txnVersion/txnAppId)
- Watermark configured for stateful operations
- Left joins for stream-static (not inner)
Related skills
Databricks Core
databricks
Authenticate, configure, and explore data with Databricks CLI commands.
Databricks DABs Manager
databricks
Create, configure, and deploy Databricks Declarative Automation Bundles for dashboards, jobs, and pipelines.
Databricks Jobs
databricks
Create and deploy data engineering jobs on Databricks using notebooks, Python, SQL, or pipelines.
Databricks Pipelines
databricks
Build batch or streaming data pipelines on Databricks with Python or SQL.