
Audio/Video Transcription

Transcribes audio and video files to text using Qwen3-ASR. Supports two modes: local MLX inference on macOS Apple Silicon (no API key, 15-27x realtime) and remote API via vLLM/OpenAI-compatible endpoints. Auto-detects platform and recommends the best path. Triggers when the user wants to transcribe recordings, convert audio/video to text, do speech-to-text, or mentions ASR, Qwen ASR, 转录 (transcribe), 语音转文字 (speech-to-text), 录音转文字 (recording-to-text). Also triggers for meeting recordings, lectures, interviews, podcasts, screen recordings, or any audio/video file the user wants converted to text.

Installation

  1. Make sure Claude Code is installed and available in your terminal.

    Skills load from ~/.claude/skills/ when Claude Code starts up — so you need it on your machine first. If you don't have it yet, install it once with the command below, then run claude in any terminal to verify.

    One-time setup
    npm i -g @anthropic-ai/claude-code

    Already have it? Skip ahead.

  2. Paste the install command into Claude Code or your terminal.
    Install
    git clone https://github.com/daymade/claude-code-skills.git /tmp/daymade__claude-code-skills && mkdir -p ~/.claude/skills/asr-transcribe-to-text-daymade && cp -r /tmp/daymade__claude-code-skills/asr-transcribe-to-text/. ~/.claude/skills/asr-transcribe-to-text-daymade/

    This copies the whole skill folder into ~/.claude/skills/asr-transcribe-to-text-daymade/ — the SKILL.md plus any scripts, reference docs, or templates the skill ships with. Safe default: works for every skill.

    Faster alternative (instruction-only skills)

    Skips the clone and grabs only the SKILL.md file. Don't use this for skills that ship Python scripts, reference markdowns, or asset templates; those files won't be downloaded and the skill will fail when it tries to load them. This skill does ship scripts and reference docs (see Bundled Resources below), so prefer the full install above.

    Quick install (SKILL.md only)
    mkdir -p ~/.claude/skills/asr-transcribe-to-text-daymade && curl -fsSL https://raw.githubusercontent.com/daymade/claude-code-skills/main/asr-transcribe-to-text/SKILL.md -o ~/.claude/skills/asr-transcribe-to-text-daymade/SKILL.md
  3. Restart Claude Code.

    Quit and reopen Claude Code (or any other agent that loads from ~/.claude/skills/). New skills are picked up on startup.

  4. Just ask Claude.

    Skills auto-activate when your request matches the skill's description; no slash command needed. Trigger phrases live in the skill's own frontmatter; you can read them in the description at the top of this page.

Prefer to read the source first? Browse the repository at https://github.com/daymade/claude-code-skills.

What this skill does

ASR Transcribe to Text

Transcribe audio/video files to text using Qwen3-ASR. Two inference paths:

Mode          When                                       Speed            Cost
Local MLX     macOS Apple Silicon                        15-27x realtime  Free
Remote API    Any platform, or when local unavailable    Depends on GPU   API/self-hosted

Configuration persists in ${CLAUDE_PLUGIN_DATA}/config.json.

Step 0: Detect Platform and Load Config

cat "${CLAUDE_PLUGIN_DATA}/config.json" 2>/dev/null

If config exists, read values and proceed to Step 1.

If config does not exist, auto-detect platform first:

python3 -c "
import sys, platform
is_mac_arm = sys.platform == 'darwin' and platform.machine() in ('arm64', 'aarch64')
print(f'Platform: {sys.platform} {platform.machine()}')
print(f'Apple Silicon: {is_mac_arm}')
if is_mac_arm:
    print('RECOMMEND: local-mlx')
else:
    print('RECOMMEND: remote-api')
"

Then use AskUserQuestion with platform-aware defaults:

For macOS Apple Silicon (recommended: local):

ASR setup — your Mac has Apple Silicon, so local transcription is recommended.

Q1: Transcription mode?
  A) Local MLX — runs on your Mac's GPU, no API key needed, 15-27x realtime (Recommended)
  B) Remote API — send audio to a server (vLLM, Tailscale workstation, etc.)

Q2: Does your network have an HTTP proxy that might intercept traffic?
  A) Yes — bypass proxy for ASR traffic (Recommended if using Shadowrocket/Clash)
  B) No — direct connection

For other platforms (recommended: remote):

ASR setup — local MLX requires macOS Apple Silicon. Using remote API mode.

Q1: ASR Endpoint URL?
  A) http://workstation-4090-wsl:8002/v1/audio/transcriptions (Qwen3-ASR vLLM via Tailscale)
  B) http://localhost:8002/v1/audio/transcriptions (Local server)
  C) Custom URL

Q2: Proxy bypass needed?
  A) Yes (Recommended for Shadowrocket/Clash/corporate proxy)
  B) No

Save config:

mkdir -p "${CLAUDE_PLUGIN_DATA}"
python3 -c "
import json
config = {
    'mode': 'MODE',           # 'local-mlx' or 'remote-api'
    'model': 'MODEL_ID',      # local: 'mlx-community/Qwen3-ASR-1.7B-8bit', remote: 'Qwen/Qwen3-ASR-1.7B'
    'max_tokens': 200000,     # local only, critical for long audio
    'endpoint': 'URL',        # remote only
    'noproxy': True,
    'max_timeout': 900        # remote only
}
with open('${CLAUDE_PLUGIN_DATA}/config.json', 'w') as f:
    json.dump(config, f, indent=2)
print('Config saved.')
"

Step 1: Extract Audio (if input is video)

For video files (mp4, mov, mkv, avi, webm), extract as 16kHz mono WAV:

ffmpeg -i INPUT_VIDEO -vn -acodec pcm_s16le -ar 16000 -ac 1 OUTPUT.wav -y

Audio files (wav, mp3, m4a, flac, ogg) can be used directly. Get duration:

ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 INPUT_FILE

Cleanup: After transcription succeeds, delete extracted WAV files to save disk space.
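
A minimal sketch of this step in Python, assuming ffmpeg and ffprobe are on PATH (the helper names are illustrative, not part of the skill):

import subprocess
from pathlib import Path

VIDEO_EXTS = {'.mp4', '.mov', '.mkv', '.avi', '.webm'}

def prepare_audio(path: str) -> str:
    """Return 16kHz mono audio, extracting a WAV first if the input is video."""
    src = Path(path)
    if src.suffix.lower() not in VIDEO_EXTS:
        return str(src)  # audio files (wav, mp3, m4a, flac, ogg) pass through
    wav = src.with_suffix('.wav')
    subprocess.run(['ffmpeg', '-i', str(src), '-vn', '-acodec', 'pcm_s16le',
                    '-ar', '16000', '-ac', '1', str(wav), '-y'], check=True)
    return str(wav)

def duration_seconds(path: str) -> float:
    """Media duration via ffprobe, used for the Step 3 plausibility check."""
    out = subprocess.run(['ffprobe', '-v', 'error', '-show_entries', 'format=duration',
                          '-of', 'default=noprint_wrappers=1:nokey=1', path],
                         capture_output=True, text=True, check=True)
    return float(out.stdout.strip())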

Step 2: Transcribe

Path A: Local MLX (macOS Apple Silicon)

Use the bundled script — it handles model loading, chunking, and the critical max_tokens parameter:

uv run ${CLAUDE_PLUGIN_ROOT}/scripts/transcribe_local_mlx.py \
  INPUT_AUDIO [INPUT_AUDIO2 ...] \
  --output-dir OUTPUT_DIR

The script loads the model once and transcribes all files sequentially (no GPU contention). For details on performance, model compatibility, and the max_tokens truncation issue, see references/local_mlx_guide.md.

Critical: The upstream mlx-audio default max_tokens=8192 silently truncates audio longer than ~40 minutes. The bundled script defaults to 200000. If calling model.generate() directly, always pass max_tokens=200000.
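
If you do call the model directly, the call looks roughly like this. Treat it as a hypothetical sketch: the import path and result field are assumptions, and only the max_tokens requirement comes from this guide.

# Hypothetical sketch; import path and result shape are assumptions.
from mlx_audio.stt.utils import load_model  # ASSUMED loader location

model = load_model('mlx-community/Qwen3-ASR-1.7B-8bit')
# The one documented requirement: never rely on the default max_tokens=8192.
result = model.generate('meeting.wav', max_tokens=200000)
print(result.text)  # ASSUMED result field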

Path B: Remote API

Health check first (skip if already verified this session):

python3 -c "
import json, subprocess, sys
with open('${CLAUDE_PLUGIN_DATA}/config.json') as f:
    cfg = json.load(f)
base = cfg['endpoint'].rsplit('/audio/', 1)[0]
noproxy = ['--noproxy', '*'] if cfg.get('noproxy', True) else []
result = subprocess.run(
    ['curl', '-s', '--max-time', '10'] + noproxy + [f'{base}/models'],
    capture_output=True, text=True
)
if result.returncode != 0 or not result.stdout.strip():
    print(f'HEALTH CHECK FAILED: {base}/models', file=sys.stderr)
    sys.exit(1)
print(f'Service healthy: {base}')
"

Read config and send via curl:

python3 -c "
import json, subprocess, sys, os, tempfile
with open('${CLAUDE_PLUGIN_DATA}/config.json') as f:
    cfg = json.load(f)
noproxy = ['--noproxy', '*'] if cfg.get('noproxy', True) else []
timeout = str(cfg.get('max_timeout', 900))
audio_file = 'AUDIO_FILE_PATH'
output_json = tempfile.mktemp(suffix='.json', prefix='asr_')

result = subprocess.run(
    ['curl', '-s', '--max-time', timeout] + noproxy + [
        cfg['endpoint'],
        '-F', f'file=@{audio_file}',
        '-F', f'model={cfg[\"model\"]}',
        '-o', output_json
    ], capture_output=True, text=True
)
if result.returncode != 0:
    print(f'ERROR: curl exited {result.returncode}: {result.stderr[:300]}', file=sys.stderr)
    sys.exit(1)

with open(output_json) as f:
    data = json.load(f)
if 'text' not in data:
    print(f'ERROR: {json.dumps(data)[:300]}', file=sys.stderr)
    sys.exit(1)
text = data['text']
print(f'Transcribed: {len(text)} chars', file=sys.stderr)
print(text)
os.unlink(output_json)
" > OUTPUT.txt

If remote health check fails, diagnose in order:

  1. Network: ping -c 1 HOST or tailscale status | grep HOST
  2. Service: tailscale ssh USER@HOST "curl -s localhost:PORT/v1/models"
  3. Proxy: retry with --noproxy '*' toggled

Step 3: Verify Output

After transcription, check for truncation, the most common failure mode (a sketch of these checks follows the list):

  1. Confirm output is not empty
  2. Check character count is plausible (~400 chars/min for Chinese, ~200 words/min for English)
  3. Check the ending — does it trail off mid-sentence? If so, max_tokens was exhausted
  4. Show user the first and last ~200 characters as preview
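
A sketch of these checks in Python (the per-minute rates come from the heuristics above; the 50% warning cutoff is an assumption):

import sys

def check_transcript(text: str, minutes: float, lang: str = 'zh') -> None:
    """Flag likely truncation using rough per-minute output rates."""
    if not text.strip():
        sys.exit('ERROR: transcript is empty')
    if lang == 'zh':
        actual, expected, unit = len(text), 400 * minutes, 'chars'
    else:
        actual, expected, unit = len(text.split()), 200 * minutes, 'words'
    pct = 100 * actual / expected
    print(f'{actual} {unit}, ~{pct:.0f}% of the ~{expected:.0f} expected')
    if pct < 50:  # ASSUMED threshold for raising the truncation question
        print('WARNING: output may be truncated; inspect the ending below')
    print('--- first 200 chars ---\n' + text[:200])
    print('--- last 200 chars ---\n' + text[-200:])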

If truncated or wrong, use AskUserQuestion:

Transcription may be truncated:
- Expected: ~[N] chars for [M] minutes of audio
- Got: [actual] chars ([pct]% of expected)
- Last line: "[last 100 chars...]"

Options:
A) Retry with higher max_tokens (current: [N], try: [N*2])
B) Switch mode — try [local/remote] instead
C) Save as-is — the output looks complete to me
D) Abort

Step 4: Fallback — Overlap-Merge (Remote API Only)

If a single remote request fails (timeout, OOM), fall back to chunked transcription:

python3 ${CLAUDE_PLUGIN_ROOT}/scripts/overlap_merge_transcribe.py \
  --config "${CLAUDE_PLUGIN_DATA}/config.json" \
  INPUT_AUDIO OUTPUT.txt

The script splits audio into 18-minute chunks with a 2-minute overlap, then merges the transcripts using punctuation-stripped fuzzy matching. See references/overlap_merge_strategy.md for algorithm details; a rough sketch of the merge step follows.
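
As a rough sketch of that merge step (simplified: it matches raw text in a fixed window, whereas the real script strips punctuation before matching; the window size and match threshold here are assumptions):

import difflib

def merge_pair(left: str, right: str, window: int = 2000) -> str:
    """Join two overlapping chunk transcripts at their shared region."""
    tail, head = left[-window:], right[:window]
    m = difflib.SequenceMatcher(None, tail, head).find_longest_match(
        0, len(tail), 0, len(head))
    if m.size < 20:  # no confident overlap; fall back to plain concatenation
        return left + right
    # Keep left through the end of the matched region, then the rest of right.
    cut_left = len(left) - len(tail) + m.a + m.size
    return left[:cut_left] + right[m.b + m.size:]

# e.g. full_text = merge_pair(chunk1_text, chunk2_text)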

For local MLX mode, overlap-merge is unnecessary — the bundled script handles chunking internally with max_tokens=200000.

Step 5: Recommend Transcript Correction

ASR output always contains recognition errors — homophones, garbled technical terms, broken sentences. After successful transcription, proactively suggest running the transcript-fixer skill on the output:

Transcription complete: [N] chars saved to [output_path].

ASR output typically contains recognition errors (homophones, garbled terms, broken sentences).
Would you like me to run /transcript-fixer to clean up the text?

Options:
A) Yes — run transcript-fixer on the output now (Recommended)
B) No — the raw transcription is good enough for my needs
C) Later — I'll run it myself when ready

If the user chooses A, invoke the transcript-fixer skill with the output file path. The two skills form a natural pipeline: transcribe → correct → review.

Reconfigure

rm "${CLAUDE_PLUGIN_DATA}/config.json"

Then re-run Step 0.

Bundled Resources

Scripts:

  • transcribe_local_mlx.py — Local MLX transcription (macOS ARM64, PEP 723 deps)
  • overlap_merge_transcribe.py — Chunked transcription with overlap merge (remote API fallback)

References:

  • local_mlx_guide.md — Performance benchmarks, max_tokens truncation, model compatibility
  • overlap_merge_strategy.md — Why naive chunking fails, fuzzy merge algorithm

Related skills

• Skill Builder & Optimizer (anthropics, Official): Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.

• Org Change Management (alirezarezvani, MIT): Framework for rolling out organizational changes without chaos. Covers the ADKAR model adapted for startups, communication templates, resistance patterns, and change fatigue management. Handles process changes, org restructures, strategy pivots, and culture changes. Use when announcing a reorg, switching tools, pivoting strategy, killing a product, changing leadership, or when the user mentions change management, change rollout, managing resistance, org change, reorg, or pivot communication.

• Claude Export Conversation Fixer (daymade): Fixes broken line wrapping in Claude Code exported conversation files (.txt), reconstructing tables, paragraphs, paths, and tool calls that were hard-wrapped at fixed column widths. Includes an automated validation suite (generic, file-agnostic checks). Triggers when the user has a Claude Code export file with broken formatting, mentions "fix export", "fix conversation", "exported conversation", "make export readable", references a file matching YYYY-MM-DD-HHMMSS-*.txt, or has a .txt file with broken tables, split paths, or mangled tool output from Claude Code.

• Claude Skills Troubleshooter (daymade): Diagnose and resolve Claude Code plugin and skill issues. Use when plugins are installed but not showing in the available skills list, skills are not activating as expected, or when troubleshooting enabledPlugins configuration in settings.json. Triggers include "plugin not working", "skill not showing", "installed but disabled", or "enabledPlugins" issues.