
Document to Markdown Converter

Converts DOCX/PDF/PPTX to high-quality Markdown with automatic post-processing. Fixes pandoc grid tables, simple tables, image paths, CJK bold spacing, attribute noise, and code blocks. Benchmarked best-in-class (7.6/10) against Docling, MarkItDown, Pandoc raw, and Mammoth. Trigger on "convert document", "docx to markdown", "parse word", "doc to markdown", "解析word", "转换文档".

Installation

  1. Make sure Claude Code is installed and available in your terminal.

    Skills load from ~/.claude/skills/ when Claude Code starts up — so you need it on your machine first. If you don't have it yet, install it once with the command below, then run claude in any terminal to verify.

    One-time setup
    npm i -g @anthropic-ai/claude-code

    Already have it? Skip ahead.

  2. Paste the install command into Claude Code or your terminal.
    Install
    git clone https://github.com/daymade/claude-code-skills.git /tmp/daymade__claude-code-skills && mkdir -p ~/.claude/skills/doc-to-markdown-daymade && cp -r /tmp/daymade__claude-code-skills/daymade-docs/doc-to-markdown/. ~/.claude/skills/doc-to-markdown-daymade/

    This copies the whole skill folder into ~/.claude/skills/doc-to-markdown-daymade/ — the SKILL.md plus any scripts, reference docs, or templates the skill ships with. Safe default: works for every skill.

    Faster alternative (instruction-only skills)

    Skips the clone and grabs only the SKILL.md file. Don't use this if the skill ships Python scripts, reference markdowns, or asset templates — they won't be downloaded and the skill will fail when it tries to load them.

    Quick install (SKILL.md only)
    mkdir -p ~/.claude/skills/doc-to-markdown-daymade && curl -fsSL https://raw.githubusercontent.com/daymade/claude-code-skills/main/daymade-docs/doc-to-markdown/SKILL.md -o ~/.claude/skills/doc-to-markdown-daymade/SKILL.md

  3. Restart Claude Code.

    Quit and reopen Claude Code (or any other agent that loads from ~/.claude/skills/). New skills are picked up on startup.

  4. Just ask Claude.

    Skills auto-activate when your request matches the skill's description, so no slash command is needed. Trigger phrases live in the skill's own frontmatter; you can read them in the description at the top of this page.

What this skill does

Doc to Markdown

Convert documents to high-quality markdown with intelligent multi-tool orchestration and automatic DOCX post-processing.

Architecture: Pandoc (best-in-class extraction) + 8 post-processing fixes (our value-add).

Quick Start

# DOCX → Markdown (one command, zero manual fixes)
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.docx -o output.md --assets-dir ./media

# PDF → Markdown
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md

# Run tests
uv run --with pytest pytest scripts/test_convert.py -v

Dual Mode

| Mode | Speed | Quality | Use Case |
|---|---|---|---|
| Quick (default) | Fast | Good | Drafts, simple documents |
| Heavy | Slower | Best | Final documents, complex layouts |

Tool Selection

| Format | Quick Mode | Heavy Mode |
|---|---|---|
| PDF | pymupdf4llm | pymupdf4llm + markitdown |
| DOCX | pandoc + post-processing | pandoc + markitdown |
| PPTX | markitdown | markitdown + pandoc |
| XLSX | markitdown | markitdown |
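
The table above can be read as a simple lookup from file extension and mode to a list of tools. The sketch below is only an illustration of that mapping: the `TOOLS` dict and the `select_tools` function are hypothetical names, not taken from convert.py, and the real script may choose tools differently.

```python
from pathlib import Path

# Tool lists copied from the Tool Selection table. The dict layout and the
# function name are illustrative assumptions, not the layout of convert.py.
TOOLS = {
    ".pdf": {"quick": ["pymupdf4llm"], "heavy": ["pymupdf4llm", "markitdown"]},
    ".docx": {"quick": ["pandoc"], "heavy": ["pandoc", "markitdown"]},
    ".pptx": {"quick": ["markitdown"], "heavy": ["markitdown", "pandoc"]},
    ".xlsx": {"quick": ["markitdown"], "heavy": ["markitdown"]},
}


def select_tools(path: str, mode: str = "quick") -> list[str]:
    """Return the tools the table lists for this file type and mode."""
    suffix = Path(path).suffix.lower()
    if suffix not in TOOLS:
        raise ValueError(f"unsupported format: {suffix or path}")
    if mode not in ("quick", "heavy"):
        raise ValueError(f"unknown mode: {mode}")
    return TOOLS[suffix][mode]


if __name__ == "__main__":
    print(select_tools("report.docx"))          # Quick mode is the default
    print(select_tools("slides.pptx", "heavy"))
```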

DOCX Post-Processing (automatic)

When converting DOCX via pandoc, 8 cleanups are applied automatically:

| Problem | Fix | Test coverage |
|---|---|---|
| Grid tables (+:---+) | Single-column → blockquote, multi-column → pipe table | TestPostprocessPipeline |
| Simple tables ( ---- ----) | Multi-column images → pipe table with captions | TestSimpleTable |
| Image path nesting (media/media/) | Flatten to media/, absolute → relative | test_stats_tracking |
| Pandoc attributes ({width="..."}) | Removed | test_pandoc_attributes_removed |
| CJK bold spacing (**粗体**中文) | Add space around ** for CJK bold spans | TestCjkBoldSpacing (15 cases) |
| Indented dashed code blocks | → fenced ``` with language detection | test_code_block_with_language |
| Escaped brackets (\[...\]) | [...] | test_escaped_brackets_fixed |
| Double-bracket links ([[text]](url)) | [text](url) | test_double_bracket_links_fixed |
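
Four of the simpler cleanups in the table (pandoc attributes, nested media paths, escaped brackets, double-bracket links) can be approximated with regular expressions. The sketch below is a minimal illustration under that assumption. The function names and patterns are hypothetical and are not the implementation in convert.py; for example, it only strips width/height attributes and does not convert absolute image paths to relative ones.

```python
import re


def strip_pandoc_attributes(text: str) -> str:
    """Drop attribute blocks such as {width="3in" height="2in"}."""
    return re.sub(r'\{(?:\s*(?:width|height)="[^"]*")+\s*\}', "", text)


def flatten_media_paths(text: str) -> str:
    """Collapse repeated media/ prefixes (media/media/x.png -> media/x.png)."""
    return re.sub(r"(?:media/){2,}", "media/", text)


def unescape_brackets(text: str) -> str:
    r"""Turn \[ and \] back into plain [ and ]."""
    return re.sub(r"\\([\[\]])", r"\1", text)


def fix_double_bracket_links(text: str) -> str:
    """Rewrite [[text]](url) as [text](url)."""
    return re.sub(r"\[\[([^\[\]]+)\]\]\(([^)\s]+)\)", r"[\1](\2)", text)


def postprocess(text: str) -> str:
    """Apply the four illustrative fixes in a fixed order."""
    for fix in (
        strip_pandoc_attributes,
        flatten_media_paths,
        unescape_brackets,
        fix_double_bracket_links,
    ):
        text = fix(text)
    return text


if __name__ == "__main__":
    print(postprocess('![](media/media/image1.png){width="3in" height="2in"}'))
    print(postprocess(r"see \[1\] and [[docs]](https://example.com)"))
```

Unescaping brackets before fixing links means an escaped double-bracket link is also normalized in a single pass.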

CJK Bold Spacing — why and how

DOCX uses run-level styling (no spaces between bold/normal runs in CJK text). Markdown renderers need whitespace around ** to recognize bold boundaries.

Rule: if a **content** span contains any CJK character, ensure both sides have a space — unless already spaced or at line boundary. This handles CJK punctuation, emoji adjacency, and mixed content.

Before: 打开**飞书**,就可以    → some renderers fail to bold
After:  打开 **飞书** ,就可以  → universally renders correctly
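
The rule above can be sketched in a few lines of Python. This is an illustration only, not the code in convert.py: the function name `space_cjk_bold`, the Unicode ranges used to detect CJK characters, and the line-at-a-time processing are all assumptions.

```python
import re

# Assumed CJK ranges: CJK Unified Ideographs (plus Extension A), kana, Hangul.
CJK = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")
# A bold span whose content neither starts nor ends with whitespace.
BOLD = re.compile(r"\*\*(?=\S)(.+?)(?<=\S)\*\*")


def space_cjk_bold(line: str) -> str:
    """Pad bold spans that contain CJK text with a space on each side."""
    result = ""
    pos = 0
    for m in BOLD.finditer(line):
        result += line[pos:m.start()]
        if CJK.search(m.group(1)):
            # Leading space, unless at line start or already spaced.
            if result and not result[-1].isspace():
                result += " "
            result += m.group(0)
            # Trailing space, unless at line end or already spaced.
            if m.end() < len(line) and not line[m.end()].isspace():
                result += " "
        else:
            result += m.group(0)  # non-CJK bold is left untouched
        pos = m.end()
    return result + line[pos:]


if __name__ == "__main__":
    print(space_cjk_bold("打开**飞书**,就可以"))
```

Applied to the Before line above, this sketch produces the After line; bold spans with no CJK content and spans that are already spaced pass through unchanged.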

Heavy Mode Workflow

Heavy Mode runs multiple tools in parallel and selects the best segments:

  1. Parallel Execution: Run all applicable tools simultaneously
  2. Segment Analysis: Parse each output into segments (tables, headings, images, paragraphs)
  3. Quality Scoring: Score each segment based on completeness and structure
  4. Intelligent Merge: Select best version of each segment across tools

Merge Criteria

| Segment Type | Selection Criteria |
|---|---|
| Tables | More rows/columns, proper header separator |
| Images | Alt text present, local paths preferred |
| Headings | Proper hierarchy, appropriate length |
| Lists | More items, nested structure preserved |
| Paragraphs | Content completeness |
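
As a rough illustration of the table criterion (more rows/columns, proper header separator), the sketch below scores pipe-table segments from different tools and keeps the highest-scoring one. The scoring formula, the bonus of 10 for a header separator, and the names `score_table` and `pick_best` are assumptions made for this example; the actual weights in merge_outputs.py are not documented here.

```python
import re

# Matches a pipe-table separator row such as |---|:---:| or | --- | --- |
SEPARATOR = re.compile(r"\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?")


def score_table(segment: str) -> int:
    """Score a pipe table: cell count plus a bonus for a header separator."""
    rows = [ln.strip() for ln in segment.splitlines() if ln.strip().startswith("|")]
    if not rows:
        return 0
    has_separator = any(SEPARATOR.fullmatch(ln) for ln in rows)
    columns = max(ln.strip("|").count("|") + 1 for ln in rows)
    # The bonus value is arbitrary; it only needs to favor well-formed tables.
    return len(rows) * columns + (10 if has_separator else 0)


def pick_best(candidates: dict[str, str]) -> tuple[str, str]:
    """Return (tool_name, segment) for the highest-scoring candidate."""
    tool = max(candidates, key=lambda name: score_table(candidates[name]))
    return tool, candidates[tool]


if __name__ == "__main__":
    outputs = {
        "pandoc": "| A | B |\n|---|---|\n| 1 | 2 |\n| 3 | 4 |",
        "markitdown": "| A | B |\n| 1 | 2 |",
    }
    tool, table = pick_best(outputs)
    print(f"selected table from {tool}")
```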

Image Extraction

# Extract images with metadata
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf -o ./extracted-images

# Generate markdown references file
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf --markdown refs.md

Output:

  • Images: extracted-images/img_page1_1.png, extracted-images/img_page2_1.jpg
  • Metadata: extracted-images/images_metadata.json (page, position, dimensions)

Quality Validation

# Validate conversion quality
uv run --with pymupdf scripts/validate_output.py document.pdf output.md

# Generate HTML report
uv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html

Quality Metrics

| Metric | Pass | Warn | Fail |
|---|---|---|---|
| Text Retention | >95% | 85-95% | <85% |
| Table Retention | 100% | 90-99% | <90% |
| Image Retention | 100% | 80-99% | <80% |
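
The thresholds above translate directly into a small classifier. The sketch below is a hedged example of that mapping; the `classify` function and `THRESHOLDS` dict are hypothetical names, and treating fractional values between the listed bands (for example 99.5% table retention) as "warn" is an assumption, since the table does not say how validate_output.py handles them.

```python
# (pass threshold, warn threshold) per metric, taken from the table above.
THRESHOLDS = {
    "text": (95.0, 85.0),
    "table": (100.0, 90.0),
    "image": (100.0, 80.0),
}


def classify(metric: str, retention_pct: float) -> str:
    """Map a retention percentage to 'pass', 'warn', or 'fail'."""
    pass_at, warn_at = THRESHOLDS[metric]
    # Text passes strictly above 95%; tables and images pass only at 100%.
    if metric == "text":
        passed = retention_pct > pass_at
    else:
        passed = retention_pct >= pass_at
    if passed:
        return "pass"
    return "warn" if retention_pct >= warn_at else "fail"


if __name__ == "__main__":
    for metric, value in [("text", 96.2), ("table", 99.0), ("image", 79.9)]:
        print(metric, value, classify(metric, value))
```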

Merge Outputs Manually

# Merge multiple markdown files
python scripts/merge_outputs.py output1.md output2.md -o merged.md

# Show segment attribution
python scripts/merge_outputs.py output1.md output2.md -o merged.md --verbose

Path Conversion (Windows/WSL)

# Windows to WSL conversion
python scripts/convert_path.py "C:\Users\<windows-user>\Documents\file.pdf"
# Output: /mnt/c/Users/<windows-user>/Documents/file.pdf
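
The conversion shown above amounts to lowercasing the drive letter, prefixing /mnt/, and switching backslashes to forward slashes. A minimal sketch of that idea follows; `windows_to_wsl` is a hypothetical function name, the user name `alice` is a placeholder, and convert_path.py may handle more cases (UNC paths, relative paths) than this example does.

```python
import re


def windows_to_wsl(path: str) -> str:
    """Convert a drive-letter Windows path to its /mnt/<drive>/ WSL form."""
    m = re.fullmatch(r"([A-Za-z]):[\\/](.*)", path)
    if not m:
        raise ValueError(f"not an absolute drive-letter path: {path!r}")
    drive, rest = m.groups()
    return f"/mnt/{drive.lower()}/" + rest.replace("\\", "/")


if __name__ == "__main__":
    # "alice" is a placeholder user name for illustration.
    print(windows_to_wsl(r"C:\Users\alice\Documents\file.pdf"))
```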

Common Issues

"No conversion tools available"

# Install all tools (brew is the macOS/Homebrew route; on other systems install pandoc with your package manager)
pip install pymupdf4llm
uv tool install "markitdown[pdf]"
brew install pandoc

FontBBox warnings during PDF conversion

  • These are harmless font-parsing warnings; the output is still correct

Images missing from output

  • Use Heavy Mode for better image preservation
  • Or extract separately with scripts/extract_pdf_images.py

Tables broken in output

  • Use Heavy Mode - it selects the most complete table version
  • Or validate with scripts/validate_output.py

Bundled Scripts

| Script | Purpose |
|---|---|
| convert.py | Main orchestrator with Quick/Heavy mode + DOCX post-processing |
| test_convert.py | 31 tests covering all post-processing functions |
| merge_outputs.py | Merge multiple markdown outputs |
| validate_output.py | Quality validation with HTML report |
| extract_pdf_images.py | PDF image extraction with metadata |
| convert_path.py | Windows to WSL path converter |

References

  • references/benchmark-2026-03-22.md - 5-tool benchmark (Docling/MarkItDown/Pandoc/Mammoth/ours)
  • references/heavy-mode-guide.md - Detailed Heavy Mode documentation
  • references/tool-comparison.md - Tool capabilities comparison
  • references/conversion-examples.md - Batch operation examples

Next Step: Clean Up Converted Content

After converting documents to markdown, suggest cleanup:

Conversion complete: [N] files converted to markdown.

Options:
A) Clean up docs — run /docs-cleaner to consolidate redundant content (Recommended if multiple files)
B) Check facts — run /fact-checker to verify claims in the converted content
C) No thanks — the markdown conversion is sufficient
