AugmentClaude
Proprietary (Anthropic). Condensed excerpt; see LICENSE in anthropics/skills.TutoringDocs

PDF Processing

Read, extract, convert, merge, split, and create PDF files with OCR support.

Installation

  1. Make sure Claude is on your device and in your terminal.

    Skills load from ~/.claude/skills/ when Claude Code starts up — so you need it on your machine first. If you don't have it yet, install it once with the command below, then run claude in any terminal to verify.

    One-time setup
    npm i -g @anthropic-ai/claude-code

    Already have it? Skip ahead.

  2. Paste into Claude Code or into your terminal.

    This copies the whole skill folder into ~/.claude/skills/pdf-optimeta/ — the SKILL.md plus any scripts, reference docs, or templates the skill ships with. Safe default: works for every skill.

    Faster alternative (instruction-only skills)

    Skips the clone and grabs only the SKILL.md file. Don't use this if the skill ships Python scripts, reference markdowns, or asset templates — they won't be downloaded and the skill will fail when it tries to load them.

    Quick install (SKILL.md only)
    Sign up to copy
  3. Restart Claude Code.

    Quit and reopen Claude Code (or any other agent that loads from ~/.claude/skills/). New skills are picked up on startup.

  4. Just ask Claude.

    Skills auto-activate when your request matches the skill's description — no slash command needed. Trigger phrases live in the skill's own frontmatter; you can read them in the “What this skill does” section above.

Prefer to read the source first? Open on GitHub.

When Claude uses it

Use whenever the user works with PDF files — reading/extracting text from PDFs (lecture notes, textbook chapters, HW problems, HW solutions, hand-written answers), converting PDFs to markdown for downstream analysis, merging/splitting PDFs, or creating PDFs. For scanned or hand-written PDFs, OCR is required (pytesseract + pdf2image). Based on Anthropic's official PDF skill (github.com/anthropics/skills/tree/main/skills/pdf).

What this skill does

PDF Processing Guide

When to use this skill

Load this skill whenever the workflow involves PDF input or output. In the paideia context specifically:

  • Converting materials/**/*.pdf to markdown in converted/**/*.md (via /ingest)
  • Converting hand-written answer PDFs in answers/*.pdf to markdown in answers/converted/*.md (via /grade)
  • OCR for scanned lecture notes, textbook chapters, or hand-written work

Quick decision tree

What kind of PDF?
├─ Course material (materials/**/*.pdf)  → VISION pipeline (see VISION.md)
│                                          pdfplumber is unreliable on course
│                                          content — even "prose-heavy"
│                                          textbook pages mix in equations,
│                                          figures, and multi-column layouts
│                                          that break digital extraction
│                                          silently. We route everything
│                                          through vision instead of
│                                          maintaining a per-category heuristic.
├─ Hand-written answer PDF              → vision-ocr skill (see vision-ocr/)
└─ Arbitrary outside-the-plugin PDF     → pdfplumber / pypdf / pytesseract
                                          per the sections below, case-by-case

Within this plugin, /paideia:ingest routes all materials/**/*.pdf through the vision pipeline. The pdfplumber / pypdf / pytesseract blocks below remain for reference and for ad-hoc PDF work outside the ingest flow (e.g., quick text dumps, PDF merge/split, producing the cheatsheet PDF).

Core operations

Text extraction (digital PDF)

import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    text_by_page = []
    for page in pdf.pages:
        text_by_page.append(page.extract_text() or "")
full_text = "\n\n---\n\n".join(text_by_page)

Simpler alternative using pypdf:

from pypdf import PdfReader
reader = PdfReader("input.pdf")
full_text = "\n\n".join(p.extract_text() or "" for p in reader.pages)

OCR (scanned or hand-written PDF)

Install deps once:

pip install --break-system-packages pytesseract pdf2image
# Also needs system tesseract: apt-get install tesseract-ocr poppler-utils
import pytesseract
from pdf2image import convert_from_path

images = convert_from_path("scanned.pdf", dpi=200)
text = ""
for i, image in enumerate(images):
    text += f"\n\n## Page {i+1}\n\n"
    text += pytesseract.image_to_string(image, lang="eng+kor")  # multi-lang

For best OCR quality on math/physics hand-writing, use dpi=300 and consider preprocessing (deskew, binarize) with opencv before OCR.

Command-line text extraction (fast path)

# Requires: apt-get install poppler-utils
pdftotext -layout input.pdf output.txt

Merge / split

from pypdf import PdfReader, PdfWriter

# Merge
writer = PdfWriter()
for f in ["chap1.pdf", "chap2.pdf"]:
    for page in PdfReader(f).pages:
        writer.add_page(page)
with open("merged.pdf", "wb") as out:
    writer.write(out)

# Split single page
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    w = PdfWriter()
    w.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as out:
        w.write(out)

PDF creation (for producing clean cheatsheets)

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("output.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = [Paragraph("Title", styles['Title']), Spacer(1, 12)]
# Use <sub> and <super> tags, NEVER Unicode subscripts (they render as black boxes)
story.append(Paragraph("H<sub>2</sub>O and E = mc<super>2</super>", styles['Normal']))
doc.build(story)

Course-cram specific conventions

When converting PDF materials to markdown for this project:

  1. Preserve structure. Section headers (##), numbered lists, tables. Do NOT reflow paragraphs — keep line breaks roughly aligned with source for verifiability.

  2. Math formatting. Convert inline math to $...$, display math to $$...$$. If extraction produces garbled LaTeX, mark with [?] and move on — don't guess.

  3. Name convention. materials/lectures/chapter03.pdfconverted/lectures/chapter03.md. Preserve subfolder structure.

  4. Provenance markers. Prepend the output file with a source comment tagging the extraction method:

    <!-- SOURCE: materials/<cat>/<stem>.pdf, extracted <YYYY-MM-DD>, method: pdfplumber|vision|ocr -->
    

    For OCR specifically, append: accuracy may vary. Verify math expressions manually.

  5. Idempotence. If converted/X.md already exists and is newer than materials/X.pdf, skip (unless user passes --force).

  6. Default route for all materials/**/*.pdf is the vision pipeline (see VISION.md). pdfplumber was tried as a fast path for prose-heavy material and proved unreliable in practice — even textbook pages silently word-salad when they mix equations, multi-column layouts, or figure captions. Uniform vision routing is simpler and more reliable than per-category heuristics with fallbacks.

  7. Hand-written answer PDFs. Output to answers/converted/<name>.md. Expect garbled math; the grading step handles ambiguity via strategy-matching, not exact algebra.

Error patterns to watch for

  • Empty extracted text (page.extract_text() returns "") → it's scanned. Fall through to OCR.
  • Unicode subscript/superscript in reportlab → renders as solid black boxes. Use <sub>/<super> XML tags instead.
  • Protected PDFsqpdf --password=... --decrypt in.pdf out.pdf first.
  • Multi-column academic PDFs → pdfplumber's default extraction interleaves columns. Use page.extract_text(layout=True) or crop bboxes per column.
  • Image-heavy scansconvert_from_path uses a lot of memory. Set dpi=150 for first pass, re-run at 300 only if OCR quality is poor.

Dependencies

Standard install for paideia use:

pip install --break-system-packages pypdf pdfplumber pytesseract pdf2image reportlab
apt-get install -y poppler-utils tesseract-ocr tesseract-ocr-kor

The Korean language pack (tesseract-ocr-kor) is needed if the user writes solutions in Korean/Hangul.

Reference

Full skill at https://github.com/anthropics/skills/tree/main/skills/pdf with REFERENCE.md covering pypdfium2, JavaScript libraries, and FORMS.md covering PDF form filling.

Related skills