PDF Processing
Read, extract, convert, merge, split, and create PDF files with OCR support.
Installation
- Make sure Claude is on your device and in your terminal.
Skills load from ~/.claude/skills/ when Claude Code starts up, so you need it on your machine first. If you don't have it yet, install it once with the command below, then run claude in any terminal to verify.
One-time setup:
npm i -g @anthropic-ai/claude-code
Already have it? Skip ahead.
- Paste into Claude Code or into your terminal.
This copies the whole skill folder into ~/.claude/skills/pdf-optimeta/: the SKILL.md plus any scripts, reference docs, or templates the skill ships with. Safe default: works for every skill.
Faster alternative (instruction-only skills)
Skips the clone and grabs only the SKILL.md file. Don't use this if the skill ships Python scripts, reference markdowns, or asset templates: they won't be downloaded and the skill will fail when it tries to load them.
Quick install (SKILL.md only)
- Restart Claude Code.
Quit and reopen Claude Code (or any other agent that loads from ~/.claude/skills/). New skills are picked up on startup.
- Just ask Claude.
Skills auto-activate when your request matches the skill's description, so no slash command is needed. Trigger phrases live in the skill's own frontmatter; you can read them in the “What this skill does” section below.
When Claude uses it
Use whenever the user works with PDF files — reading/extracting text from PDFs (lecture notes, textbook chapters, HW problems, HW solutions, hand-written answers), converting PDFs to markdown for downstream analysis, merging/splitting PDFs, or creating PDFs. For scanned or hand-written PDFs, OCR is required (pytesseract + pdf2image). Based on Anthropic's official PDF skill (github.com/anthropics/skills/tree/main/skills/pdf).
What this skill does
PDF Processing Guide
When to use this skill
Load this skill whenever the workflow involves PDF input or output. In the paideia context specifically:
- Converting materials/**/*.pdf to markdown in converted/**/*.md (via /ingest)
- Converting hand-written answer PDFs in answers/*.pdf to markdown in answers/converted/*.md (via /grade)
- OCR for scanned lecture notes, textbook chapters, or hand-written work
Quick decision tree
What kind of PDF?
├─ Course material (materials/**/*.pdf) → VISION pipeline (see VISION.md)
│    pdfplumber is unreliable on course content: even "prose-heavy"
│    textbook pages mix in equations, figures, and multi-column layouts
│    that break digital extraction silently. We route everything through
│    vision instead of maintaining a per-category heuristic.
├─ Hand-written answer PDF → vision-ocr skill (see vision-ocr/)
└─ Arbitrary outside-the-plugin PDF → pdfplumber / pypdf / pytesseract
     per the sections below, case-by-case
Within this plugin, /paideia:ingest routes all materials/**/*.pdf through the vision pipeline. The pdfplumber / pypdf / pytesseract blocks below remain for reference and for ad-hoc PDF work outside the ingest flow (e.g., quick text dumps, PDF merge/split, producing the cheatsheet PDF).
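The decision tree above can be sketched as a small routing function. This is only an illustration under stated assumptions: the function name `route_pdf` and the return labels `"vision"`, `"vision-ocr"`, and `"generic"` are hypothetical and are not taken from the plugin. It also assumes paths are given relative to the project root.

```python
from pathlib import PurePosixPath


def route_pdf(path):
    """Pick a processing route for a PDF path (hypothetical labels).

    Assumes `path` is relative to the project root and uses "/" separators.
    """
    parts = PurePosixPath(path).parts
    # Everything under materials/ goes through the vision pipeline.
    if parts and parts[0] == "materials":
        return "vision"
    # Hand-written answers sit directly in answers/ (answers/*.pdf).
    if len(parts) == 2 and parts[0] == "answers":
        return "vision-ocr"
    # Anything else is handled case-by-case with pdfplumber / pypdf / pytesseract.
    return "generic"
```

For example, `route_pdf("materials/lectures/chapter03.pdf")` selects the vision route, while a PDF outside both folders falls through to the generic tools described in the sections below.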
Core operations
Text extraction (digital PDF)
import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    text_by_page = []
    for page in pdf.pages:
        text_by_page.append(page.extract_text() or "")

full_text = "\n\n---\n\n".join(text_by_page)
Simpler alternative using pypdf:
from pypdf import PdfReader
reader = PdfReader("input.pdf")
full_text = "\n\n".join(p.extract_text() or "" for p in reader.pages)
OCR (scanned or hand-written PDF)
Install deps once:
pip install --break-system-packages pytesseract pdf2image
# Also needs system tesseract: apt-get install tesseract-ocr poppler-utils
import pytesseract
from pdf2image import convert_from_path

images = convert_from_path("scanned.pdf", dpi=200)
text = ""
for i, image in enumerate(images):
    text += f"\n\n## Page {i+1}\n\n"
    text += pytesseract.image_to_string(image, lang="eng+kor")  # multi-lang
For best OCR quality on math/physics hand-writing, use dpi=300 and consider preprocessing (deskew, binarize) with opencv before OCR.
Command-line text extraction (fast path)
# Requires: apt-get install poppler-utils
pdftotext -layout input.pdf output.txt
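If the fast path needs to be called from Python, one option is a thin `subprocess` wrapper. The sketch below is an assumption-laden illustration, not part of the skill: the helper names `pdftotext_cmd` and `run_pdftotext` are hypothetical. It uses the `-layout` flag from the command above plus pdftotext's `-f` / `-l` options for first and last page.

```python
import shutil
import subprocess


def pdftotext_cmd(src, dst, first_page=None, last_page=None):
    """Build the pdftotext argument list (hypothetical helper)."""
    cmd = ["pdftotext", "-layout"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert (1-based)
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to convert
    return cmd + [str(src), str(dst)]


def run_pdftotext(src, dst, first_page=None, last_page=None):
    """Run pdftotext, failing early if poppler-utils is not installed."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    subprocess.run(pdftotext_cmd(src, dst, first_page, last_page), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.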
Merge / split
from pypdf import PdfReader, PdfWriter

# Merge
writer = PdfWriter()
for f in ["chap1.pdf", "chap2.pdf"]:
    for page in PdfReader(f).pages:
        writer.add_page(page)
with open("merged.pdf", "wb") as out:
    writer.write(out)

# Split single page
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    w = PdfWriter()
    w.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as out:
        w.write(out)
PDF creation (for producing clean cheatsheets)
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet
doc = SimpleDocTemplate("output.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = [Paragraph("Title", styles['Title']), Spacer(1, 12)]
# Use <sub> and <super> tags, NEVER Unicode subscripts (they render as black boxes)
story.append(Paragraph("H<sub>2</sub>O and E = mc<super>2</super>", styles['Normal']))
doc.build(story)
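When the source text already contains Unicode subscripts or superscripts (for example, text pasted from a converted markdown file), a small pre-pass can rewrite them into the tags shown above before the string reaches `Paragraph`. This is a minimal sketch, assuming only digits and plus/minus signs need handling. The helper name `to_reportlab_markup` is hypothetical, and it does not escape `&` or `<` in the surrounding text.

```python
import re

# Unicode subscript/superscript characters mapped to their plain ASCII forms.
_SUB = dict(zip("₀₁₂₃₄₅₆₇₈₉₊₋", "0123456789+-"))
_SUP = dict(zip("⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻", "0123456789+-"))


def _retag(text, table, tag):
    """Wrap each run of characters from `table` in <tag>...</tag>."""
    pattern = "[" + "".join(table) + "]+"

    def repl(match):
        plain = "".join(table[ch] for ch in match.group())
        return f"<{tag}>{plain}</{tag}>"

    return re.sub(pattern, repl, text)


def to_reportlab_markup(text):
    """Replace Unicode sub/superscripts with <sub>/<super> tags."""
    return _retag(_retag(text, _SUB, "sub"), _SUP, "super")
```

Consecutive characters are grouped into a single tag, so a two-digit exponent becomes one `<super>` element rather than two.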
Course-cram specific conventions
When converting PDF materials to markdown for this project:
- Preserve structure. Section headers (##), numbered lists, tables. Do NOT reflow paragraphs; keep line breaks roughly aligned with source for verifiability.
- Math formatting. Convert inline math to $...$, display math to $$...$$. If extraction produces garbled LaTeX, mark with [?] and move on; don't guess.
- Name convention. materials/lectures/chapter03.pdf → converted/lectures/chapter03.md. Preserve subfolder structure.
- Provenance markers. Prepend the output file with a source comment tagging the extraction method:
  <!-- SOURCE: materials/<cat>/<stem>.pdf, extracted <YYYY-MM-DD>, method: pdfplumber|vision|ocr -->
  For OCR specifically, append: accuracy may vary. Verify math expressions manually.
- Idempotence. If converted/X.md already exists and is newer than materials/X.pdf, skip (unless user passes --force).
- Default route for all materials/**/*.pdf is the vision pipeline (see VISION.md). pdfplumber was tried as a fast path for prose-heavy material and proved unreliable in practice: even textbook pages silently turn into word salad when they mix equations, multi-column layouts, or figure captions. Uniform vision routing is simpler and more reliable than per-category heuristics with fallbacks.
- Hand-written answer PDFs. Output to answers/converted/<name>.md. Expect garbled math; the grading step handles ambiguity via strategy-matching, not exact algebra.
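The name convention, provenance marker, and idempotence rules above can be sketched with the standard library alone. The function names here (`converted_path`, `needs_conversion`, `provenance_comment`) are hypothetical, and the sketch assumes paths relative to the project root. It does not cover the extra OCR note.

```python
from datetime import date
from pathlib import Path


def converted_path(pdf_path, materials_root="materials", converted_root="converted"):
    """Map materials/<sub>/<stem>.pdf to converted/<sub>/<stem>.md."""
    rel = Path(pdf_path).relative_to(materials_root)
    return Path(converted_root) / rel.with_suffix(".md")


def needs_conversion(pdf_path, md_path, force=False):
    """Return False only when the markdown exists and is newer than the PDF."""
    pdf_path, md_path = Path(pdf_path), Path(md_path)
    if force or not md_path.exists():
        return True
    return md_path.stat().st_mtime <= pdf_path.stat().st_mtime


def provenance_comment(pdf_path, method, today=None):
    """Build the source comment to prepend to the converted file."""
    today = today or date.today()
    source = Path(pdf_path).as_posix()
    return f"<!-- SOURCE: {source}, extracted {today.isoformat()}, method: {method} -->"
```

Comparing modification times keeps the check cheap. A content hash would be more robust against files whose timestamps were reset by a copy or checkout, at the cost of reading every PDF.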
Error patterns to watch for
- Empty extracted text (page.extract_text() returns "") → it's scanned. Fall through to OCR.
- Unicode subscript/superscript in reportlab → renders as solid black boxes. Use <sub>/<super> XML tags instead.
- Protected PDFs → qpdf --password=... --decrypt in.pdf out.pdf first.
- Multi-column academic PDFs → pdfplumber's default extraction interleaves columns. Use page.extract_text(layout=True) or crop bboxes per column.
- Image-heavy scans → convert_from_path uses a lot of memory. Set dpi=150 for first pass, re-run at 300 only if OCR quality is poor.
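The first pattern above (empty text means the page is scanned) can be checked per page, not per file, so that only the pages that need it are sent to OCR. The helper below is a hypothetical sketch. It takes a list like the `text_by_page` built in the extraction example, and treating whitespace-only or `None` results as empty is an assumption added here.

```python
def pages_needing_ocr(text_by_page, min_chars=1):
    """Return 1-based page numbers whose extracted text looks empty.

    `min_chars` is an assumed threshold; raise it to also catch pages
    where extraction returned only a few stray characters.
    """
    return [
        page_number
        for page_number, text in enumerate(text_by_page, start=1)
        if len((text or "").strip()) < min_chars
    ]
```

The returned page numbers can then be passed to an OCR step, starting at the lower DPI and re-running at 300 only where the output is poor.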
Dependencies
Standard install for paideia use:
pip install --break-system-packages pypdf pdfplumber pytesseract pdf2image reportlab
apt-get install -y poppler-utils tesseract-ocr tesseract-ocr-kor
The Korean language pack (tesseract-ocr-kor) is needed if the user writes solutions in Korean/Hangul.
Reference
Full skill at https://github.com/anthropics/skills/tree/main/skills/pdf, with REFERENCE.md covering pypdfium2 and JavaScript libraries, and FORMS.md covering PDF form filling.