Paper Writing Bench

Name: Paper Writing Bench
Author: Ar9av

By Ar9av· Ar9av/PaperOrchestra· 0

Extract sparse ideas, dense ideas, and experimental logs from research papers for benchmarking.

Installation

1
Make sure Claude is on your device and in your terminal.
Skills load from ~/.claude/skills/ when Claude Code starts up — so you need it on your machine first. If you don't have it yet, install it once with the command below, then run claude in any terminal to verify.
One-time setup
```
npm i -g @anthropic-ai/claude-code
```
Already have it? Skip ahead.

Paste into Claude Code or into your terminal.

Install

git clone ht••••••••••••••••••••••••••••••••••••••••• •••••••••••••••••••••••••• •• ••••• •• •••••••••••••••••••••••••••••••••••••••••• •• •• •• ••••••••••••••••••••••••••••••••••••••••••••••••••••••• •••••••••••••••••••••••••••••••••••••••••••

This copies the whole skill folder into ~/.claude/skills/paper-writing-bench-ar9av/ — the SKILL.md plus any scripts, reference docs, or templates the skill ships with. Safe default: works for every skill.

Faster alternative (instruction-only skills)

Skips the clone and grabs only the SKILL.md file. Don't use this if the skill ships Python scripts, reference markdowns, or asset templates — they won't be downloaded and the skill will fail when it tries to load them.

Quick install (SKILL.md only)

mkdir -p ~/.••••••••••••••••••••••••••••••••••••••• •• •••• ••••• ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• •• •••••••••••••••••••••••••••••••••••••••••••••••••••

Restart Claude Code.
Quit and reopen Claude Code (or any other agent that loads from ~/.claude/skills/). New skills are picked up on startup.
Just ask Claude.
Skills auto-activate when your request matches the skill's description — no slash command needed. Trigger phrases live in the skill's own frontmatter; you can read them in the “What this skill does” section above.

Prefer to read the source first? Open on GitHub.

When Claude uses it

Reverse-engineer raw materials (Sparse idea, Dense idea, experimental log) from an existing AI research paper to build a benchmark case for evaluating paper-writing pipelines. Replicates the PaperWritingBench dataset construction procedure from arXiv:2604.05018 §3 / App. C. TRIGGER when the user asks to "build a benchmark case from this paper", "reverse-engineer raw materials", or "evaluate my pipeline against PaperWritingBench".

What this skill does

PaperWritingBench (§3)

Faithful implementation of the PaperWritingBench dataset construction procedure from PaperOrchestra (Song et al., 2026, arXiv:2604.05018, §3 and App. C, F.2).

The original benchmark contains 200 papers (100 CVPR 2025 + 100 ICLR 2025). For each paper, the authors reverse-engineer the (I, E) tuple by stripping narrative flow from the original PDF using the three prompts in App. F.2. You can use this skill to reverse-engineer your own benchmark cases from any paper PDF.

What this skill does

Given an existing AI research paper (PDF or markdown extract), produce:

idea.md (Sparse variant) — high-level concept note, no math, no experimental results
idea.md (Dense variant) — detailed technical proposal with LaTeX equations and variable definitions, but still no experimental results
experimental_log.md — exhaustive raw experimental setup, numeric data, and qualitative observations, with all narrative references stripped

These three files form a complete (I, E) input pair for the paper-orchestra pipeline. You can then run the pipeline and compare its output to the original paper using paper-autoraters.

Inputs

A paper PDF or extracted markdown text. The paper uses MinerU (Wang et al., 2024) for PDF→markdown extraction; you (the host agent) should use whatever PDF extractor your environment provides.
For controlled experiments, you may also extract figures separately (PDFFigures 2.0 in the paper).

Outputs

bench/<paper_id>/idea_sparse.md — Sparse variant
bench/<paper_id>/idea_dense.md — Dense variant
bench/<paper_id>/experimental_log.md — Experimental log

Workflow

For each paper, run three independent LLM calls using the verbatim prompts below:

1. Sparse idea generation

Load references/sparse-idea-prompt.md. Pass the paper text (or markdown extract) as {paper_content}. The prompt instructs the model to:

Stop extracting at empirical verification (no Experiments / Results / Comparisons)
Use first-person future tense ("We propose to explore...")
Avoid LaTeX math; describe components by function
Anonymize authors and titles

Output: idea_sparse.md with the four sections (Problem Statement, Core Hypothesis, Proposed Methodology high-level, Expected Contribution).

2. Dense idea generation

Load references/dense-idea-prompt.md. Same input. The prompt instructs the model to:

Preserve mathematical formulations using LaTeX
Define every variable used in equations
Include specific architectural choices and dimensions
Same exclusion zone (no experiments)

Output: idea_dense.md with the four sections (Problem Statement, Core Hypothesis, Proposed Methodology detailed, Expected Contribution).

3. Experimental log generation

Load references/experimental-log-prompt.md. Same input. The prompt instructs the model to:

Use past-tense persona ("We ran...", "The results were...")
Strip all references to figure/table numbers
Deconstruct tables into raw numeric data
Log figure findings as factual observations
Anonymize authors

Output: experimental_log.md with sections for Setup, Raw Numeric Data, and Qualitative Observations.

Critical rules from the prompts

These are excerpted from App. F.2. The host agent MUST honor them:

No citations. None of the three outputs may contain \cite, reference numbers, or author names from the source paper.
No URLs. Strip all hyperlinks.
Anonymize. Author identities, affiliations, acknowledgements all removed.
Self-contained. Each file must make sense without the original paper.
No experimental leakage in idea files. The Sparse and Dense ideas must stop where empirical verification begins. They describe what will be done, not what was done.
No table/figure references in experimental log. No "as shown in Table 1", "see Fig. 5". The downstream paper-orchestra pipeline will generate its own figures and tables — the log must not assume any particular ones exist.
100% numeric accuracy in experimental log. This becomes the ground truth for the section-writing-agent and content-refinement-agent's hallucination check.

How the bench is used

After producing (idea_sparse.md, idea_dense.md, experimental_log.md) for a paper:

Pick a variant (Sparse or Dense) — the paper ablates both, with Dense producing more rigorous methodology and Sparse exercising the system's robustness on under-specified inputs.
Drop the chosen idea.md, plus experimental_log.md, plus a template.tex for the target conference, plus a conference_guidelines.md, into a paper-orchestra workspace.
Run the pipeline.
Compare the generated paper against the original using paper-autoraters (citation F1, lit review quality, SxS paper quality).

Resources

references/bench-overview.md — the 200-paper bench, venue cutoffs, sizes
references/sparse-idea-prompt.md — verbatim from App. F.2
references/dense-idea-prompt.md — verbatim from App. F.2
references/experimental-log-prompt.md — verbatim from App. F.2

Related skills

Claude API Helper

anthropics

Build, debug, and optimize Claude API applications with caching and model migration support.

OfficialComplete terms in LICENSE.txt

Customer Health Scorer

alirezarezvani

Analyze customer accounts to predict churn risk and identify expansion opportunities.

MIT

CLAUDE.md Optimizer

daymade

Optimize your CLAUDE.md file for clarity, efficiency, and maintainability.

Phase Knowledge Quiz

rohitg00

Test your understanding of AI Engineering from Scratch course phases.