Paper Writing Bench

Extract sparse ideas, dense ideas, and experimental logs from research papers for benchmarking.

Installation

  1. Make sure Claude Code is installed and available in your terminal.

    Skills load from ~/.claude/skills/ when Claude Code starts up — so you need it on your machine first. If you don't have it yet, install it once with the command below, then run claude in any terminal to verify.

    One-time setup
    npm i -g @anthropic-ai/claude-code

    Already have it? Skip ahead.

  2. Paste into Claude Code or into your terminal.

    This copies the whole skill folder into ~/.claude/skills/paper-writing-bench-ar9av/ — the SKILL.md plus any scripts, reference docs, or templates the skill ships with. Safe default: works for every skill.

    Faster alternative (instruction-only skills)

    Skips the clone and grabs only the SKILL.md file. Don't use this if the skill ships Python scripts, reference markdowns, or asset templates — they won't be downloaded and the skill will fail when it tries to load them.

    Quick install (SKILL.md only)
  3. Restart Claude Code.

    Quit and reopen Claude Code (or any other agent that loads from ~/.claude/skills/). New skills are picked up on startup.

  4. Just ask Claude.

    Skills auto-activate when your request matches the skill's description, so no slash command is needed. Trigger phrases live in the skill's own frontmatter; you can read them in the “When Claude uses it” section below.
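Because this skill ships reference prompts alongside SKILL.md, it can be worth confirming after install that the whole folder arrived. The following is a minimal sketch, assuming the folder name from step 2 and the file list from the Resources section; the helper name `missing_skill_files` is hypothetical and not part of the skill.

```python
from pathlib import Path

# Folder name taken from the install step; the file list mirrors the
# Resources section of this page. Adjust both if your copy differs.
SKILL_DIR = Path.home() / ".claude" / "skills" / "paper-writing-bench-ar9av"
EXPECTED_FILES = [
    "SKILL.md",
    "references/bench-overview.md",
    "references/sparse-idea-prompt.md",
    "references/dense-idea-prompt.md",
    "references/experimental-log-prompt.md",
]


def missing_skill_files(skill_dir: Path = SKILL_DIR) -> list[str]:
    """Return the expected files that are absent from skill_dir."""
    return [name for name in EXPECTED_FILES if not (skill_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_skill_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected skill files are present.")
```

If the reference prompts are reported missing, the SKILL.md-only quick install was probably used; reinstall with the full-folder method.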


When Claude uses it

Reverse-engineer raw materials (Sparse idea, Dense idea, experimental log) from an existing AI research paper to build a benchmark case for evaluating paper-writing pipelines. Replicates the PaperWritingBench dataset construction procedure from arXiv:2604.05018 §3 / App. C. TRIGGER when the user asks to "build a benchmark case from this paper", "reverse-engineer raw materials", or "evaluate my pipeline against PaperWritingBench".

What this skill does

PaperWritingBench (§3)

Faithful implementation of the PaperWritingBench dataset construction procedure from PaperOrchestra (Song et al., 2026, arXiv:2604.05018, §3 and App. C, F.2).

The original benchmark contains 200 papers (100 CVPR 2025 + 100 ICLR 2025). For each paper, the authors reverse-engineer the (I, E) tuple by stripping narrative flow from the original PDF using the three prompts in App. F.2. You can use this skill to reverse-engineer your own benchmark cases from any paper PDF.

What this skill does

Given an existing AI research paper (PDF or markdown extract), produce:

  • idea.md (Sparse variant) — high-level concept note, no math, no experimental results
  • idea.md (Dense variant) — detailed technical proposal with LaTeX equations and variable definitions, but still no experimental results
  • experimental_log.md — exhaustive raw experimental setup, numeric data, and qualitative observations, with all narrative references stripped

These three files supply the (I, E) input pair for the paper-orchestra pipeline: one idea variant (I, either Sparse or Dense) plus the experimental log (E). You can then run the pipeline and compare its output to the original paper using paper-autoraters.

Inputs

  • A paper PDF or extracted markdown text. The paper uses MinerU (Wang et al., 2024) for PDF→markdown extraction; you (the host agent) should use whatever PDF extractor your environment provides.
  • For controlled experiments, you may also extract figures separately (the paper uses PDFFigures 2.0 for this).

Outputs

  • bench/<paper_id>/idea_sparse.md — Sparse variant
  • bench/<paper_id>/idea_dense.md — Dense variant
  • bench/<paper_id>/experimental_log.md — Experimental log

Workflow

For each paper, run three independent LLM calls, one for each verbatim prompt in references/:

1. Sparse idea generation

Load references/sparse-idea-prompt.md. Pass the paper text (or markdown extract) as {paper_content}. The prompt instructs the model to:

  • Stop extracting at empirical verification (no Experiments / Results / Comparisons)
  • Use first-person future tense ("We propose to explore...")
  • Avoid LaTeX math; describe components by function
  • Anonymize authors and titles

Output: idea_sparse.md with four sections: Problem Statement, Core Hypothesis, Proposed Methodology (high-level), and Expected Contribution.

2. Dense idea generation

Load references/dense-idea-prompt.md. Same input. The prompt instructs the model to:

  • Preserve mathematical formulations using LaTeX
  • Define every variable used in equations
  • Include specific architectural choices and dimensions
  • Apply the same exclusion zone as the Sparse variant (no experiments, results, or comparisons)

Output: idea_dense.md with four sections: Problem Statement, Core Hypothesis, Proposed Methodology (detailed), and Expected Contribution.
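The section list shared by both idea variants, and the no-LaTeX rule for the Sparse variant, can be checked mechanically. This is a rough sketch, assuming section titles appear literally in the generated text; `check_idea` and the math-detection regex are illustrative heuristics, not part of the skill or the paper's prompts.

```python
import re

# The four section titles named in the Output lines for both idea variants.
IDEA_SECTIONS = [
    "Problem Statement",
    "Core Hypothesis",
    "Proposed Methodology",
    "Expected Contribution",
]

# Heuristic markers of LaTeX math: $...$, \( , \[ , and equation/align
# environments. Plain dollar amounts can trigger a false positive.
LATEX_MATH = re.compile(r"\$[^$\n]+\$|\\\(|\\\[|\\begin\{(?:equation|align)\*?\}")


def check_idea(text: str, variant: str) -> list[str]:
    """Return a list of problems found in an idea file ('sparse' or 'dense')."""
    lowered = text.lower()
    problems = [
        f"missing section: {name}"
        for name in IDEA_SECTIONS
        if name.lower() not in lowered
    ]
    # Only the Sparse variant must avoid LaTeX math; Dense is expected to use it.
    if variant == "sparse" and LATEX_MATH.search(text):
        problems.append("Sparse idea contains LaTeX math")
    return problems
```

An empty list means the file passed these two checks only; it says nothing about experimental leakage, which still needs a read-through.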

3. Experimental log generation

Load references/experimental-log-prompt.md. Same input. The prompt instructs the model to:

  • Use a past-tense persona ("We ran...", "The results were...")
  • Strip all references to figure/table numbers
  • Deconstruct tables into raw numeric data
  • Log figure findings as factual observations
  • Anonymize authors

Output: experimental_log.md with sections for Setup, Raw Numeric Data, and Qualitative Observations.
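The three steps above can be sketched as a single driver. This is a minimal sketch under stated assumptions: `call_llm` is a hypothetical callable you supply (prompt string in, completion string out), `build_case` is an illustrative name, and the prompt files are assumed to contain the literal placeholder {paper_content}.

```python
from pathlib import Path
from typing import Callable

# (prompt file, output file) pairs as listed in the workflow steps.
STEPS = [
    ("sparse-idea-prompt.md", "idea_sparse.md"),
    ("dense-idea-prompt.md", "idea_dense.md"),
    ("experimental-log-prompt.md", "experimental_log.md"),
]


def build_case(
    paper_text: str,
    paper_id: str,
    call_llm: Callable[[str], str],  # hypothetical: your own model client
    references_dir: Path = Path("references"),
    bench_root: Path = Path("bench"),
) -> list[Path]:
    """Run the three independent calls and write bench/<paper_id>/*.md."""
    out_dir = bench_root / paper_id
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for prompt_name, output_name in STEPS:
        template = (references_dir / prompt_name).read_text(encoding="utf-8")
        # str.replace instead of str.format: prompts and papers may contain
        # other braces (for example LaTeX), which format() would reject.
        prompt = template.replace("{paper_content}", paper_text)
        out_path = out_dir / output_name
        # Each call sees only its own prompt, keeping the calls independent.
        out_path.write_text(call_llm(prompt), encoding="utf-8")
        written.append(out_path)
    return written
```

Passing the model client as a parameter keeps the sketch independent of any particular SDK and makes it easy to test with a stub.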

Critical rules from the prompts

These are excerpted from App. F.2. The host agent MUST honor them:

  • No citations. None of the three outputs may contain \cite, reference numbers, or author names from the source paper.
  • No URLs. Strip all hyperlinks.
  • Anonymize. Remove author identities, affiliations, and acknowledgements.
  • Self-contained. Each file must make sense without the original paper.
  • No experimental leakage in idea files. The Sparse and Dense ideas must stop where empirical verification begins. They describe what will be done, not what was done.
  • No table/figure references in experimental log. No "as shown in Table 1", "see Fig. 5". The downstream paper-orchestra pipeline will generate its own figures and tables — the log must not assume any particular ones exist.
  • 100% numeric accuracy in experimental log. The log becomes the ground truth for the section-writing-agent and for the content-refinement-agent's hallucination check.
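Several of the rules above lend themselves to an automated first pass. The sketch below is a heuristic lint, not the paper's procedure: the regexes, `lint_output`, and `numbers_not_in_source` are all illustrative assumptions. The numeric check only flags numbers in the log that never appear in the source text, so it catches invented values but not values attached to the wrong method or metric.

```python
import re

# Heuristic patterns for three of the rules. Expect some false positives
# (for example, "[3]" used as an array index is flagged as a citation).
RULE_PATTERNS = {
    "citation": re.compile(r"\\cite[a-zA-Z]*\{|\[\d+(?:\s*[,-]\s*\d+)*\]"),
    "url": re.compile(r"https?://\S+|www\.\S+"),
    "figure/table reference": re.compile(
        r"\b(?:Fig(?:ure)?s?\.?|Tab(?:le)?s?\.?)\s*\d+", re.IGNORECASE
    ),
}

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def lint_output(text: str, is_experimental_log: bool) -> list[str]:
    """Return 'rule: matched text' findings for one generated file."""
    findings = []
    for rule, pattern in RULE_PATTERNS.items():
        # The figure/table rule is stated for the experimental log only.
        if rule == "figure/table reference" and not is_experimental_log:
            continue
        for match in pattern.finditer(text):
            findings.append(f"{rule}: {match.group(0)}")
    return findings


def numbers_not_in_source(log_text: str, paper_text: str) -> list[str]:
    """Return numbers in the log that never occur in the source paper text."""
    source_numbers = set(NUMBER.findall(paper_text))
    return sorted(set(NUMBER.findall(log_text)) - source_numbers)
```

Treat a clean result as a necessary condition rather than proof of compliance; anonymization and self-containedness still need manual review.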

How the bench is used

After producing (idea_sparse.md, idea_dense.md, experimental_log.md) for a paper:

  1. Pick a variant (Sparse or Dense) — the paper ablates both, with Dense producing more rigorous methodology and Sparse exercising the system's robustness on under-specified inputs.
  2. Drop the chosen idea.md, plus experimental_log.md, plus a template.tex for the target conference, plus a conference_guidelines.md, into a paper-orchestra workspace.
  3. Run the pipeline.
  4. Compare the generated paper against the original using paper-autoraters (citation F1, literature review quality, side-by-side (SxS) paper quality).
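Steps 1 and 2 above can be sketched as a small copy routine. The flat workspace layout and the function name `assemble_workspace` are assumptions; only the four file names come from the text, so check what your paper-orchestra workspace actually expects before relying on this.

```python
import shutil
from pathlib import Path


def assemble_workspace(
    case_dir: Path,       # e.g. bench/<paper_id>
    variant: str,         # "sparse" or "dense"
    template_tex: Path,   # LaTeX template for the target conference
    guidelines_md: Path,  # conference guidelines file
    workspace: Path,      # assumed flat layout; adjust to your pipeline
) -> Path:
    """Copy one idea variant plus the log, template, and guidelines."""
    if variant not in ("sparse", "dense"):
        raise ValueError(f"unknown variant: {variant!r}")
    workspace.mkdir(parents=True, exist_ok=True)
    # The chosen variant is stored under the generic name idea.md.
    shutil.copyfile(case_dir / f"idea_{variant}.md", workspace / "idea.md")
    shutil.copyfile(case_dir / "experimental_log.md", workspace / "experimental_log.md")
    shutil.copyfile(template_tex, workspace / "template.tex")
    shutil.copyfile(guidelines_md, workspace / "conference_guidelines.md")
    return workspace
```

Using one workspace per variant keeps Sparse and Dense runs from overwriting each other's idea.md.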

Resources

  • references/bench-overview.md — the 200-paper bench, venue cutoffs, sizes
  • references/sparse-idea-prompt.md — verbatim from App. F.2
  • references/dense-idea-prompt.md — verbatim from App. F.2
  • references/experimental-log-prompt.md — verbatim from App. F.2
