paper2code: auditing ambiguity in ML paper code generation with citation-anchored implementations

Every time you try to reproduce a machine learning paper from its PDF, you hit the same wall: missing details, vague descriptions, and guesswork. paper2code tackles this head-on: it generates code from arXiv paper URLs and audits every implementation choice for ambiguity. Instead of silently filling gaps or hallucinating details, it flags what the paper leaves unspecified and embeds citations and alternatives inline. The repo is worth a look if you care about traceable, citation-anchored ML implementations that expose uncertainty rather than hide it.

What paper2code does and how it’s architected

paper2code is a Claude Code skill, an agent plugin that turns arXiv paper URLs into structured Python projects implementing the paper’s models. Unlike typical LLM code generation, which guesses at missing details, it runs an ambiguity audit before writing any code.

The system classifies every implementation choice as SPECIFIED, PARTIALLY_SPECIFIED, or UNSPECIFIED. Each classification appears as an inline comment in the generated code, anchored to the relevant section or equation of the paper. The output is an organized project directory that includes model.py, loss.py, train.py, a configs/base.yaml configuration file, and a Jupyter notebook walkthrough.
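To make the idea of citation-anchored comments concrete, here is a minimal sketch of what annotated code in a model.py-style file could look like. The tag names come from the description above; the section and equation anchors, the function names, and the comment layout are all assumptions made for illustration, not actual paper2code output.

```python
import math

# Illustrative only: anchors such as "Sec. 3.2, Eq. 1" refer to a
# hypothetical paper, and the comment layout is an assumption.


def scale_scores(scores, d_k):
    # [SPECIFIED] Sec. 3.2, Eq. 1: scores are divided by sqrt(d_k).
    return [s / math.sqrt(d_k) for s in scores]


def softmax(xs):
    # [PARTIALLY_SPECIFIED] Sec. 3.2: softmax is named, but numerical
    # stabilisation is not discussed. Subtracting the max is assumed here.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def xavier_uniform_bound(fan_in, fan_out):
    # [UNSPECIFIED] The paper gives no weight initialisation scheme.
    # Alternatives: Xavier/Glorot uniform, Kaiming normal, N(0, 0.02).
    # Placeholder choice: the Xavier/Glorot uniform bound.
    return math.sqrt(6.0 / (fan_in + fan_out))
```

Each comment states both the status of the choice and where a reader can check it, so a reviewer can go from a line of code straight to the passage it claims to implement.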

Under the hood, paper2code acts as an agent skill within the Claude Code environment. It leverages the agent’s ability to parse the paper, identify ambiguities, and generate code that is both traceable and reproducible. The citation anchoring connects code snippets directly back to the source paper, which is critical for validation and debugging in research reproduction.

The ambiguity audit system: a technical deep dive

What sets paper2code apart is its explicit auditing of ambiguity before any code is generated. Most LLM-driven code generators for ML papers tend to fill missing details silently, often leading to implementations that diverge from the paper’s intention without the user realizing it.

paper2code trades completeness for transparency. Every uncertain or unspecified design decision is marked with an [UNSPECIFIED] comment in the code, listing possible alternatives. This forces users to confront these gaps rather than pretend they don’t exist. For instance, if the paper doesn’t specify an optimizer or learning rate schedule, the generated train.py will include a commented list of plausible choices rather than assuming defaults.
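One way to picture the optimizer example is the sketch below. It assumes a hypothetical train.py-style excerpt in which the unspecified choices are listed as alternatives and the caller must pick one explicitly. The names UNSPECIFIED_CHOICES and resolve_choices, the listed alternatives, and the fail-loudly behaviour are illustrative assumptions, not something the repo is confirmed to generate.

```python
# Hypothetical excerpt in the spirit of a generated train.py.
# [UNSPECIFIED] Optimizer: not named in the paper.
#   Alternatives: adam, adamw, sgd
# [UNSPECIFIED] LR schedule: not described in the paper.
#   Alternatives: constant, cosine, linear_warmup
UNSPECIFIED_CHOICES = {
    "optimizer": ("adam", "adamw", "sgd"),
    "lr_schedule": ("constant", "cosine", "linear_warmup"),
}


def resolve_choices(config: dict) -> dict:
    """Return the user's explicit picks; never fall back to a default."""
    resolved = {}
    for key, allowed in UNSPECIFIED_CHOICES.items():
        value = config.get(key)
        if value is None:
            raise ValueError(
                f"{key} is [UNSPECIFIED] in the paper; choose one of {allowed}"
            )
        if value not in allowed:
            raise ValueError(
                f"{key}={value!r} is not among the listed alternatives {allowed}"
            )
        resolved[key] = value
    return resolved
```

Calling resolve_choices({"optimizer": "adamw", "lr_schedule": "cosine"}) returns the two picks unchanged, while an empty config raises a ValueError that names the missing decision.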

This audit system breaks down choices into three clear buckets:

SPECIFIED: The paper clearly defines this aspect, and the code reflects it verbatim.
PARTIALLY_SPECIFIED: Some information is given but leaves room for interpretation.
UNSPECIFIED: The paper does not specify this, so paper2code flags it and lists alternatives.
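The three buckets map naturally onto a small data model. The sketch below is one possible representation, assuming hypothetical names (Status, AuditEntry, as_comment) and a made-up comment format; it does not describe how paper2code stores its audit internally.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    SPECIFIED = "SPECIFIED"
    PARTIALLY_SPECIFIED = "PARTIALLY_SPECIFIED"
    UNSPECIFIED = "UNSPECIFIED"


@dataclass
class AuditEntry:
    """One implementation choice and how well the paper pins it down."""

    choice: str
    status: Status
    anchor: str = ""  # e.g. "Sec. 4.1"; empty when the paper is silent
    alternatives: List[str] = field(default_factory=list)

    def as_comment(self) -> str:
        """Render the entry as a single inline Python comment."""
        parts = [f"# [{self.status.value}] {self.choice}"]
        if self.anchor:
            parts.append(f"({self.anchor})")
        if self.alternatives:
            parts.append("alternatives: " + ", ".join(self.alternatives))
        return " ".join(parts)


entry = AuditEntry("optimizer", Status.UNSPECIFIED, alternatives=["Adam", "SGD"])
print(entry.as_comment())
```

Keeping the anchor and the alternatives in separate fields means a SPECIFIED entry carries a citation and no alternatives, while an UNSPECIFIED entry carries alternatives and no citation.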

Because the classification lives in the code itself, a reader sees each assumption at the point where it takes effect, which makes the reproduction process more honest. It also helps avoid the silent assumptions that plague many ML repos.

The tradeoff is clear: the output is not guaranteed to run out of the box or to match baseline results exactly. Instead, it prioritizes traceability and invites manual intervention where necessary.
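Since manual intervention is expected, it can help to collect every open flag into a checklist before a first training run. The helper below is a minimal sketch that scans a generated project for tagged comments. It assumes the tags appear as `# [UNSPECIFIED] ...` or `# [PARTIALLY_SPECIFIED] ...` on a single line in .py files, and the function name find_open_questions is hypothetical, not part of paper2code.

```python
import re
from pathlib import Path

# Assumed tag layout: a Python comment that starts with the bracketed status.
TAG = re.compile(r"#\s*\[(UNSPECIFIED|PARTIALLY_SPECIFIED)\]\s*(.*)")


def find_open_questions(project_dir):
    """Return (relative_path, line_number, status, note) for each flagged line."""
    hits = []
    for path in sorted(Path(project_dir).rglob("*.py")):
        text = path.read_text(encoding="utf-8")
        for lineno, line in enumerate(text.splitlines(), 1):
            match = TAG.search(line)
            if match:
                rel = str(path.relative_to(project_dir))
                hits.append((rel, lineno, match.group(1), match.group(2).strip()))
    return hits
```

Running it over the project root yields one tuple per flagged line, which can be pasted into an issue tracker or an experiment plan.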

Quick start

To install paper2code as a Claude Code skill, run the following command exactly as shown:

npx skills add PrathamLearnsToCode/paper2code/skills/paper2code

You’ll be prompted to:

Select agents — pick the coding agents you want to use this skill with (e.g., Claude Code)
Choose scope — Global (recommended) or project-level
Choose method — Symlink (recommended) or copy

Once installed, open your agent and run the skill:

claude  # or your preferred agent

From there you can give the skill an arXiv URL and receive a structured, citation-anchored Python project that implements the paper with ambiguity annotations.

Verdict: who benefits from paper2code

paper2code is a tool for researchers, ML engineers, and practitioners who want to reproduce or build upon academic papers with a clear understanding of what is explicitly defined and what isn’t. It’s especially useful in research settings where traceability and auditability trump turnkey solutions.

The repo’s explicit refusal to silently fill in blanks is both its strength and its limitation. Users looking for a plug-and-play codebase may find that the [UNSPECIFIED] flags are inconvenient and require extra manual work. But those interested in honest, transparent implementations that hold up under scrutiny will appreciate the tradeoff.

From a practitioner’s perspective, paper2code surfaces the often-hidden ambiguity in ML papers, making it easier to identify where you need to experiment or consult the original authors. It’s less about shortcutting development and more about improving the foundation for reproducible ML research.

If your workflow involves reproducing academic models or auditing ML code against papers, paper2code offers a fresh, structured approach worth exploring.

→ GitHub Repo: PrathamLearnsToCode/paper2code ⭐ 1,225 · Python

Noureddine RAMDI / paper2code: auditing ambiguity in ML paper code generation with citation-anchored implementations
