Tool quality results
Level 1 tests whether a model produces valid tool calls with correct arguments. No documents are executed; only the decision is evaluated. Each of the 31 tests runs twice per model: once with the full SuperDoc system prompt and once with a minimal two-line prompt.

| Model | Full prompt | Minimal prompt | Gap |
|---|---|---|---|
| GPT-5.4 (OpenAI) | 100% (31/31) | 74% (23/31) | -26 |
| GPT-4o (OpenAI) | 97% (30/31) | 42% (13/31) | -55 |
| Amazon Nova 2.0-Lite (AWS) | 94% (29/31) | 74% (23/31) | -20 |
| Gemini 2.5 Flash (Google) | 94% (29/31) | 61% (19/31) | -33 |
| Claude Haiku 4.5 (Anthropic) | 90% (28/31) | 77% (24/31) | -13 |
| GPT-4.1-mini (OpenAI) | 90% (28/31) | 61% (19/31) | -29 |
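As a quick check of the table's arithmetic: the gap is the minimal-prompt score minus the full-prompt score, with each score rounded to a whole percentage point. A minimal sketch (the helper names are ours, not part of the eval suite):

```python
def pct(passed: int, total: int = 31) -> int:
    """Pass rate as a whole percentage of the 31 Level 1 tests."""
    return round(100 * passed / total)

def prompt_gap(full_passed: int, minimal_passed: int) -> int:
    """Gap column: minimal-prompt score minus full-prompt score."""
    return pct(minimal_passed) - pct(full_passed)

# GPT-4o: 30/31 with the full prompt, 13/31 with the minimal one.
print(prompt_gap(30, 13))  # -55
```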
End-to-end results
Level 2 runs 21 tests against real `.docx` fixtures. Each test executes a full agent loop: open the document, let the LLM pick tools, execute them via the CLI, and verify that the output file changed correctly.
Three models were tested end-to-end: GPT-5.4, Claude Haiku 4.5, and Gemini 2.5 Pro.
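In outline, each Level 2 test boils down to the loop below. This is a toy sketch with the LLM and CLI both stubbed out; none of these function names are SuperDoc APIs.

```python
# Toy sketch of the Level 2 agent loop (hypothetical names, not SuperDoc's API).
# A real run would call an LLM and the SuperDoc CLI; here both are stubbed.

def fake_llm(messages):
    """Stub LLM: emits one replace tool call, then stops."""
    if any(m.get("role") == "tool" for m in messages):
        return None  # the model has seen a tool result and is done
    return {"tool": "mutate", "args": {"find": "DRAFT", "replace": "FINAL"}}

def fake_cli(doc, call):
    """Stub CLI executor: applies the tool call to the in-memory doc."""
    return doc.replace(call["args"]["find"], call["args"]["replace"])

def run_agent_loop(doc, task):
    messages = [{"role": "user", "content": task}]
    while (call := fake_llm(messages)) is not None:
        doc = fake_cli(doc, call)
        messages.append({"role": "tool", "content": "ok"})
    return doc

result = run_agent_loop("DRAFT contract", "Replace DRAFT with FINAL")
assert "FINAL" in result and "DRAFT" not in result  # contains / not-contains check
```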
Tests span 10 categories:
- Reading – text search, node search, full extraction, counting
- Mutation – replace, delete, multi-step atomic edits, first-occurrence targeting
- Formatting – bold, highlight, underline via `format.apply`
- Structure – headings, paragraphs via `creategroup`
- Tables – table discovery and creation
- Comments – comment insertion and discovery
- Tracked changes – `tracked` vs. `direct` change modes
- Lists – list discovery and creation
- Hygiene / efficiency – no hallucinated parameters, undo usage, call count limits
- Aspirational – page margins, TOC, cell merging, hyperlinks, liability analysis
Per-category breakdowns by model will be published as the eval suite matures. The current results validate that all three end-to-end models handle the core categories (reading, mutation, formatting) reliably.
The prompt gap
The "gap" column measures how much a model depends on detailed prompt engineering. A smaller gap means the model interprets SuperDoc's tool definitions well on its own.
- The full system prompt lifts every model by 13–55 percentage points.
- GPT-5.4 is the only model that reaches 100% with the full prompt.
- Claude Haiku 4.5 has the smallest gap (-13 points): it degrades least when the prompt is sparse. This makes it the most robust choice if you need to trim the system prompt for token budget reasons.
- GPT-4o is the most prompt-dependent (55-point gap). It scores 97% with the full prompt but drops to 42% without it.
Methodology
Results come from a Promptfoo-based eval framework running against SuperDoc's grouped intent tool set: 9 meta-tools that cover all 360+ underlying operations.
- Level 1 (tool quality): 31 synthetic test cases × 2 prompts × 6 models = 372 assertions. Gemini 2.5 Pro was tested only in Level 2. A test passes when the model produces a structurally valid tool call with correct arguments. Assertion types include `tool-call-f1` (set comparison), custom JavaScript checks, and latency thresholds.
- Level 2 (end-to-end): 21 tests across 3 real `.docx` fixture documents. Each test runs a full agent loop and checks the resulting document content with `contains`/`not-contains` assertions.
- Models tested: GPT-5.4, GPT-4o, GPT-4.1-mini (OpenAI); Claude Haiku 4.5 (Anthropic); Gemini 2.5 Flash, Gemini 2.5 Pro (Google); Amazon Nova 2.0-Lite (AWS).
- Configuration: temperature 0, seed 42, `tool_choice` required, 30s timeout (Level 1) / 120s timeout (Level 2).
- Eval date: March 2026.
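For intuition on the `tool-call-f1` assertion: it can be read as an F1 score over the set of expected versus emitted tool calls, giving partial credit when some calls match. The sketch below is our own illustration of that idea, not Promptfoo's actual implementation.

```python
def tool_call_f1(expected: set, actual: set) -> float:
    """F1 over tool-call sets: 1.0 for an exact match, partial credit otherwise."""
    if not expected and not actual:
        return 1.0
    tp = len(expected & actual)  # emitted calls that match the expected set
    if tp == 0:
        return 0.0
    precision = tp / len(actual)
    recall = tp / len(expected)
    return 2 * precision * recall / (precision + recall)

# One of two expected calls emitted: precision 1.0, recall 0.5, so F1 is about 0.67.
score = tool_call_f1({"format.apply", "creategroup"}, {"format.apply"})
```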
Related
- LLM tools – Tool definitions, dispatch, and the system prompt reference
- Available operations – All Document API operations by namespace
- Skills – Reusable prompt templates for common document editing tasks
- Integrations – Connect to AWS Bedrock, Vercel AI SDK, and other providers

