This repository benchmarks the cost and tool-use efficiency of answering documentation questions with two different retrieval strategies:
- Context7: Claude Code is restricted to the Context7 MCP tools.
- Web search: Claude Code is restricted to WebSearch and WebFetch.
For each question, the runner records the generated response, total estimated cost, token counts, and tool calls.
The benchmark runner is src/claudeCode.ts.
For a selected question set, it:
- Loads questions from questions/<question-set>.txt.
- Runs every question once using Context7.
- Waits 5.5 minutes so Anthropic prompt-cache effects do not carry into the next run.
- Runs every question once using web search.
- Writes per-question results and an aggregate summary to results/.
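The sequence above can be sketched roughly as follows. This is a minimal illustration, not the code in src/claudeCode.ts: the names `runBenchmark`, `RunnerDeps`, and `CACHE_COOLDOWN_MS` are hypothetical, and file I/O and the Claude Code calls are passed in as functions so the ordering is easy to see.

```typescript
// Hypothetical sketch of the runner's control flow; the real runner may differ.
type Mode = "context7" | "search";

interface RunnerDeps {
  loadQuestions: (questionSet: string) => Promise<string[]>;
  runQuestion: (mode: Mode, question: string) => Promise<unknown>;
  writeResults: (
    mode: Mode,
    questionSet: string,
    results: Record<string, unknown>
  ) => Promise<void>;
  sleep: (ms: number) => Promise<void>;
}

// 5.5 minutes in milliseconds, the wait between the two passes.
const CACHE_COOLDOWN_MS = 5.5 * 60 * 1000;

async function runBenchmark(questionSet: string, deps: RunnerDeps): Promise<void> {
  const questions = await deps.loadQuestions(questionSet);

  // Run every question once in the given mode, keyed by question text.
  const runPass = async (mode: Mode): Promise<Record<string, unknown>> => {
    const results: Record<string, unknown> = {};
    for (const question of questions) {
      results[question] = await deps.runQuestion(mode, question);
    }
    return results;
  };

  const context7Results = await runPass("context7");
  await deps.sleep(CACHE_COOLDOWN_MS); // let prompt-cache effects expire
  const searchResults = await runPass("search");

  // Assumption: results are written after both passes, following the list order.
  await deps.writeResults("context7", questionSet, context7Results);
  await deps.writeResults("search", questionSet, searchResults);
}
```

Passing the dependencies in keeps the sketch free of assumptions about the question file format or the SDK calls the runner actually makes.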
.
├── questions/
│ ├── questions1.txt
│ ├── questions2.txt
│ ├── questions3.txt
│ ├── questions4.txt
│ └── questions5.txt
├── src/
│ ├── claudeCode.ts
│ └── types.ts
├── package.json
└── tsconfig.json
Generated results are written to:
results/
├── context7/<question-set>.json
└── search/<question-set>.json
Install dependencies:
npm install
Create a .env file in the repository root:
ANTHROPIC_SEARCH_API_KEY=your_anthropic_key_for_search_runs
ANTHROPIC_C7_API_KEY=your_anthropic_key_for_context7_runs
CONTEXT7_API_KEY=your_context7_api_key
Two Anthropic key variables are used so search and Context7 runs can be tracked or billed separately. They may point to the same key if you do not need that separation.
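One way the per-strategy key selection could look is sketched below. The variable names come from the .env example; the helper `anthropicKeyFor` and the `RetrievalMode` type are hypothetical and only illustrate the idea of one Anthropic key per strategy.

```typescript
// Hypothetical helper; the actual lookup in the runner may be different.
type RetrievalMode = "context7" | "search";

// Environment variable that holds the Anthropic key for each strategy.
const KEY_VARIABLE: Record<RetrievalMode, string> = {
  context7: "ANTHROPIC_C7_API_KEY",
  search: "ANTHROPIC_SEARCH_API_KEY",
};

function anthropicKeyFor(
  mode: RetrievalMode,
  env: Record<string, string | undefined>
): string {
  const name = KEY_VARIABLE[mode];
  const value = env[name];
  if (!value) {
    // Fail early with the variable name so a missing .env entry is obvious.
    throw new Error(`Missing ${name} in environment`);
  }
  return value;
}
```

In a Node process this would be called as `anthropicKeyFor("search", process.env)` once the .env file has been loaded. Setting both variables to the same value gives a single shared key.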
Run one of the five question sets by passing its name to npm run benchmark. Valid question sets are:
questions1
questions2
questions3
questions4
questions5
Example:
npm run benchmark -- questions5
Each result file is a JSON object keyed by the exact question text. Each question record contains:
{
"generatedResponse": "The model response text...",
"totalCost": 0,
"inputTokens": 0,
"outputTokens": 0,
"cacheCreationInputTokens": 0,
"cacheReadInputTokens": 0,
"totalTokens": 0,
"toolCallCount": 0,
"toolNames": []
}
Each file also includes a summary record:
{
"summary": {
"type": "summary",
"questionCount": 20,
"averages": {
"totalCost": 0,
"inputTokens": 0,
"outputTokens": 0,
"cacheCreationInputTokens": 0,
"cacheReadInputTokens": 0,
"totalTokens": 0,
"toolCallCount": 0
}
}
}
- Results depend on the current model, SDK behavior, and API pricing.
- The runner waits 5.5 minutes between Context7 and search runs to reduce prompt-cache carryover, but provider-side behavior can still affect measurements.
- Existing result files are updated in place. Re-running the same question set overwrites records for matching question text.
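The record shapes and the in-place update behavior can be sketched in TypeScript as below. The field names are taken from the JSON examples; the function `mergeResults`, the type names, and the use of a plain arithmetic mean for the summary are assumptions for illustration, not a description of src/types.ts or of how the runner computes its averages.

```typescript
interface QuestionRecord {
  generatedResponse: string;
  totalCost: number;
  inputTokens: number;
  outputTokens: number;
  cacheCreationInputTokens: number;
  cacheReadInputTokens: number;
  totalTokens: number;
  toolCallCount: number;
  toolNames: string[];
}

const AVERAGED_FIELDS = [
  "totalCost", "inputTokens", "outputTokens", "cacheCreationInputTokens",
  "cacheReadInputTokens", "totalTokens", "toolCallCount",
] as const;
type AveragedField = (typeof AVERAGED_FIELDS)[number];

interface Summary {
  type: "summary";
  questionCount: number;
  averages: Record<AveragedField, number>;
}

// A result file: question text -> record, plus a "summary" entry.
type ResultFile = Record<string, QuestionRecord | Summary>;

function isSummary(entry: QuestionRecord | Summary): entry is Summary {
  return (entry as Summary).type === "summary";
}

// Hypothetical merge: fresh records overwrite existing ones with the same
// question text, then the summary is recomputed over everything kept.
function mergeResults(
  existing: ResultFile,
  fresh: Record<string, QuestionRecord>
): ResultFile {
  const records: Record<string, QuestionRecord> = {};
  for (const [key, entry] of Object.entries(existing)) {
    if (!isSummary(entry)) records[key] = entry;
  }
  Object.assign(records, fresh);

  const all = Object.values(records);
  const averages = {} as Record<AveragedField, number>;
  for (const field of AVERAGED_FIELDS) {
    const sum = all.reduce((acc, record) => acc + record[field], 0);
    averages[field] = all.length > 0 ? sum / all.length : 0; // assumed: plain mean
  }

  const merged: ResultFile = { ...records };
  merged["summary"] = { type: "summary", questionCount: all.length, averages };
  return merged;
}
```

Because records are keyed by the exact question text, editing a question's wording between runs would leave the old record in the file alongside the new one.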