Draft
9 changes: 9 additions & 0 deletions README.md
@@ -859,12 +859,14 @@ The following sets of tools are available:
- **list_issue_types** - List available issue types
- **Required OAuth Scopes**: `read:org`
- **Accepted OAuth Scopes**: `admin:org`, `read:org`, `write:org`
- `fields`: Subset of issue type fields to return for each issue type. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
- `owner`: The organization owner of the repository (string, required)

- **list_issues** - List issues
- **Required OAuth Scopes**: `repo`
- `after`: Cursor for pagination. Use the endCursor from the previous page's PageInfo for GraphQL APIs. (string, optional)
- `direction`: Order direction. If provided, the 'orderBy' also needs to be provided. (string, optional)
- `fields`: Subset of issue fields to return for each issue. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
- `labels`: Filter by labels (string[], optional)
- `orderBy`: Order issues by field. If provided, the 'direction' also needs to be provided. (string, optional)
- `owner`: Repository owner (string, required)
@@ -875,6 +877,7 @@ The following sets of tools are available:

- **search_issues** - Search issues
- **Required OAuth Scopes**: `repo`
- `fields`: Subset of issue fields to return for each result. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
- `order`: Sort order (string, optional)
- `owner`: Optional repository owner. If provided with repo, only issues for this repository are listed. (string, optional)
- `page`: Page number for pagination (min 1) (number, optional)
@@ -1074,6 +1077,7 @@ The following sets of tools are available:
- **Required OAuth Scopes**: `repo`
- `base`: Filter by base branch (string, optional)
- `direction`: Sort direction (string, optional)
- `fields`: Subset of pull request fields to return for each pull request. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
- `head`: Filter by head user/org and branch (string, optional)
- `owner`: Repository owner (string, required)
- `page`: Page number for pagination (min 1) (number, optional)
@@ -1123,6 +1127,7 @@ The following sets of tools are available:

- **search_pull_requests** - Search pull requests
- **Required OAuth Scopes**: `repo`
- `fields`: Subset of pull request fields to return for each result. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
- `order`: Sort order (string, optional)
- `owner`: Optional repository owner. If provided with repo, only pull requests for this repository are listed. (string, optional)
- `page`: Page number for pagination (min 1) (number, optional)
@@ -1232,6 +1237,7 @@ The following sets of tools are available:

- **list_branches** - List branches
- **Required OAuth Scopes**: `repo`
- `fields`: Subset of branch fields to return for each branch. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
- `owner`: Repository owner (string, required)
- `page`: Page number for pagination (min 1) (number, optional)
- `perPage`: Results per page for pagination (min 1, max 100) (number, optional)
@@ -1240,6 +1246,7 @@ The following sets of tools are available:
- **list_commits** - List commits
- **Required OAuth Scopes**: `repo`
- `author`: Author username or email address to filter commits by (string, optional)
- `fields`: Subset of commit fields to return for each commit. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
- `owner`: Repository owner (string, required)
- `page`: Page number for pagination (min 1) (number, optional)
- `path`: Only commits containing this file path will be returned (string, optional)
@@ -1251,13 +1258,15 @@ The following sets of tools are available:

- **list_releases** - List releases
- **Required OAuth Scopes**: `repo`
- `fields`: Subset of release fields to return for each release. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
- `owner`: Repository owner (string, required)
- `page`: Page number for pagination (min 1) (number, optional)
- `perPage`: Results per page for pagination (min 1, max 100) (number, optional)
- `repo`: Repository name (string, required)

- **list_tags** - List tags
- **Required OAuth Scopes**: `repo`
- `fields`: Subset of tag fields to return for each tag. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
- `owner`: Repository owner (string, required)
- `page`: Page number for pagination (min 1) (number, optional)
- `perPage`: Results per page for pagination (min 1, max 100) (number, optional)
8 changes: 8 additions & 0 deletions evals/.gitignore
@@ -0,0 +1,8 @@
.venv/
__pycache__/
out/
.env
.env.local
# Captured fixtures contain real (large) GitHub data; keep only the example committed.
fixtures/*.json
!fixtures/example_*.json
206 changes: 206 additions & 0 deletions evals/README.md
@@ -0,0 +1,206 @@
# Phase 1 evals: context cost vs. benefit of output schemas + field filtering

Small, deterministic scripts to get the numbers we care about, compared across
**three shippable configurations**:

| Scenario | output schema | `fields` param | what it represents |
|----------|:---:|:---:|--------------------|
| **S1 baseline** | ✗ | ✗ | today's behavior (no experiment) |
| **S2 schema+fields** | ✓ | ✓ | the full experiment |
| **S3 fields-only** | ✗ | ✓ | hypothesized sweet spot: the model can filter without carrying the heavy schema |

The intuition behind **S3**: the model doesn't need the full output schema to
filter — it just needs the `fields` param (whose enum already tells it which
fields exist). So S3 may capture almost all of the benefit at a fraction of the
fixed cost.

The two numbers we derive:

1. **Fixed tax** — extra tokens added to the `tools/list` payload (paid once at
client init) by each scenario.
2. **Per-call savings** — tokens saved when the model filters a tool response to
a subset of fields.

From those: **break-even calls = fixed_tax / avg_savings_per_call**, computed per
scenario.
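
As a minimal sketch of the break-even formula above (the function name and the token figures are hypothetical placeholders, not measured results and not the code in `response_savings.py`):

```python
import math


def break_even_calls(fixed_tax: int, avg_saved_per_call: float) -> float:
    """Filtered calls needed to recoup the one-time tools/list tax."""
    if avg_saved_per_call <= 0:
        # Filtering that saves nothing never pays back the tax.
        return math.inf
    return fixed_tax / avg_saved_per_call


# Placeholder numbers for illustration only.
example_taxes = {"S2 schema+fields": 6000, "S3 fields-only": 900}
avg_saved = 1500.0
for name, tax in example_taxes.items():
    print(f"{name}: break-even after {break_even_calls(tax, avg_saved):.1f} calls")
```

Returning infinity for non-positive savings keeps a scenario that never pays off from looking like a zero-call break-even.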

No LLM required for the offline numbers (tokenization falls back to a chars/4
proxy if `tiktoken` isn't installed). The online A/B (step 4) runs the three
scenarios through a real model over realistic multi-tool sessions.
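
The tokenizer fallback could look roughly like the sketch below. This is an illustration, not the contents of `_tokenize.py`: the function name, the `approx` flag, the rounding, and the choice of the `cl100k_base` encoding are all assumptions.

```python
import math


def count_tokens(text: str, approx: bool = False) -> int:
    """Count tokens with tiktoken when available, else use a chars/4 proxy."""
    if not approx:
        try:
            import tiktoken  # optional dependency

            # Encoding choice is an assumption; state the one you actually use.
            enc = tiktoken.get_encoding("cl100k_base")
            return len(enc.encode(text))
        except Exception:
            # tiktoken is missing, or its encoding data is unavailable offline.
            pass
    # Proxy: roughly four characters per token, rounded up.
    return math.ceil(len(text) / 4)


print(count_tokens("abcdefgh", approx=True))
```

Because the proxy and the real tokenizer disagree in absolute terms, compare deltas produced by the same counting mode rather than mixing the two.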

## Setup

```bash
cd evals
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt # tiktoken (+ anthropic for the online A/B)
```

### Token & secrets

Live tool calls and the online A/B need a real GitHub token. **Never hardcode or
commit it.** Provide it via the environment only; `.env*` is gitignored:

```bash
export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxx # read-only scope is enough
```

A dummy token is used automatically for `capture_tools.py` (it never calls the API).


## 1) Fixed tax

Capture the tool list WITH the experiment enabled, then let the analyzer derive
all three scenarios by stripping `outputSchema` and/or the `fields` property:

```bash
python3 capture_tools.py --features output_schemas --toolsets all \
--out out/tools.treatment.json

python3 fixed_tax.py --tools out/tools.treatment.json --json-out out/fixed_tax.json
# add --approx if offline without tiktoken
```

`fixed_tax.py` prints the payload tokens for each scenario (S1/S2/S3), the fixed
tax of each vs the S1 baseline, a component breakdown (schema vs `fields`), and a
per-tool breakdown. `--json-out` writes the per-scenario taxes for step 3.
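
The scenario derivation described above can be sketched as follows. It assumes each entry in the captured list follows the MCP tool shape (`inputSchema` with a `properties` map, plus an optional `outputSchema`); the helper names and the sample tool are hypothetical, and character counts stand in for tokens.

```python
import copy
import json


def derive_scenario(tools, keep_schema: bool, keep_fields: bool):
    """Return a copy of the tool list with outputSchema and/or `fields` stripped."""
    derived = copy.deepcopy(tools)  # never mutate the captured treatment payload
    for tool in derived:
        if not keep_schema:
            tool.pop("outputSchema", None)
        if not keep_fields:
            props = tool.get("inputSchema", {}).get("properties", {})
            props.pop("fields", None)
    return derived


def payload_chars(tools) -> int:
    # Compact JSON length as a stand-in for a token count.
    return len(json.dumps(tools, separators=(",", ":")))


# Hypothetical single-tool capture, for illustration only.
treatment = [{
    "name": "list_issues",
    "inputSchema": {"type": "object", "properties": {
        "owner": {"type": "string"},
        "fields": {"type": "array",
                   "items": {"type": "string", "enum": ["number", "title", "state"]}},
    }},
    "outputSchema": {"type": "object", "properties": {"issues": {"type": "array"}}},
}]

scenarios = {
    "S1": derive_scenario(treatment, keep_schema=False, keep_fields=False),
    "S2": derive_scenario(treatment, keep_schema=True, keep_fields=True),
    "S3": derive_scenario(treatment, keep_schema=False, keep_fields=True),
}
baseline = payload_chars(scenarios["S1"])
for name, tools in scenarios.items():
    print(name, payload_chars(tools) - baseline)
```

Deriving all three payloads from one capture keeps tool descriptions and ordering identical, so the difference between scenarios is only the stripped parts.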

> Tip: measure with the `--toolsets` you'd actually ship. The tax is fixed in
> absolute tokens but its *relative* size shrinks the more tools you expose.

## 2) Per-call savings (real data)

Capture real full vs filtered responses for the affected tools, straight from
live GitHub (read-only), then token-diff them:

```bash
export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxx
python3 capture_fixtures.py --owner github --repo github-mcp-server --org github
python3 response_savings.py --fixed-tax-json out/fixed_tax.json
```

`capture_fixtures.py` calls each tool twice (no `fields` vs a small subset) and
writes `fixtures/<tool>.full.json` / `.filtered.json`. Use a busy repo for a
realistic upper bound, **and** a small repo for the lower bound — report a range.
An example pair is included so the script also runs offline.
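
To make the full-versus-filtered comparison concrete, here is a small offline sketch. The issue records are invented, the top-level key filter is only an assumption about how the server applies `fields`, and a chars/4 proxy replaces a real tokenizer.

```python
import json


def filter_fields(items, fields):
    """Keep only the requested top-level keys of each item."""
    return [{k: item[k] for k in fields if k in item} for item in items]


def approx_tokens(payload) -> int:
    # chars/4 proxy over the serialized response
    return len(json.dumps(payload)) // 4


# Invented sample response, for illustration only.
full = [
    {"number": 1, "title": "Bug", "body": "Long description " * 20, "state": "open"},
    {"number": 2, "title": "Feature", "body": "Another long description " * 20, "state": "closed"},
]
filtered = filter_fields(full, ["number", "title"])
saved = approx_tokens(full) - approx_tokens(filtered)
print(f"full={approx_tokens(full)} filtered={approx_tokens(filtered)} saved={saved}")
```

How much is saved depends heavily on which fields are dropped; removing a long `body` dominates here, which is one reason to report a range across repos.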

## 3) Break-even (per scenario)

`response_savings.py --fixed-tax-json out/fixed_tax.json` prints break-even for
both taxed scenarios:

```
break_even_calls = scenario_fixed_tax / avg_saved_per_call
```

Interpretation:
- A session with **more** filtered list/search calls than `break_even_calls` is
net-positive on context for that scenario.
- **S3 (fields-only)** has a far smaller tax than S2, so its break-even is tiny —
this is the configuration to scrutinize first.
- Short sessions (few tool calls) are where the fixed tax dominates — call this
out in the writeup.
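
The interpretation above amounts to a simple net calculation per session. The sketch below uses a hypothetical helper and placeholder numbers to show where the sign flips:

```python
def net_context_tokens(fixed_tax: int, avg_saved_per_call: float, filtered_calls: int) -> float:
    """Positive: the session saved context overall. Negative: the fixed tax dominated."""
    return filtered_calls * avg_saved_per_call - fixed_tax


# Placeholder tax and savings; substitute the measured values from steps 1 and 2.
for calls in (1, 3, 10):
    print(calls, net_context_tokens(fixed_tax=900, avg_saved_per_call=300, filtered_calls=calls))
```

Only calls that actually pass `fields` count toward `filtered_calls`, so a session with many tool calls can still be net-negative if the model rarely filters.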

## 4) Online A/B (Phase 2 — real multi-tool sessions, all 3 scenarios)

Runs the same tasks through a real model across all three scenarios, measuring
cumulative prompt tokens. This is the only way to confirm the model actually
*uses* `fields` and to get the true net effect — including whether S3 really is
the sweet spot.

**Use a model with a large context window.** The harness talks to any
**OpenAI-compatible** endpoint, so you don't need a paid third-party key:

- **GitHub Models** (default) — authenticated with your GitHub token, no extra
key. Convenient, but the free tier caps requests at **16,000 tokens**, so large
unfiltered responses error out (`413`). Fine for a smoke test; **not** for the
headline numbers.
- **A Copilot / internal proxy** — point `--base-url` at any OpenAI-compatible
endpoint you already have access to and pass its token via `--api-key-env`. This
is how to run a large-context model (e.g. `claude-opus-4-6`) with no request cap
and no out-of-pocket billing.

```bash
export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxx # always: the MCP server uses this

# Smoke test: GitHub Models gpt-5 (16k cap — expect overflow failures on big repos)
python3 schema_fields_eval.py --model openai/gpt-5 --toolsets issues,pull_requests --repeat 3

# Recommended: a large-context model via your OpenAI-compatible endpoint
export COPILOT_TOKEN=... # whatever token that endpoint needs
python3 schema_fields_eval.py \
--base-url https://your-openai-compatible-endpoint/v1 \
--api-key-env COPILOT_TOKEN \
--model claude-opus-4-6 --toolsets issues,pull_requests --repeat 3

# --base-url <url> any OpenAI-compatible endpoint
# --api-key-env VAR env var holding that endpoint's token
# --repo owner/repo target repo for the tasks (default cli/cli; see below)
# --tasks-file mytasks.txt one task per line, optionally 'tag<TAB>task'
```
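
For the `--tasks-file` format, a parser along these lines would handle both plain and tagged lines. This is a guess at the behavior, not the harness's code: skipping blank lines and using `None` for a missing tag are assumptions.

```python
def parse_tasks(text: str):
    """Parse one task per line, optionally prefixed with 'tag<TAB>'."""
    tasks = []
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines are ignored
        tag, sep, task = line.partition("\t")
        if sep:
            tasks.append((tag.strip(), task.strip()))
        else:
            tasks.append((None, line.strip()))  # untagged task
    return tasks


sample = "narrow\tList the titles of open issues\nSummarize recent pull requests\n"
print(parse_tasks(sample))
```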

> Target a readable repo: the tasks run against a **live** repo, so point `--repo`
> at a large, **public, SAML-free** repo a plain PAT can read (default `cli/cli`).
> If you aim at a SAML-protected org repo (e.g. `github/github-mcp-server`), every
> call 403s, the model only ever sees tiny error payloads, and the `fields` arms
> look like pure overhead because there's nothing to filter — the experiment then
> measures only the fixed schema/param tax, not the filtering payoff. Such runs
> are now flagged as failures (a tool returning `isError`) and excluded from the
> token comparison rather than silently counted as valid.

The server is booted with `--features output_schemas` so the **S2** arm has a real
schema to embed; the `fields` param and server-side filtering are present in every
arm regardless, so only what each arm shows the *model* differs.

It prints, per scenario: cumulative prompt/completion tokens, tool-call counts,
`fields` adoption, the net delta vs the S1 baseline, and a **per-task-type
breakdown** (narrow / full / neutral) so you can see *where* each config helps.
Only task-runs where all three arms succeeded count toward the token comparison
(so on the capped GitHub Models endpoint, the biggest filtering wins — tasks where
the unfiltered baseline overflowed — show up in the failure counts, not the token
table; another reason to use a large-context endpoint). Use `--repeat >= 3` to
average out model nondeterminism. Per-run detail is written to
`out/schema_fields_eval.jsonl`.
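
The "all three arms succeeded" rule can be sketched as below. The record keys (`task`, `repeat`, `scenario`, `ok`, `prompt_tokens`) are hypothetical and may not match what the harness writes to the JSONL file.

```python
from collections import defaultdict

ARMS = ("S1", "S2", "S3")


def comparable_totals(records):
    """Sum prompt tokens per arm over task-runs where every arm succeeded."""
    runs = defaultdict(dict)
    for rec in records:
        runs[(rec["task"], rec["repeat"])][rec["scenario"]] = rec
    totals = dict.fromkeys(ARMS, 0)
    kept = 0
    for arms in runs.values():
        # Drop the whole task-run if any arm is missing or failed.
        if all(a in arms and arms[a]["ok"] for a in ARMS):
            kept += 1
            for a in ARMS:
                totals[a] += arms[a]["prompt_tokens"]
    return kept, totals


# Invented records, for illustration only.
records = [
    {"task": "t1", "repeat": 0, "scenario": "S1", "ok": True, "prompt_tokens": 1000},
    {"task": "t1", "repeat": 0, "scenario": "S2", "ok": True, "prompt_tokens": 900},
    {"task": "t1", "repeat": 0, "scenario": "S3", "ok": True, "prompt_tokens": 800},
    {"task": "t2", "repeat": 0, "scenario": "S1", "ok": False, "prompt_tokens": 0},
    {"task": "t2", "repeat": 0, "scenario": "S2", "ok": True, "prompt_tokens": 700},
    {"task": "t2", "repeat": 0, "scenario": "S3", "ok": True, "prompt_tokens": 600},
]
kept, totals = comparable_totals(records)
print(kept, {arm: totals[arm] - totals["S1"] for arm in ARMS})
```

Dropping a task-run whenever one arm fails keeps the comparison paired, at the cost of hiding the cases where the baseline overflowed.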

> Task design matters: the default tasks are intentionally **neutral** (they do
> not tell the model to "return only X"). Biasing prompts toward terse answers
> would inflate the filtering arms. Keep a balanced mix of narrow/full/neutral.

> Cost control: the default toolsets are narrow on purpose. The relevant
> differences live in the affected tools, so you don't need all 79 tools loaded
> each turn. Use `fixed_tax.py` (all toolsets) for the init-tax number and the
> online run for the savings/net dynamic.



## Honesty notes

- Tokenizers differ across providers; report **deltas** and state the tokenizer.
- Step 2 assumes the model actually uses `fields`. That adoption rate can only be
confirmed by the Phase 2 online A/B — Phase 1 is an upper bound on benefit.
- Real response sizes vary a lot by repo; capture fixtures from both a small and a
large/busy repo and report a range, not a single number.
- The `fields` param and the server-side filtering are **not** gated by the
`output_schemas` feature flag in the server — only the `outputSchema` and the
response's `structuredContent` are. So S1 (baseline) here means "pre-experiment
main", and "flag off" in production today would still ship the `fields` param.
Reconcile the scenario you measure with the toggle you'd actually ship.
- With output schemas on, each tool result also carries a `structuredContent`
duplicate of the payload. The online A/B forwards only the text content to the
model (so all arms see identical response bytes); a client that also feeds
`structuredContent` to the model would pay more in the S2 arm. State this
assumption when you report.
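
The text-only forwarding described in the last note can be illustrated with a small helper. It assumes the MCP tool-result shape (a `content` list of typed blocks plus an optional `structuredContent`); the function name and sample result are hypothetical.

```python
def text_for_model(tool_result: dict) -> str:
    """Join the text blocks of a tool result, ignoring structuredContent."""
    parts = [
        block["text"]
        for block in tool_result.get("content", [])
        if block.get("type") == "text"
    ]
    return "\n".join(parts)


# Invented result, for illustration only.
result = {
    "content": [{"type": "text", "text": '[{"number": 1}]'}],
    "structuredContent": {"issues": [{"number": 1}]},  # duplicate payload, not forwarded
    "isError": False,
}
print(text_for_model(result))
```

A client that serialized `structuredContent` into the prompt as well would roughly double the response cost in the S2 arm, which is the assumption to state when reporting.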

## Files

| File | Purpose |
|------|---------|
| `capture_tools.py` | Boot server over stdio, dump `tools/list` result |
| `fixed_tax.py` | Per-scenario token-diff (S1/S2/S3); `--json-out` for break-even |
| `capture_fixtures.py` | Capture real full/filtered tool responses (live GitHub) |
| `response_savings.py` | Token-diff full vs filtered responses; per-scenario break-even |
| `schema_fields_eval.py` | 3-scenario (S1/S2/S3) multi-tool agent eval, prompt-token accounting |
| `_mcp_client.py` | Shared MCP stdio client |
| `_tokenize.py` | Tokenizer helper (tiktoken or chars/4 fallback) |
| `fixtures/` | Response pairs (example + captured) |
