github · tommaso-moro · Jun 9, 2026 · Jun 10, 2026 · Jun 16, 2026
@@ -859,12 +859,14 @@ The following sets of tools are available:
 - **list_issue_types** - List available issue types
   - **Required OAuth Scopes**: `read:org`
   - **Accepted OAuth Scopes**: `admin:org`, `read:org`, `write:org`
+  - `fields`: Subset of issue type fields to return for each issue type. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
   - `owner`: The organization owner of the repository (string, required)
 
 - **list_issues** - List issues
   - **Required OAuth Scopes**: `repo`
   - `after`: Cursor for pagination. Use the endCursor from the previous page's PageInfo for GraphQL APIs. (string, optional)
   - `direction`: Order direction. If provided, the 'orderBy' also needs to be provided. (string, optional)
+  - `fields`: Subset of issue fields to return for each issue. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
   - `labels`: Filter by labels (string[], optional)
   - `orderBy`: Order issues by field. If provided, the 'direction' also needs to be provided. (string, optional)
   - `owner`: Repository owner (string, required)
@@ -875,6 +877,7 @@ The following sets of tools are available:
 
 - **search_issues** - Search issues
   - **Required OAuth Scopes**: `repo`
+  - `fields`: Subset of issue fields to return for each result. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
   - `order`: Sort order (string, optional)
   - `owner`: Optional repository owner. If provided with repo, only issues for this repository are listed. (string, optional)
   - `page`: Page number for pagination (min 1) (number, optional)
@@ -1074,6 +1077,7 @@ The following sets of tools are available:
   - **Required OAuth Scopes**: `repo`
   - `base`: Filter by base branch (string, optional)
   - `direction`: Sort direction (string, optional)
+  - `fields`: Subset of pull request fields to return for each pull request. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
   - `head`: Filter by head user/org and branch (string, optional)
   - `owner`: Repository owner (string, required)
   - `page`: Page number for pagination (min 1) (number, optional)
@@ -1123,6 +1127,7 @@ The following sets of tools are available:
 
 - **search_pull_requests** - Search pull requests
   - **Required OAuth Scopes**: `repo`
+  - `fields`: Subset of pull request fields to return for each result. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
   - `order`: Sort order (string, optional)
   - `owner`: Optional repository owner. If provided with repo, only pull requests for this repository are listed. (string, optional)
   - `page`: Page number for pagination (min 1) (number, optional)
@@ -1232,6 +1237,7 @@ The following sets of tools are available:
 
 - **list_branches** - List branches
   - **Required OAuth Scopes**: `repo`
+  - `fields`: Subset of branch fields to return for each branch. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
   - `owner`: Repository owner (string, required)
   - `page`: Page number for pagination (min 1) (number, optional)
   - `perPage`: Results per page for pagination (min 1, max 100) (number, optional)
@@ -1240,6 +1246,7 @@ The following sets of tools are available:
 - **list_commits** - List commits
   - **Required OAuth Scopes**: `repo`
   - `author`: Author username or email address to filter commits by (string, optional)
+  - `fields`: Subset of commit fields to return for each commit. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
   - `owner`: Repository owner (string, required)
   - `page`: Page number for pagination (min 1) (number, optional)
   - `path`: Only commits containing this file path will be returned (string, optional)
@@ -1251,13 +1258,15 @@ The following sets of tools are available:
 
 - **list_releases** - List releases
   - **Required OAuth Scopes**: `repo`
+  - `fields`: Subset of release fields to return for each release. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
   - `owner`: Repository owner (string, required)
   - `page`: Page number for pagination (min 1) (number, optional)
   - `perPage`: Results per page for pagination (min 1, max 100) (number, optional)
   - `repo`: Repository name (string, required)
 
 - **list_tags** - List tags
   - **Required OAuth Scopes**: `repo`
+  - `fields`: Subset of tag fields to return for each tag. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
   - `owner`: Repository owner (string, required)
   - `page`: Page number for pagination (min 1) (number, optional)
   - `perPage`: Results per page for pagination (min 1, max 100) (number, optional)
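
The `fields` parameter added to the list/search tools above returns a subset of fields per item, or everything when it is omitted. A minimal Python sketch of that behavior follows, purely as an illustration: the helper name `filter_fields`, the sample issue data, and the choice to filter top-level keys only are assumptions, not the server's actual implementation (which may, for example, handle unknown field names differently).

```python
from typing import Any, Iterable, Optional


def filter_fields(
    items: Iterable[dict[str, Any]],
    fields: Optional[list[str]] = None,
) -> list[dict[str, Any]]:
    """Keep only the requested top-level keys of each item.

    Hypothetical helper: when `fields` is None (i.e. omitted), every field is
    kept, mirroring the documented "If omitted, all fields are returned".
    """
    if fields is None:
        # Shallow-copy so callers cannot mutate the original items by accident.
        return [dict(item) for item in items]
    wanted = set(fields)
    return [{k: v for k, v in item.items() if k in wanted} for item in items]


# Made-up sample data standing in for a list_issues response.
issues = [
    {"number": 1, "title": "Crash on start", "body": "long text ...", "state": "open"},
    {"number": 2, "title": "Typo in docs", "body": "more text ...", "state": "closed"},
]

narrow = filter_fields(issues, ["number", "title"])  # only two keys per issue
everything = filter_fields(issues)                   # `fields` omitted: all keys
```

Dropping large fields such as `body` is where the response-size reduction would come from; small fields contribute little either way.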

@@ -0,0 +1,8 @@
+.venv/
+__pycache__/
+out/
+.env
+.env.local
+# Captured fixtures contain real (large) GitHub data; keep only the example committed.
+fixtures/*.json
+!fixtures/example_*.json
@@ -0,0 +1,206 @@
+# Phase 1 evals: context cost vs. benefit of output schemas + field filtering
+
+Small, deterministic scripts to get the numbers we care about, compared across
+**three shippable configurations**:
+
+| Scenario | output schema | `fields` param | what it represents |
+|----------|:---:|:---:|--------------------|
+| **S1 baseline** | ✗ | ✗ | today's behavior (no experiment) |
+| **S2 schema+fields** | ✓ | ✓ | the full experiment |
+| **S3 fields-only** | ✗ | ✓ | hypothesized sweet spot: the model can filter without carrying the heavy schema |
+
+The intuition behind **S3**: the model doesn't need the full output schema to
+filter — it just needs the `fields` param (whose enum already tells it which
+fields exist). So S3 may capture almost all of the benefit at a fraction of the
+fixed cost.
+
+The two numbers we derive:
+
+1. **Fixed tax** — extra tokens added to the `tools/list` payload (paid once at
+   client init) by each scenario.
+2. **Per-call savings** — tokens saved when the model filters a tool response to
+   a subset of fields.
+
+From those: **break_even_calls = scenario_fixed_tax / avg_saved_per_call**,
+computed for each taxed scenario (S2 and S3).
+
+No LLM required for the offline numbers (tokenization falls back to a chars/4
+proxy if `tiktoken` isn't installed). The online A/B (step 4) runs the three
+scenarios through a real model over realistic multi-tool sessions.
+
+## Setup
+
+```bash
+cd evals
+python3 -m venv .venv && source .venv/bin/activate
+pip install -r requirements.txt          # tiktoken (+ anthropic for the online A/B)
+```
+
+### Token & secrets
+
+Live tool calls and the online A/B need a real GitHub token. **Never hardcode or
+commit it.** Provide it via the environment only; `.env*` is gitignored:
+
+```bash
+export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxx     # read-only scope is enough
+```
+
+A dummy token is used automatically for `capture_tools.py` (it never calls the API).
+
+
+## 1) Fixed tax
+
+Capture the tool list WITH the experiment enabled, then let the analyzer derive
+all three scenarios by stripping `outputSchema` and/or the `fields` property:
+
+```bash
+python3 capture_tools.py --features output_schemas --toolsets all \
+    --out out/tools.treatment.json
+
+python3 fixed_tax.py --tools out/tools.treatment.json --json-out out/fixed_tax.json
+# add --approx if offline without tiktoken
+```
+
+`fixed_tax.py` prints the payload tokens for each scenario (S1/S2/S3), the fixed
+tax of each vs the S1 baseline, a component breakdown (schema vs `fields`), and a
+per-tool breakdown. `--json-out` writes the per-scenario taxes for step 3.
+
+> Tip: measure with the `--toolsets` you'd actually ship. The tax is fixed in
+> absolute tokens, but its share of the total `tools/list` payload shrinks as you
+> expose more tools.
+
+## 2) Per-call savings (real data)
+
+Capture real full vs filtered responses for the affected tools, straight from
+live GitHub (read-only), then token-diff them:
+
+```bash
+export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxx
+python3 capture_fixtures.py --owner github --repo github-mcp-server --org github
+python3 response_savings.py --fixed-tax-json out/fixed_tax.json
+```
+
+`capture_fixtures.py` calls each tool twice (no `fields` vs a small subset) and
+writes `fixtures/<tool>.full.json` / `.filtered.json`. Use a busy repo for a
+realistic upper bound, **and** a small repo for the lower bound — report a range.
+An example pair is included so the script also runs offline.
+
+## 3) Break-even (per scenario)
+
+`response_savings.py --fixed-tax-json out/fixed_tax.json` prints the break-even for
+both taxed scenarios (S2 and S3):
+
+```
+break_even_calls = scenario_fixed_tax / avg_saved_per_call
+```
+
+Interpretation:
+- A session with **more** filtered list/search calls than `break_even_calls` is
+  net-positive on context for that scenario.
+- **S3 (fields-only)** is expected to carry a far smaller tax than S2, so its
+  break-even should be much lower; this is the configuration to scrutinize first.
+- Short sessions (few tool calls) are where the fixed tax dominates — call this
+  out in the writeup.
+
+## 4) Online A/B (Phase 2 — real multi-tool sessions, all 3 scenarios)
+
+Runs the same tasks through a real model across all three scenarios, measuring
+cumulative prompt tokens. This is the only way to confirm the model actually
+*uses* `fields` and to get the true net effect — including whether S3 really is
+the sweet spot.
+
+**Use a model with a large context window.** The harness talks to any
+**OpenAI-compatible** endpoint, so you don't need a paid third-party key:
+
+- **GitHub Models** (default) — authenticated with your GitHub token, no extra
+  key. Convenient, but the free tier caps requests at **16,000 tokens**, so large
+  unfiltered responses error out (`413`). Fine for a smoke test; **not** for the
+  headline numbers.
+- **A Copilot / internal proxy** — point `--base-url` at any OpenAI-compatible
+  endpoint you already have access to and pass its token via `--api-key-env`. This
+  is how to run a large-context model (e.g. `claude-opus-4-6`) with no request cap
+  and no out-of-pocket billing.
+
+```bash
+export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxx     # always: the MCP server uses this
+
+# Smoke test: GitHub Models gpt-5 (16k cap — expect overflow failures on big repos)
+python3 schema_fields_eval.py --model openai/gpt-5 --toolsets issues,pull_requests --repeat 3
+
+# Recommended: a large-context model via your OpenAI-compatible endpoint
+export COPILOT_TOKEN=...                          # whatever token that endpoint needs
+python3 schema_fields_eval.py \
+    --base-url https://your-openai-compatible-endpoint/v1 \
+    --api-key-env COPILOT_TOKEN \
+    --model claude-opus-4-6 --toolsets issues,pull_requests --repeat 3
+
+# --base-url <url>          any OpenAI-compatible endpoint
+# --api-key-env VAR         env var holding that endpoint's token
+# --repo owner/repo         target repo for the tasks (default cli/cli; see below)
+# --tasks-file mytasks.txt  one task per line, optionally 'tag<TAB>task'
+```
+
+> Target a readable repo: the tasks run against a **live** repo, so point `--repo`
+> at a large, **public, SAML-free** repo a plain PAT can read (default `cli/cli`).
+> If you aim at a SAML-protected org repo (e.g. `github/github-mcp-server`), every
+> call 403s, the model only ever sees tiny error payloads, and the `fields` arms
+> look like pure overhead because there's nothing to filter — the experiment then
+> measures only the fixed schema/param tax, not the filtering payoff. Such runs
+> are now flagged as failures (a tool returning `isError`) and excluded from the
+> token comparison rather than silently counted as valid.
+
+The server is booted with `--features output_schemas` so the **S2** arm has a real
+schema to embed; the `fields` param and server-side filtering are present in every
+arm regardless, so only what each arm shows the *model* differs.
+
+It prints, per scenario: cumulative prompt/completion tokens, tool-call counts,
+`fields` adoption, the net delta vs the S1 baseline, and a **per-task-type
+breakdown** (narrow / full / neutral) so you can see *where* each config helps.
+Only task-runs where all three arms succeeded count toward the token comparison.
+On the capped GitHub Models endpoint, this means the biggest filtering wins (tasks
+where the unfiltered baseline overflowed) show up in the failure counts, not the
+token table, which is another reason to use a large-context endpoint. Use
+`--repeat 3` or higher to average out model nondeterminism. Per-run detail is
+written to `out/schema_fields_eval.jsonl`.
+
+> Task design matters: the default tasks are intentionally **neutral** (they do
+> not tell the model to "return only X"). Biasing prompts toward terse answers
+> would inflate the filtering arms. Keep a balanced mix of narrow/full/neutral.
+
+> Cost control: the default toolsets are narrow on purpose. The relevant
+> differences live in the affected tools, so you don't need all 79 tools loaded
+> each turn. Use `fixed_tax.py` (all toolsets) for the init-tax number and the
+> online run for the savings/net dynamic.
+
+
+
+## Honesty notes
+
+- Tokenizers differ across providers; report **deltas** and state the tokenizer.
+- Step 2 assumes the model actually uses `fields`. That adoption rate can only be
+  confirmed by the Phase 2 online A/B — Phase 1 is an upper bound on benefit.
+- Real response sizes vary a lot by repo; capture fixtures from both a small and a
+  large/busy repo and report a range, not a single number.
+- The `fields` param and the server-side filtering are **not** gated by the
+  `output_schemas` feature flag in the server — only the `outputSchema` and the
+  response's `structuredContent` are. So S1 (baseline) here means "pre-experiment
+  main", and "flag off" in production today would still ship the `fields` param.
+  Reconcile the scenario you measure with the toggle you'd actually ship.
+- With output schemas on, each tool result also carries a `structuredContent`
+  duplicate of the payload. The online A/B forwards only the text content to the
+  model (so all arms see identical response bytes); a client that also feeds
+  `structuredContent` to the model would pay more in the S2 arm. State this
+  assumption when you report.
+
+## Files
+
+| File | Purpose |
+|------|---------|
+| `capture_tools.py` | Boot server over stdio, dump `tools/list` result |
+| `fixed_tax.py` | Per-scenario token-diff (S1/S2/S3); `--json-out` for break-even |
+| `capture_fixtures.py` | Capture real full/filtered tool responses (live GitHub) |
+| `response_savings.py` | Token-diff full vs filtered responses; per-scenario break-even |
+| `schema_fields_eval.py` | 3-scenario (S1/S2/S3) multi-tool agent eval, prompt-token accounting |
+| `_mcp_client.py` | Shared MCP stdio client |
+| `_tokenize.py` | Tokenizer helper (tiktoken or chars/4 fallback) |
+| `fixtures/` | Response pairs (example + captured) |
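
The break-even arithmetic described in the evals README can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in `response_savings.py` or `_tokenize.py`: the function names, the round-up behavior of the chars/4 proxy, the sample payloads, and the per-scenario tax values are all hypothetical.

```python
import json
import math


def approx_tokens(text: str) -> int:
    """chars/4 proxy for token count; rounding up is an assumption here."""
    return math.ceil(len(text) / 4)


def saved_tokens(full: list, filtered: list) -> int:
    """Token difference between a full and a filtered JSON response."""
    return approx_tokens(json.dumps(full)) - approx_tokens(json.dumps(filtered))


def break_even_calls(fixed_tax: float, avg_saved_per_call: float) -> float:
    """Filtered calls needed before the fixed tax is paid back.

    Returns infinity when filtering saves nothing, since the tax is then
    never recovered.
    """
    if avg_saved_per_call <= 0:
        return math.inf
    return fixed_tax / avg_saved_per_call


# Made-up response pair: ten issues, with and without a large `body` field.
full = [{"number": n, "title": "t" * 20, "body": "b" * 400} for n in range(10)]
filtered = [{"number": i["number"], "title": i["title"]} for i in full]
saved = saved_tokens(full, filtered)

# Hypothetical fixed taxes in tokens; real values come from fixed_tax.py.
hypothetical_taxes = {"S2": 6000, "S3": 400}
for scenario, tax in hypothetical_taxes.items():
    print(scenario, round(break_even_calls(tax, saved), 2))
```

With real measurements, `saved` would be averaged over the captured fixture pairs rather than taken from a single synthetic payload, and the result reported as a range across small and busy repos.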
+