LLM inference callback support #1689
Draft
SteveSandersonMS wants to merge 16 commits into
Adds an opt-in llmInference config to CopilotClientOptions that lets SDK consumers register a callback the runtime invokes whenever it would otherwise issue an outbound non-streaming LLM HTTP request itself.
v1 scope is TS-only/non-streaming, mirroring the runtime support added in github/copilot-agent-runtime. Streaming SSE and WebSocket transports are out of scope for v1 and continue to bypass the callback.
- New `LlmInferenceProvider` interface with a single `onLlmRequest` method.
- `createLlmInferenceAdapter` converts the provider into the wire-shape `LlmInferenceHandler` consumed by the RPC dispatcher.
- Client wiring: `llmInference.setProvider` is sent on connect; per-session adapter is attached alongside the existing sessionFs hook.
- New `llm_inference.e2e.test.ts` exercises the full RPC round-trip against the runtime.
Resolves github/copilot-sdk-internal#88
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Matches the runtime move of `llmInference.httpRequest` out of the session-scoped client API and onto a new `clientGlobal` schema root.
- Codegen emits a new `registerClientGlobalApiHandlers` alongside the existing `registerClientSessionApiHandlers`. Handlers passed to it are dispatched directly (no per-session `getHandlers` callback) and carry no implicit sessionId; sessionId, when present, is just a payload field on the call.
- `CopilotClient` now constructs the LLM inference adapter once and registers it process-wide via `registerClientGlobalApiHandlers` during connection setup. The per-session `setupLlmInference` path and the `SessionConfigBase.createLlmInferenceProvider` override are removed, because there is no longer any per-session notion of which provider to use.
- `LlmInferenceConfig.createLlmInferenceProvider` is now `() => LlmInferenceProvider` (was `(session) => ...`).
- `LlmInferenceRequest` exposes the new optional `sessionId` field so consumers can correlate requests with a runtime session when one is in scope.
E2E test updated to verify the global registration works and that sessionId is populated on in-session traffic.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
With the Rust runtime intercept chokepoint in place, every model-layer HTTP request, including /models and /models/session, is now dispatched through the SDK callback. Update the e2e test to:
- Stub realistic responses for non-streaming model catalog and session endpoints (so the runtime can proceed past model resolution).
- Hard-assert the catalog request is intercepted (no more 'either-or' fallback for the pre-rust-intercept state).
Streaming inference requests still pass through to the recorded CAPI proxy; a fully-mocked end-to-end inference test will land alongside the streaming-intercept commit.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extends LlmInferenceProvider with an optional onLlmStreamRequest method that returns a response head synchronously and pushes body chunks via the provided sink. The adapter implements the generated httpStreamStart RPC method and forwards chunks back to the runtime via the typed server-RPC client (llmInference.streamChunk / streamEnd).
Adds a fully-mocked e2e test (test/e2e/llm_inference_stream.e2e.test.ts) that drives a complete user->assistant turn through the callback alone: the runtime hits the callback for /models, /models/session, and the chat completion itself, and the assistant text returned to the SDK consumer is the synthetic text supplied by the stub.
- nodejs/src/llmInferenceProvider.ts: LlmInferenceStreamSink, onLlmStreamRequest, httpStreamStart adapter
- nodejs/src/client.ts: pass a lazy server-RPC accessor into the adapter
- nodejs/src/index.ts: re-export new types
- nodejs/test/e2e/llm_inference_stream.e2e.test.ts: full-mock e2e
- nodejs/src/generated/*, python/*, go/*, rust/*: codegen for new RPC methods
- dotnet/src/Generated/*: codegen for new RPC methods
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds test/e2e/llm_inference_errors.e2e.test.ts that wires a callback whose inference handler throws a synthetic transport error and verifies the failure surfaces to the SDK consumer (the call does not hang and any error caught is non-empty). Confirms the runtime's existing retry / error reporting path handles callback-side failures the same way it handles real transport failures.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirrors the runtime-side cleanup: the callback wire no longer carries providerType / endpointKind / wireApi / transport / modelId. The adapter stops forwarding these fields, e2e tests filter by URL instead of metadata, and the missing LlmInferenceStreamSink / LlmInferenceStreamStartResponse re-exports in types.ts are added so index.ts type-checks cleanly.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
[Phase 3] Realign the Node SDK with the runtime's new four-method chunk
protocol. One unified provider callback:
    interface LlmInferenceProvider {
      onLlmRequest(req: LlmInferenceRequest): Promise<void>;
    }
LlmInferenceRequest exposes:
* url / method / headers / sessionId
* requestBody: AsyncIterable<Uint8Array> // body delivered as chunks
* responseBody: LlmInferenceResponseSink // start/write/end/error
The sink enforces start -> 0..N writes -> exactly one of end/error and
maps each call to the corresponding httpResponseStart / httpResponseChunk
RPC. createLlmInferenceAdapter maintains a per-requestId state map; the
generated httpRequestStart handler registers state synchronously and
fires onLlmRequest in the background, so the runtime's RPC reply isn't
gated on consumer I/O.
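
The ordering rule for the sink (start, then any number of writes, then exactly one of end or error) can be sketched as a small state machine. This is an illustration only, not the adapter's code: `SketchResponseSink`, `SketchWire`, and the chunk shape are hypothetical names, and rejecting a terminal call before `start` is an assumption of this sketch.

```typescript
// Hypothetical names throughout; not the SDK's actual implementation.
type SketchSinkState = "idle" | "started" | "finished";

interface SketchResponseInit {
  status: number;
  headers: Record<string, string>;
}

interface SketchChunk {
  data?: Uint8Array;
  end?: boolean;
  error?: { code: string; message?: string };
}

// Stand-in for the RPC calls a real sink would issue to the runtime.
interface SketchWire {
  responseStart(init: SketchResponseInit): void;
  responseChunk(chunk: SketchChunk): void;
}

class SketchResponseSink {
  private state: SketchSinkState = "idle";
  private readonly wire: SketchWire;

  constructor(wire: SketchWire) {
    this.wire = wire;
  }

  start(init: SketchResponseInit): void {
    if (this.state !== "idle") throw new Error("start() must be the first call, made once");
    this.state = "started";
    this.wire.responseStart(init);
  }

  write(data: Uint8Array): void {
    if (this.state !== "started") throw new Error("write() needs start() and no prior end()/error()");
    this.wire.responseChunk({ data });
  }

  end(): void {
    this.finish({ end: true });
  }

  error(error: { code: string; message?: string }): void {
    this.finish({ end: true, error });
  }

  // Both terminal calls share one transition, so only one of them can ever succeed.
  private finish(chunk: SketchChunk): void {
    if (this.state !== "started") throw new Error("exactly one end()/error() is allowed, after start()");
    this.state = "finished";
    this.wire.responseChunk(chunk);
  }
}
```

Routing `end` and `error` through a single transition is what makes "exactly one terminal call" hold without extra bookkeeping.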
The body queue iterator now latches a 'done' flag so a consumer that
calls .next() again after end:true gets done back instead of blocking
forever waiting for chunks the runtime will never send.
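
A minimal sketch of such a latching queue, assuming chunks arrive through a `push(chunk, end)` call. `SketchBodyQueue` and its members are hypothetical names chosen for illustration, not the SDK's.

```typescript
// Hypothetical async body queue: once `ended` is set it never resets,
// so any later next() resolves with done instead of parking forever.
class SketchBodyQueue implements AsyncIterable<Uint8Array> {
  private readonly buffered: Uint8Array[] = [];
  private readonly waiters: Array<(result: IteratorResult<Uint8Array>) => void> = [];
  private ended = false; // the latch

  push(chunk: Uint8Array, end: boolean): void {
    if (this.ended) return; // ignore anything after the final chunk
    if (chunk.length > 0) {
      const waiter = this.waiters.shift();
      if (waiter) waiter({ value: chunk, done: false });
      else this.buffered.push(chunk);
    }
    if (end) {
      this.ended = true;
      // Wake every consumer that was parked waiting for more data.
      for (const waiter of this.waiters.splice(0)) {
        waiter({ value: undefined, done: true });
      }
    }
  }

  [Symbol.asyncIterator](): AsyncIterator<Uint8Array> {
    return {
      next: async (): Promise<IteratorResult<Uint8Array>> => {
        const chunk = this.buffered.shift();
        if (chunk !== undefined) return { value: chunk, done: false };
        if (this.ended) return { value: undefined, done: true };
        return new Promise<IteratorResult<Uint8Array>>((resolve) => {
          this.waiters.push(resolve);
        });
      },
    };
  }
}
```

Buffered chunks are still drained before `done` is reported, so a consumer that reads slowly does not lose data when the end marker arrives early.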
Removes the previous onLlmRequest + onLlmStreamRequest split and the
LlmInferenceResponse / LlmInferenceStreamSink /
LlmInferenceStreamStartResponse public types. All three e2e tests
rewritten against the unified callback (one of them URL-dispatches
/responses -> SSE and /chat/completions -> buffered JSON; the consumer
can also branch on whether the request body has stream:true).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phase 4.1: expose an AbortSignal on the request envelope, abort it on a
cancel chunk from the runtime, and map consumer-side aborts to a 499 +
error{code:cancelled} response. Adds the cancellation e2e test.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
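
The cancellation mapping described in Phase 4.1 might look roughly like the sketch below. Only the 499 status and the `cancelled` error code come from the commit message; `dispatchWithCancellation`, `SketchCancelSink`, and `SketchCancelHandler` are hypothetical names, and the sketch assumes the handler has not started its response before it aborts.

```typescript
// Hypothetical sink shape for this sketch only.
interface SketchCancelSink {
  start(init: { status: number }): void;
  error(err: { code: string }): void;
}

type SketchCancelHandler = (signal: AbortSignal) => Promise<void>;

function dispatchWithCancellation(handler: SketchCancelHandler, sink: SketchCancelSink) {
  const controller = new AbortController();

  const settled = handler(controller.signal).catch((err: unknown) => {
    // Only failures that follow an abort are translated; others propagate unchanged.
    if (!controller.signal.aborted) throw err;
    sink.start({ status: 499 });
    sink.error({ code: "cancelled" });
  });

  return {
    // Would be invoked when the runtime sends a cancel chunk for this request.
    cancel: (): void => controller.abort(),
    settled,
  };
}
```

A handler that passes the signal on to its own I/O stops promptly when the runtime cancels, and the runtime still receives a well-formed terminal frame.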
Add an e2e test asserting that when the SDK consumer signals a terminal
error via responseBody.error({ code: 'cancelled' }) the runtime surfaces
it faithfully as a request failure rather than hanging. Completes the
consumer->runtime direction of Phase 4.1.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Surface the new `transport` discriminator on `LlmInferenceRequest` so consumers can tell an `"http"` request (plain HTTP / SSE) from a `"websocket"` one (full-duplex: each request-body chunk is one inbound WS message, each response-body write one outbound message). The adapter threads `params.transport` through, defaulting to `"http"`.
Regenerate rpc.ts against the runtime schema for the new field and add an e2e test exercising the full-duplex path: the fake model advertises `ws:/responses`, the runtime's WebSocket flag is enabled via env var, and the consumer pumps `/responses` events back per inbound message.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
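
To illustrate why the discriminator matters to a consumer, here is a rough sketch of reading the request body under each transport. `SketchInferenceRequest`, `resolveTransport`, and `readInbound` are hypothetical names; only the `"http"` / `"websocket"` values and the `"http"` default are taken from the description above.

```typescript
type SketchTransport = "http" | "websocket";

interface SketchInferenceRequest {
  url: string;
  transport?: SketchTransport; // optional here so the default can be shown
  requestBody: AsyncIterable<Uint8Array>;
}

// Mirrors the described default: a missing transport is treated as "http".
function resolveTransport(req: { transport?: SketchTransport }): SketchTransport {
  return req.transport ?? "http";
}

async function readInbound(req: SketchInferenceRequest): Promise<Uint8Array[]> {
  const chunks: Uint8Array[] = [];
  for await (const chunk of req.requestBody) chunks.push(chunk);

  // WebSocket: each chunk is one message, so boundaries must be preserved.
  if (resolveTransport(req) === "websocket") return chunks;

  // HTTP: chunk boundaries carry no meaning, so join into one body.
  const total = chunks.reduce((sum, c) => sum + c.length, 0);
  const joined = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    joined.set(c, offset);
    offset += c.length;
  }
  return [joined];
}
```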
Friendly product-code starting point for SDK consumers who want to observe or mutate LLM inference requests/responses by overriding virtual methods on a base class. Implements LlmInferenceProvider, so an instance can be returned directly from createLlmInferenceProvider.
Default behaviour is a transparent pass-through: each request is forwarded to its original URL via the WHATWG fetch global (HTTP) or WebSocket global (WebSocket), and the upstream response is streamed back unchanged. The same subclass handles both transports - onLlmRequest dispatches on req.transport.
Virtual hooks:
- HTTP: transformRequest, forward, transformResponse
- WebSocket: forwardWebSocket, transformRequestMessage, transformResponseMessage
E2e test (llm_inference_handler.e2e.test.ts) demonstrates a single TestHandler subclass servicing both an HTTP turn (single-shot title generation) and a WebSocket turn (main agent turn) against a per-test in-process http+ws upstream that speaks the real CAPI shapes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review fixes for github/copilot-sdk-internal#88 (Node SDK side).
- Honor the runtime's accepted=false ack: the response sink now aborts the provider's signal and stops emitting once the runtime drops the request (I1).
- Add a staging backstop in the adapter so a body chunk that arrives before its start frame is buffered and replayed rather than silently dropped (B1).
- Run the WebSocket request/response pumps concurrently and race their terminal states, so an upstream-closes-first (or runtime-cancels-first) case tears the other side down instead of hanging on a parked iterator (B2).
- Buffer inbound WS frames in wrapGlobalWebSocket until onMessage is registered so the first frames of a fast upstream aren't dropped.
- Collapse the dead send branch, hoist TextEncoder/TextDecoder singletons, and correct the LlmWebSocketUpstream.onClose contract doc.
- Update CopilotClientOptions.llmInference docs: streaming SSE and WebSocket are intercepted, not bypassed (I6).
- Add unit tests: chunk-before-start staging, accepted=false abort, WS upstream-close-first finalisation, and WS upstream-error propagation.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
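
The staging backstop (B1) can be pictured as two maps keyed by request id. The following is a sketch under stated assumptions, not the adapter's code: `SketchRequestStaging`, `onChunk`, and `onStart` are hypothetical names, and a production version would presumably bound or expire staged chunks.

```typescript
// Hypothetical staging layer: chunks that arrive before their start frame
// are held per request id and replayed in order once the start arrives.
class SketchRequestStaging {
  private readonly active = new Map<string, (chunk: Uint8Array) => void>();
  private readonly staged = new Map<string, Uint8Array[]>();

  onChunk(requestId: string, chunk: Uint8Array): void {
    const deliver = this.active.get(requestId);
    if (deliver) {
      deliver(chunk);
      return;
    }
    // No start frame yet: buffer instead of dropping.
    const pending = this.staged.get(requestId) ?? [];
    pending.push(chunk);
    this.staged.set(requestId, pending);
  }

  onStart(requestId: string, deliver: (chunk: Uint8Array) => void): void {
    this.active.set(requestId, deliver);
    const pending = this.staged.get(requestId);
    if (pending) {
      this.staged.delete(requestId);
      for (const chunk of pending) deliver(chunk); // replay in arrival order
    }
  }
}
```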
Drives a CAPI session and a BYOK (openai/responses) session entirely through the LLM inference callback: the consumer fabricates every model-layer response, so the CAPI record/replay proxy is never the inference endpoint. Asserts each in-session inference request carries req.sessionId === session.sessionId and that the two session ids differ.
The mock branches /responses on the request stream flag: BYOK turns whose config-derived model does not advertise streaming issue a buffered (non-streaming) /responses request expecting a single JSON response object, whereas the CAPI turn streams via SSE. This mirrors real upstream behaviour and confirms the callback transport faithfully delivers both shapes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
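
The branch on the request's stream flag could be sketched as below. `buildResponsesReply` and `SketchReply` are hypothetical, and the payload is a placeholder rather than the real CAPI or OpenAI response schema; only the split between SSE and a single buffered JSON object reflects the test described above.

```typescript
interface SketchReply {
  contentType: string;
  body: string;
}

function buildResponsesReply(requestBodyJson: string, text: string): SketchReply {
  const parsed = JSON.parse(requestBodyJson) as { stream?: boolean };

  // Placeholder payload: NOT the real event or response object shape.
  const payload = JSON.stringify({ text });

  if (parsed.stream === true) {
    // SSE framing: a "data:" line terminated by a blank line.
    return { contentType: "text/event-stream", body: `data: ${payload}\n\n` };
  }
  // Buffered request: one JSON document for the whole response.
  return { contentType: "application/json", body: payload };
}
```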
Mirrors the TypeScript LLM inference callback feature in the .NET SDK so consumers can observe/mutate the model-layer HTTP/WebSocket requests the runtime issues (CAPI and BYOK), with the runtime session id threaded into each callback.
- scripts/codegen/csharp.ts: emit the clientGlobal handler interface + registration so Rpc.cs gains the llmInference handler surface.
- LlmInferenceProvider.cs: low-level ILlmInferenceProvider API + adapter (request staging, response sink state machine) behind an internal ILlmInferenceResponseChannel seam for unit testing.
- LlmRequestHandler.cs: idiomatic pass-through base class mapping to HttpRequestMessage/HttpResponseMessage and ClientWebSocket, with virtual transform/forward hooks for both transports.
- Types.cs/Client.cs: wire LlmInferenceConfig into the client and register the provider on start.
- Tests: factored unit-test infra (recording channel/sink, inline provider, frame builders) with adapter + handler tests, plus CAPI+BYOK e2e tests asserting the session id reaches the callback. e2e provider emits raw JSON (reflection-free STJ) and serves all model-layer traffic off-network.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Hide the redundant low-level provider interface and adapter from the public surface in both SDKs; the sole public extension point is now the LlmRequestHandler base class. Replace the LlmInferenceConfig provider factory with a direct handler instance (the provider is client-global, constructed once with no args).
.NET: ILlmInferenceProvider + the LlmInferenceRequest/ResponseInit/ResponseSink DTOs become internal; LlmRequestHandler implements the interface explicitly so OnLlmRequestAsync leaves its public surface. LlmInferenceConfig.Handler replaces the Func<LlmRequestHandler> factory.
TS: stop exporting LlmInferenceProvider and createLlmInferenceAdapter from index.ts; LlmInferenceConfig.handler replaces createLlmInferenceProvider. The request/sink DTOs stay exported as onLlmRequest's contract (TS lacks explicit interface implementation).
E2E providers become LlmRequestHandler subclasses overriding onLlmRequest.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Collapse the HTTP callback seam to SendRequest/sendRequest, replace websocket hooks with per-connection handlers, and update tests to use the forwarding handler model.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
This PR adds SDK support for intercepting LLM inference requests and handling them in user code across both Node.js and .NET.
It includes the full feature work on this branch:
- `LlmRequestHandler` model with forwarding helpers
What changed
Shared protocol and plumbing
Node.js SDK
- `LlmRequestHandler` and related types as the main public interception surface
.NET SDK
- `LlmRequestHandler`, `CopilotWebSocketHandler`, and forwarding helpers for the common mutate-and-forward case
API shape
- `SendRequestAsync`/`sendRequest`)
- `OpenWebSocketAsync`/`openWebSocket`, returning a per-connection handler object
Usage examples
C#
Node.js
Tests
This branch adds and updates coverage for:
Resolves github/copilot-sdk-internal#88