benchmark: replace simulation with walltime macro benches + e2e correctness #91
Merged
Drives createInput as a consumer would (read chunk, scan, flush a pending ESC, dispatch) and asserts the same event stream whether input is fed whole, byte-by-byte, or split mid-sequence. Covers the lone-ESC flush path.
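The chunking-invariance property this test checks can be sketched with a toy scanner. Everything below (`ToyInput`, `drive`, the event shapes, and flushing once at end of input) is a hypothetical stand-in chosen for illustration, not the project's `createInput` API.

```typescript
// Toy stand-in for illustration only; NOT the project's createInput API.
type InputEvent =
  | { type: "char"; value: string }
  | { type: "escape" }
  | { type: "csi"; sequence: string };

const ESC = 0x1b;
const LEFT_BRACKET = 0x5b; // '['

class ToyInput {
  private pending: number[] = [];

  // Scan one chunk. An incomplete sequence stays pending until more bytes arrive.
  scan(chunk: Uint8Array): InputEvent[] {
    const events: InputEvent[] = [];
    chunk.forEach((byte) => this.step(byte, events));
    return events;
  }

  // Flush whatever is pending: a lone ESC becomes an "escape" event.
  flush(): InputEvent[] {
    const events: InputEvent[] = [];
    if (this.pending.length > 0) {
      events.push({ type: "escape" });
      for (const byte of this.pending.slice(1)) {
        events.push({ type: "char", value: String.fromCharCode(byte) });
      }
      this.pending = [];
    }
    return events;
  }

  private step(byte: number, events: InputEvent[]): void {
    if (this.pending.length === 0) {
      if (byte === ESC) this.pending.push(byte);
      else events.push({ type: "char", value: String.fromCharCode(byte) });
    } else if (this.pending.length === 1 && byte !== LEFT_BRACKET) {
      // ESC not followed by '[': emit the lone escape, then reprocess the byte.
      events.push({ type: "escape" });
      this.pending = [];
      this.step(byte, events);
    } else {
      this.pending.push(byte);
      // After the '[', a CSI sequence ends at a final byte in 0x40..0x7e.
      if (this.pending.length > 2 && byte >= 0x40 && byte <= 0x7e) {
        const sequence = String.fromCharCode(...this.pending.slice(1));
        events.push({ type: "csi", sequence });
        this.pending = [];
      }
    }
  }
}

// Consumer loop: scan each chunk, then flush once at end of input.
// Assumption: a real consumer may flush on a timeout instead.
function drive(chunks: Uint8Array[]): InputEvent[] {
  const input = new ToyInput();
  const events: InputEvent[] = [];
  for (const chunk of chunks) events.push(...input.scan(chunk));
  events.push(...input.flush());
  return events;
}

// 'a', an up-arrow style CSI sequence, 'b', then a trailing lone ESC.
const corpus = new TextEncoder().encode("a\x1b[Ab\x1b");
const byteByByte = Array.from(corpus, (b) => Uint8Array.of(b));
// Split between the ESC and the '[' of the CSI sequence.
const midSequence = [corpus.subarray(0, 2), corpus.subarray(2)];

const whole = JSON.stringify(drive([corpus]));
console.log(whole === JSON.stringify(drive(byteByByte)));
console.log(whole === JSON.stringify(drive(midSequence)));
```

Comparing serialized event streams keeps the assertion independent of how the input was chunked, which is the point of the test: the split position should never be observable in the output.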
In-process bench over a large mixed corpus, measured in WallTime, where real work dominates placement/JIT/alloc noise (the Simulation micro-benches sit on codegen cliffs). The corpus is fed in small reads because a single scan() drains at most 256 events (the wasm event-buffer cap); small reads keep every call under the cap, so the whole corpus is processed.
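A minimal sketch of why read size matters under a per-call event cap. The `scan` function here is a hypothetical stand-in that yields one event per byte (a worst case assumed for illustration); only the 256-event cap comes from the description above.

```typescript
// The cap described above; everything else in this sketch is an assumption.
const EVENT_CAP = 256;

// Hypothetical scanner: one event per byte, but never more than
// EVENT_CAP events per call. Events beyond the cap are not reported.
function scan(chunk: Uint8Array): number {
  return Math.min(chunk.length, EVENT_CAP);
}

// Feed the corpus in reads of `readSize` bytes and count the events drained.
function feed(corpus: Uint8Array, readSize: number): number {
  let events = 0;
  for (let offset = 0; offset < corpus.length; offset += readSize) {
    events += scan(corpus.subarray(offset, offset + readSize));
  }
  return events;
}

const corpus = new Uint8Array(10000);

// Small reads stay under the cap, so every byte's event is counted.
console.log(feed(corpus, 64));
// One large read hits the cap, so most of the corpus goes unmeasured.
console.log(feed(corpus, corpus.length));
```

In this toy model any read size of at most 256 bytes keeps each call under the cap. The real events-per-byte ratio is not stated in the PR, so the safe read size for the actual scanner may differ.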
The Simulation input micro-benches (long input burst, printable ASCII single char) move by 50-90% on unrelated changes, even a test-only rename, because their simulated cost snaps to a different value when the combined wasm shifts. Input perf is now gated by the throughput WallTime bench, and correctness by the event-loop integration test.
Pin ubuntu-24.04 so the wasm toolchain is stable and drop the wasm cache so main and PRs always rebuild identically. The cache froze main's baseline on a stale build, so every PR compared a fresh build against a stale baseline and produced phantom regressions.
Merging this PR will not alter performance
CodSpeed's walltime tinybench plugin only populates result.latency on its async path; a sync task fn leaves it undefined, and the run crashes (Cannot destructure 'min' of result.latency). startup.bench works because its tasks return a promise (spawnFixture). Return Promise.resolve() from the throughput task so the plugin takes the async path; a bare async fn with no await would trip deno's require-await lint. The walltime job runs startup and throughput as separate node processes.
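The task shape described here can be sketched as follows. `processCorpus` and `throughputTask` are hypothetical names, and how the plugin chooses between its sync and async paths is not shown in this PR; the sketch only illustrates a synchronous workload wrapped in a promise-returning function that is not declared `async`.

```typescript
// Hypothetical stand-in for the throughput workload: synchronous, CPU-bound.
function processCorpus(corpus: Uint8Array): number {
  let checksum = 0;
  corpus.forEach((byte) => {
    checksum = (checksum + byte) % 65521;
  });
  return checksum;
}

const corpus = new Uint8Array(64 * 1024).fill(0x61);

// Synchronous work with a promise-shaped return. The function is not
// declared `async`, so a require-await style lint rule has no await-less
// async function to flag.
function throughputTask(): Promise<void> {
  processCorpus(corpus);
  return Promise.resolve();
}

// A caller that awaits the task sees the same behavior as for a truly async task.
throughputTask().then(() => console.log("task settled"));
```

The alternative, `async () => { processCorpus(corpus); }`, returns a promise too, but it is the form the commit message says the lint rule rejects.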
… job CodSpeed Simulation (Valgrind) is unviable for CI here. Measurements are flaky: dashboard layout swung 20x, and diff render moved 17% on changes that touch no render code. Runtime is unpredictable: the same commit's simulation job finished in ~2 min on one run and hung past 30 min on the next. Convert render/ops to ms-scale WallTime macro benches (looped, promise-returning, ~7-11ms at <1% variance), run them as separate node processes in the walltime job, and drop the simulation job entirely. mod.ts is now a local aggregator for deno task bench.
12 tasks
jbolda
requested changes
Jun 8, 2026
jbolda
left a comment
Collaborator
The change to walltime makes sense. We were just discussing this aspect of codspeed on Thursday for some effection code. Might be worth just generally writing down the whys and whats for some of the decisions we made though!
cowboyd
reviewed
Jun 8, 2026
jbolda
approved these changes
Jun 17, 2026
Our CodSpeed benchmarks were frequently inaccurate: they flagged big "regressions"/"improvements" on PRs that didn't touch the measured code (a test-file rename alone showed −54% on input; "dashboard layout" swung 20×). The simulation job's runtime was also wildly unpredictable.