gh-150638: Improve performance of json.loads and json.load for numeric data#150639

Open: eendebakpt wants to merge 10 commits into python:main from eendebakpt:json-loads-opt


@eendebakpt eendebakpt commented May 30, 2026

Contributor

_match_number_unicode() (the C accelerator behind json.loads) previously allocated a PyBytes object for every number, copied the digits into it, and then called the generic PyLong_FromString / PyFloat_FromString parsers.
This PR parses the common cases directly from the already-scanned text.
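Because the change replaces the generic parsers only for the common cases, the values returned by json.loads must stay identical on either side of each fast-path boundary. The snippet below is a minimal spot check, not part of the PR: the case lists and the `check_numbers` helper are illustrative assumptions, chosen around the 64-bit limits and the 19-digit cutoff described in the commit message.

```python
import json

# Hypothetical spot-check cases, picked around the limits the fast path cares about.
INT_CASES = [
    "0", "-0", "42", "-42",
    "1700000000000000000",                           # 19-digit timestamp-like value
    "9223372036854775807", "-9223372036854775808",   # long long limits
    "9999999999999999999",                           # largest 19-digit value
    "18446744073709551615", "18446744073709551616",  # around 2**64
    "-9223372036854775809",                          # just below long long range
    "1" + "0" * 40,                                  # far beyond 64 bits
]
FLOAT_CASES = [
    "0.0", "-1000.0", "1.5", "0.1",
    "1e3", "1E-5", "-2.5e+10", "123456789.123456789",
]


def check_numbers():
    """Compare json.loads against int()/float() for every case."""
    for text in INT_CASES:
        got = json.loads(text)
        assert type(got) is int and got == int(text), text
    for text in FLOAT_CASES:
        got = json.loads(text)
        assert type(got) is float and got == float(text), text
    # Numbers nested in a container are matched by the same scanner.
    doc = "[" + ", ".join(INT_CASES + FLOAT_CASES) + "]"
    expected = [int(t) for t in INT_CASES] + [float(t) for t in FLOAT_CASES]
    assert json.loads(doc) == expected
    return len(INT_CASES) + len(FLOAT_CASES)


if __name__ == "__main__":
    print(check_numbers(), "cases OK")
```

Running it on both main and the PR branch should give the same result; a mismatch would point at one of the boundary cases.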

| Benchmark | main | this PR | speedup |
|---|---|---|---|
| json.loads, number-heavy document (script below) | 3.05 ms | 2.38 ms | 1.28× |
| json.load, same document via file object | 3.17 ms | 2.48 ms | 1.28× |
| pyperformance bm_json_loads | 25.2 µs | 23.9 µs | 1.05× |

The standard bm_json_loads document is string/dict-dominated, so it gains less.

Benchmark script
"""Benchmark json.loads() and json.load() on a number-heavy document.

The document is generated deterministically at import time (no external
files) and resembles a typical telemetry/API payload: a list of records
mixing integers, 19-digit timestamps, negative integers, floats, short
strings, booleans and small integer arrays.

json.load(fp) is json.loads(fp.read()); here fp is an in-memory io.StringIO
(rewound each call) so the same document is parsed without disk noise.

Inline data size: ~304 KiB (2000 records).
"""
import io
import json
import pyperf


def build_document(n=2000):
    return [
        {
            "id": i,
            "timestamp": 1_700_000_000_000_000_000 + i * 1_000,  # 19-digit int
            "value": i * 1.5 - 1000.0,                           # float
            "delta": -i,                                         # negative int
            "label": "item-%d" % i,                              # short string
            "ok": i % 2 == 0,                                    # bool
            "samples": [i, -i, i * 2, i * 3, i * 5],             # int array
        }
        for i in range(n)
    ]


JSON_DATA = json.dumps(build_document())
STREAM = io.StringIO(JSON_DATA)


def load_from_stream():
    STREAM.seek(0)
    return json.load(STREAM)


if __name__ == "__main__":
    runner = pyperf.Runner()
    runner.metadata["description"] = "json.loads()/json.load() on a number-heavy document"
    runner.bench_func("json_loads", json.loads, JSON_DATA)
    runner.bench_func("json_load", load_from_stream)

Add a fast path to _match_number_unicode for integers that fit in a
64-bit integer (at most 19 decimal digits): accumulate the value
directly into an unsigned long long instead of allocating a PyBytes and
calling the generic PyLong_FromString.  Positive values use
PyLong_FromUnsignedLongLong; negatives within long long range use
PyLong_FromLongLong; larger integers fall back to the previous path.
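The decision logic in this paragraph can be modeled in a few lines of Python. This is only an illustrative sketch of the control flow, not the C code in Modules/_json.c: `parse_int_token`, the constants, and the "fast"/"fallback" labels are hypothetical names. Token validation such as rejecting leading zeros is assumed to have happened in the scanner already.

```python
# Python model of the integer fast path; all names here are hypothetical.
ULLONG_MAX = 2**64 - 1
LLONG_MIN_MAGNITUDE = 2**63   # absolute value of the smallest long long
MAX_FAST_DIGITS = 19          # every 19-digit decimal value is below 2**64


def parse_int_token(token):
    """Return (value, path) for an integer token such as "-42".

    path is "fast" when a 64-bit accumulator suffices and "fallback"
    when the generic parser (int() in this model) is required.
    """
    negative = token.startswith("-")
    digits = token[1:] if negative else token
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not an integer token: {token!r}")

    if len(digits) <= MAX_FAST_DIGITS:
        acc = 0
        for ch in digits:
            acc = acc * 10 + (ord(ch) - ord("0"))
        assert acc <= ULLONG_MAX            # cannot overflow with <= 19 digits
        if not negative:
            return acc, "fast"              # C: PyLong_FromUnsignedLongLong
        if acc <= LLONG_MIN_MAGNITUDE:
            return -acc, "fast"             # C: PyLong_FromLongLong
    return int(token), "fallback"           # C: PyLong_FromString


if __name__ == "__main__":
    for tok in ("1700000000000001000", "-9223372036854775808",
                "-9223372036854775809", "10000000000000000000"):
        print(tok, parse_int_token(tok))
```

The digit-count cap is what makes the accumulator safe without a per-digit overflow check: the largest 19-digit number, 9999999999999999999, is still smaller than 2**64.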

For floats and big integers, copy the (always-ASCII) number text into a
stack buffer for the common short case to avoid the PyBytes allocation,
and call PyOS_string_to_double directly for floats.
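The "always-ASCII" remark follows from the JSON number grammar, which admits only digits, `-`, `+`, `.`, `e` and `E`. The sketch below models the remaining path in Python under stated assumptions: `STACK_BUF_SIZE` is a made-up value (the buffer size used in the PR is not given here), `convert_number` is a hypothetical helper, and `float()`/`int()` on bytes stand in for PyOS_string_to_double and PyLong_FromString.

```python
import re

# JSON number grammar (RFC 8259) with explicit ASCII digit classes.
NUMBER_RE = re.compile(r"-?(?:0|[1-9][0-9]*)(\.[0-9]+)?([eE][-+]?[0-9]+)?")

STACK_BUF_SIZE = 64  # hypothetical size, for illustration only


def convert_number(token):
    """Return (value, fits_stack_buffer) for a JSON number token."""
    m = NUMBER_RE.fullmatch(token)
    if m is None:
        raise ValueError(f"not a JSON number: {token!r}")
    # Encoding cannot fail: the grammar above matches ASCII characters only,
    # so one character always maps to exactly one byte.
    raw = token.encode("ascii")
    fits_stack_buffer = len(raw) < STACK_BUF_SIZE   # keep one byte for the NUL
    is_float = m.group(1) is not None or m.group(2) is not None
    value = float(raw) if is_float else int(raw)
    return value, fits_stack_buffer


if __name__ == "__main__":
    print(convert_number("-1000.0"))
    print(convert_number("12345678901234567890"))
    print(convert_number("1" + "0" * 99)[1])
```

Tokens longer than the buffer would take the heap-allocated route in the C code; in this model they are only flagged.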

Benchmarks (optimized free-threaded build):
* pyperformance json_loads: 1.06x faster overall
* microbench: small int arrays ~2x, 20-int doc 1.48x, mixed dict 1.16x

All test_json tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@markshannon markshannon left a comment

Member


Looks good, but the wide scope of numstr makes it a bit hard to reason about refleaks.

Can you remove the early declaration of numstr and declare it where you need it?

Comment thread Modules/_json.c
}
else
rval = PyLong_FromString(buf, NULL, 10);
Py_XDECREF(numstr);

Member


numstr is only defined on lines 1068 and 1108.
You could declare it only in those places, and decref it in the same blocks.

Comment thread Modules/_json.c Outdated