fix(langchain): handle Anthropic cache_creation nested-dict in _parse_usage_model by ASAD-BE18 · Pull Request #1698 · langfuse/langfuse-python

ASAD-BE18 · 2026-06-10T09:01:08Z

What changed

_parse_usage_model in langfuse/langchain/CallbackHandler.py now handles Anthropic's cache_creation field when it arrives as a nested dict (tiered prompt-caching API, langchain-anthropic ≥ 0.3.x).

Root cause

Anthropic's extended prompt-caching API includes a cache_creation key in the LLM output usage dict:

{
    "input_tokens": 9454,
    "output_tokens": 380,
    "cache_read_input_tokens": 0,
    "cache_creation": {
        "ephemeral_1h_input_tokens": 500,
        "ephemeral_5m_input_tokens": 0,
    },
}

v2.x (≤ 2.60.10) — The old final filter (v is not None and not isinstance(v, str)) let the nested dict pass through into UpdateGenerationBody.usageDetails, which is typed as Union[Dict[str, int], ...]. Pydantic raised a ValidationError ("value is not a valid integer") that was swallowed by the ingestion queue, silently dropping the entire generation end() event — every trace had endTime=null and input=0 / output=0.

v4.x (HEAD before this PR) — The isinstance(v, int) filter correctly prevented the crash, but the cache_creation dict was silently discarded, losing all cache-creation token data.
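The difference between the two final filters can be reproduced in isolation. The sketch below is only an approximation: `filter_v2` and `filter_v4` are hypothetical names standing in for the last filtering step described above, not functions from the SDK, and the Pydantic validation step is not reproduced.

```python
# Sample usage payload from the root-cause description above.
usage = {
    "input_tokens": 9454,
    "output_tokens": 380,
    "cache_read_input_tokens": 0,
    "cache_creation": {
        "ephemeral_1h_input_tokens": 500,
        "ephemeral_5m_input_tokens": 0,
    },
}


def filter_v2(usage_model):
    # Hypothetical stand-in for the v2.x filter: drop only None and str values.
    return {
        k: v
        for k, v in usage_model.items()
        if v is not None and not isinstance(v, str)
    }


def filter_v4(usage_model):
    # Hypothetical stand-in for the v4.x filter: keep only int values.
    return {k: v for k, v in usage_model.items() if isinstance(v, int)}


v2_result = filter_v2(usage)
v4_result = filter_v4(usage)

# v2.x: the nested dict survives and would later fail Dict[str, int] validation.
print(isinstance(v2_result["cache_creation"], dict))
# v4.x: the nested dict is dropped, and the per-tier counts go with it.
print("cache_creation" in v4_result)
```

Neither filter keeps the per-tier counts in a usable form, which is why the fix flattens the dict before the filter runs.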

Fix

Before the final isinstance(v, int) filter, pop cache_creation and flatten each per-tier value into a cache_creation_{tier_key} entry (matching the pattern already used for input_token_details). If the total is non-zero, also set cache_creation_input_tokens via setdefault so the legacy scalar field is not overwritten if already present.

# Anthropic extended prompt caching: cache_creation is a dict keyed by cache tier.
if "cache_creation" in usage_model and isinstance(usage_model["cache_creation"], dict):
    cache_creation = usage_model.pop("cache_creation")
    total = 0
    for tier_key, tier_val in cache_creation.items():
        if isinstance(tier_val, int):
            usage_model[f"cache_creation_{tier_key}"] = tier_val
            total += tier_val
    if total > 0:
        usage_model.setdefault("cache_creation_input_tokens", total)
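To see what the block produces, it can be wrapped in a small standalone function and applied to the payload from the root-cause section. This is a minimal sketch: `flatten_cache_creation` is a hypothetical wrapper written for illustration, and it reproduces only the new block plus the final int filter, not the rest of `_parse_usage_model`.

```python
def flatten_cache_creation(usage):
    """Hypothetical wrapper: the PR's new block followed by the final int filter."""
    usage_model = dict(usage)  # shallow copy so the caller's dict is untouched

    if "cache_creation" in usage_model and isinstance(
        usage_model["cache_creation"], dict
    ):
        cache_creation = usage_model.pop("cache_creation")
        total = 0
        for tier_key, tier_val in cache_creation.items():
            if isinstance(tier_val, int):
                usage_model[f"cache_creation_{tier_key}"] = tier_val
                total += tier_val
        if total > 0:
            # setdefault leaves an existing legacy scalar untouched.
            usage_model.setdefault("cache_creation_input_tokens", total)

    # Final filter: keep only int values.
    return {k: v for k, v in usage_model.items() if isinstance(v, int)}


result = flatten_cache_creation(
    {
        "input_tokens": 9454,
        "output_tokens": 380,
        "cache_read_input_tokens": 0,
        "cache_creation": {
            "ephemeral_1h_input_tokens": 500,
            "ephemeral_5m_input_tokens": 0,
        },
    }
)

all_zero = flatten_cache_creation(
    {
        "input_tokens": 10,
        "output_tokens": 5,
        "cache_creation": {
            "ephemeral_1h_input_tokens": 0,
            "ephemeral_5m_input_tokens": 0,
        },
    }
)

print(result)
print(all_zero)
```

With the first payload, the per-tier keys appear alongside an aggregate of 500. With the all-zero payload, the per-tier keys are kept but no aggregate is added.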

Tests added (`tests/unit/test_parse_usage_model.py`)

| Test | What it verifies |
| --- | --- |
| `test_anthropic_cache_creation_dict_flattened` | Per-tier keys present; aggregate = sum of tiers; original dict key absent |
| `test_anthropic_cache_creation_all_zeros_no_aggregate` | Zero-value tiers stored; no aggregate added when total is zero |
| `test_anthropic_cache_creation_legacy_field_not_overwritten` | `setdefault` preserves existing `cache_creation_input_tokens` value |

Existing test_standard_tier_input_token_details and test_priority_tier_not_subtracted still pass unchanged.

Verification

uv run --frozen pytest tests/unit/test_parse_usage_model.py -v
uv run --frozen ruff check langfuse/langchain/CallbackHandler.py tests/unit/test_parse_usage_model.py
uv run --frozen ruff format --check langfuse/langchain/CallbackHandler.py tests/unit/test_parse_usage_model.py

All pass. No live-provider tests are needed since the fix is unit-testable.

Greptile Summary

This PR fixes _parse_usage_model in langfuse/langchain/CallbackHandler.py to correctly handle the Anthropic extended prompt-caching API where cache_creation arrives as a nested dict of per-tier token counts rather than a scalar integer.

The new block pops cache_creation when it is a dict, flattens each tier into a cache_creation_{tier_key} key, and uses setdefault to set the legacy aggregate cache_creation_input_tokens only when the total is non-zero, preserving any pre-existing scalar value.
Three focused unit tests cover the flattening, the all-zeros case, and the legacy-field preservation case.

Confidence Score: 5/5

Safe to merge — the change is narrowly scoped to a single parsing helper, all existing tests still pass, and three new unit tests cover the key scenarios introduced by this fix.

The fix is a well-contained addition inside an already-guarded if isinstance(usage_model, dict) block, the logic is straightforward dict-flattening with no side effects on other fields, and the setdefault correctly prevents clobbering an existing scalar. The one theoretical gap — a non-integer tier value being silently skipped — mirrors how the rest of the function already handles unexpected types via the final isinstance(v, int) filter.

No files require special attention.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[_parse_usage_model called] --> B{has __dict__?}
    B -- yes --> C[usage equals usage.__dict__]
    B -- no --> D[usage_model equals usage.copy]
    C --> D
    D --> E{OpenAI format?}
    E -- yes --> F[return usage_model as-is]
    E -- no --> G[Apply conversion_list key renames]
    G --> H[Flatten input_token_details]
    H --> I[Flatten output_token_details]
    I --> J[Flatten Vertex AI token details]
    J --> N{cache_creation present AND is dict?}
    N -- no --> P
    N -- yes --> O[Pop cache_creation dict and flatten tier keys]
    O --> O2{total greater than 0?}
    O2 -- yes --> O3[setdefault cache_creation_input_tokens equals total]
    O2 -- no --> P
    O3 --> P[Final filter: keep only int values]
    P --> Q{usage_model empty?}
    Q -- yes --> R[return None]
    Q -- no --> S[return usage_model]


…_usage_model

Anthropic's extended prompt-caching API returns cache_creation as a nested dict keyed by cache tier (e.g. {"ephemeral_1h_input_tokens": 500, "ephemeral_5m_input_tokens": 0}) inside the LLM output usage object. Previously this value was passed through to UpdateGenerationBody.usageDetails unchanged.

On v2.x (final filter: not None/not str) the nested dict survived, causing a Pydantic ValidationError ("value is not a valid integer") that was silently swallowed by the ingestion queue, dropping the entire generation end() event and leaving every trace with endTime=null and input/output=0.

On v4.x (final filter: isinstance(v, int)) the nested dict was silently discarded, losing all cache-creation token data.

Fix: before the final int-filter, pop cache_creation, flatten each tier value into a cache_creation_{tier_key} entry, and set the aggregated cache_creation_input_tokens total via setdefault so the legacy field is not overwritten if already present.

Closes langfuse#1697

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

CLAassistant · 2026-06-10T09:01:15Z

All committers have signed the CLA.

hassiebp · 2026-06-12T09:45:08Z

@claude review

claude

LGTM — small, well-scoped fix that flattens Anthropic's nested cache_creation dict; production logic is correct and the inline nit is a test-quality improvement, not a blocker.

Extended reasoning...

Overview

This PR adds a small block inside _parse_usage_model in langfuse/langchain/CallbackHandler.py to handle Anthropic's extended prompt-caching API, where cache_creation arrives as a nested dict (per-tier token counts) rather than a scalar. The block pops the dict, flattens each tier into cache_creation_{tier_key} (mirroring the existing input_token_details pattern), and uses setdefault to expose an aggregate cache_creation_input_tokens without clobbering a pre-existing scalar. Three new unit tests cover the flatten case, all-zeros case, and legacy-field preservation; existing tier tests are unchanged.

Security risks

None. The change is a pure dict-transformation with isinstance guards; no I/O, auth, crypto, or untrusted-input parsing surface is touched.

Level of scrutiny

Low. This is a self-contained parsing helper change behind isinstance(usage_model, dict) and isinstance(usage_model["cache_creation"], dict) guards, with isinstance(int) per-tier filtering that matches the function's existing defensive style. The production logic correctly handles the three relevant cases (flatten, all-zeros aggregate suppression, legacy-field preservation).

Other factors

The single finding is a nit about test_anthropic_cache_creation_legacy_field_not_overwritten — the legacy value (300) equals the tier sum (300), so the test would also pass under a regressed plain-assignment implementation. The production code is correct; this is purely a test-discrimination improvement worth picking up but not blocking. The maintainer explicitly requested review and there are no outstanding human-reviewer comments.

claude · 2026-06-12T09:52:59Z

+def test_anthropic_cache_creation_legacy_field_not_overwritten():
+    """If both the legacy cache_creation_input_tokens (int) and the new cache_creation
+    (dict) are present, the legacy value is preserved and the dict total is not added."""
+    usage = {
+        "input_tokens": 100,
+        "output_tokens": 50,
+        "cache_creation_input_tokens": 300,  # legacy field already present
+        "cache_creation": {
+            "ephemeral_1h_input_tokens": 200,
+            "ephemeral_5m_input_tokens": 100,
+        },
+    }
+    result = _parse_usage_model(usage)
+
+    # setdefault must not overwrite the existing legacy value
+    assert result["cache_creation_input_tokens"] == 300
+    assert "cache_creation" not in result


🟡 The test_anthropic_cache_creation_legacy_field_not_overwritten test uses legacy cache_creation_input_tokens=300 and tier values that sum to 200+100=300 — the same value. Because the legacy value equals the tier sum, the assertion result["cache_creation_input_tokens"] == 300 would pass even if setdefault were regressed to a plain assignment (usage_model["cache_creation_input_tokens"] = total), defeating the test's stated purpose of guarding setdefault semantics. Use a legacy value that differs from the tier sum (e.g. cache_creation_input_tokens=999 with tiers summing to 300) so a regression to direct assignment would be caught.

Extended reasoning...

What the bug is. The new test test_anthropic_cache_creation_legacy_field_not_overwritten (tests/unit/test_parse_usage_model.py:57-73) is intended to prove that the production code uses setdefault("cache_creation_input_tokens", total) — i.e. that a pre-existing legacy scalar value is preserved instead of being overwritten by the dict total. As constructed, however, the test cannot distinguish between the correct setdefault implementation and a buggy plain-assignment regression.

How it manifests. The fixture sets cache_creation_input_tokens=300 (legacy scalar) and a nested cache_creation dict whose tiers sum to 200+100=300. The only assertion on this field is assert result["cache_creation_input_tokens"] == 300. Since the legacy value (300) numerically equals the tier sum (300), both implementations produce the same final value:

Correct (setdefault): key already exists with 300 → kept at 300 ✓

Regressed (plain assignment): key overwritten with total=300 → ends at 300 ✓

Both branches satisfy the assertion, so the test is silent on which behavior the code actually has.

Why existing code doesn't prevent it. Nothing else in the test file pins cache_creation_input_tokens to a value different from the tier sum. The two sibling tests use a single non-overlapping field (test_anthropic_cache_creation_dict_flattened has no legacy field present; test_anthropic_cache_creation_all_zeros_no_aggregate asserts the key is absent). So no test in the suite would fail if setdefault were swapped for direct assignment.

Step-by-step proof. Mentally apply the regressed implementation usage_model["cache_creation_input_tokens"] = total to the fixture:

1. Input dict contains cache_creation_input_tokens: 300 and cache_creation: {ephemeral_1h_input_tokens: 200, ephemeral_5m_input_tokens: 100}.

2. The regressed branch pops cache_creation, computes total = 200 + 100 = 300.

3. usage_model["cache_creation_input_tokens"] = 300 overwrites the legacy 300 with the new 300, so the value is unchanged.

4. Final filter keeps all ints. result["cache_creation_input_tokens"] == 300, so the assertion passes despite the regression.

Impact. Pure test-quality issue, no production impact. The production code in langfuse/langchain/CallbackHandler.py correctly uses setdefault and the fix as shipped is correct. But the regression test guarding it has zero discriminating power — a future refactor (or AI-assisted edit) that accidentally drops setdefault for direct assignment would slip through CI.

How to fix. Change the fixture so the legacy value differs from the tier sum, e.g.:

"cache_creation_input_tokens": 999,  # legacy field already present
"cache_creation": {
    "ephemeral_1h_input_tokens": 200,
    "ephemeral_5m_input_tokens": 100,
},

and assert result["cache_creation_input_tokens"] == 999. Under setdefault the legacy 999 wins; under plain assignment the value would become 300 and the assertion would fail — giving the test the discriminating power it claims to have.
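The argument can be checked mechanically by running both implementations against both fixtures. The following is a sketch under stated assumptions: `flatten` and `make_usage` are hypothetical helpers written for this illustration, and the `use_setdefault` flag exists only to toggle between the shipped behaviour and the regressed plain assignment.

```python
def flatten(usage, use_setdefault):
    """Hypothetical stand-in for the new block; the flag toggles the regression."""
    usage_model = dict(usage)
    if "cache_creation" in usage_model and isinstance(
        usage_model["cache_creation"], dict
    ):
        cache_creation = usage_model.pop("cache_creation")
        total = 0
        for tier_key, tier_val in cache_creation.items():
            if isinstance(tier_val, int):
                usage_model[f"cache_creation_{tier_key}"] = tier_val
                total += tier_val
        if total > 0:
            if use_setdefault:
                # Shipped behaviour: keep an existing legacy value.
                usage_model.setdefault("cache_creation_input_tokens", total)
            else:
                # Regressed behaviour: always overwrite with the tier sum.
                usage_model["cache_creation_input_tokens"] = total
    return usage_model


def make_usage(legacy_value):
    # Fixture from the review comment, with a configurable legacy scalar.
    return {
        "input_tokens": 100,
        "output_tokens": 50,
        "cache_creation_input_tokens": legacy_value,
        "cache_creation": {
            "ephemeral_1h_input_tokens": 200,
            "ephemeral_5m_input_tokens": 100,
        },
    }


KEY = "cache_creation_input_tokens"

# Legacy value equal to the tier sum: both implementations agree.
same_correct = flatten(make_usage(300), use_setdefault=True)[KEY]
same_regressed = flatten(make_usage(300), use_setdefault=False)[KEY]

# Legacy value different from the tier sum: the implementations diverge.
diff_correct = flatten(make_usage(999), use_setdefault=True)[KEY]
diff_regressed = flatten(make_usage(999), use_setdefault=False)[KEY]

print(same_correct, same_regressed)  # 300 300
print(diff_correct, diff_regressed)  # 999 300
```

Only the 999 fixture separates the two implementations, so an assertion on 999 is the one that would fail under a regression.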

fixed in fe3af4d

The legacy value (300) equalled the tier sum (200+100=300), so the assertion would pass even if setdefault were replaced with a plain assignment. Use 999 to make the test discriminating.

claude Bot reviewed Jun 10, 2026

Merge branch 'main' into fix/anthropic-cache-creation-usage

348a4ab

claude Bot reviewed Jun 12, 2026

ASAD-BE18 added 3 commits June 12, 2026 14:56

Merge branch 'main' into fix/anthropic-cache-creation-usage

7ce5654

test(langchain): use distinct legacy value in setdefault test

fe3af4d

Merge branch 'main' into fix/anthropic-cache-creation-usage

d2d1432

ASAD-BE18 wants to merge 5 commits into langfuse:main from ASAD-BE18:fix/anthropic-cache-creation-usage

3 participants
