fix(langchain): handle Anthropic cache_creation nested-dict in _parse_usage_model #1698
ASAD-BE18 wants to merge 5 commits into
…_usage_model
Anthropic's extended prompt-caching API returns cache_creation as a nested
dict keyed by cache tier (e.g. {"ephemeral_1h_input_tokens": 500,
"ephemeral_5m_input_tokens": 0}) inside the LLM output usage object.
Previously this value was passed through to UpdateGenerationBody.usageDetails
unchanged. On v2.x (final filter: not None/not str) the nested dict
survived, causing a Pydantic ValidationError ("value is not a valid integer")
that was silently swallowed by the ingestion queue — dropping the entire
generation end() event and leaving every trace with endTime=null and
input/output=0. On v4.x (final filter: isinstance(v, int)) the nested dict
was silently discarded, losing all cache-creation token data.
Fix: before the final int-filter, pop cache_creation, flatten each tier
value into a cache_creation_{tier_key} entry, and set the aggregated
cache_creation_input_tokens total via setdefault so the legacy field is
not overwritten if already present.
Closes langfuse#1697
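The flattening step described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name flatten_cache_creation is hypothetical, and it assumes that zero-valued tiers are still written as flattened keys, that non-integer tier values are skipped, and that the aggregate is only set when the total is non-zero.

```python
from typing import Any, Dict


def flatten_cache_creation(usage_model: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: flatten a nested cache_creation dict in place."""
    cache_creation = usage_model.get("cache_creation")
    if isinstance(cache_creation, dict):
        # Remove the nested dict so it never reaches the final int filter.
        usage_model.pop("cache_creation")
        total = 0
        for tier_key, value in cache_creation.items():
            # Assumption: only integer tier counts are flattened.
            if isinstance(value, int):
                usage_model[f"cache_creation_{tier_key}"] = value
                total += value
        if total > 0:
            # setdefault keeps a legacy scalar if one is already present.
            usage_model.setdefault("cache_creation_input_tokens", total)
    return usage_model


usage = {
    "input_tokens": 9454,
    "output_tokens": 380,
    "cache_creation": {
        "ephemeral_1h_input_tokens": 500,
        "ephemeral_5m_input_tokens": 0,
    },
}
flattened = flatten_cache_creation(dict(usage))
```

With this sketch, flattened no longer has a cache_creation key; it carries one cache_creation_{tier_key} entry per tier plus the aggregate cache_creation_input_tokens.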
@claude review
LGTM — small, well-scoped fix that flattens Anthropic's nested cache_creation dict; production logic is correct and the inline nit is a test-quality improvement, not a blocker.
Overview
This PR adds a small block inside _parse_usage_model in langfuse/langchain/CallbackHandler.py to handle Anthropic's extended prompt-caching API, where cache_creation arrives as a nested dict (per-tier token counts) rather than a scalar. The block pops the dict, flattens each tier into cache_creation_{tier_key} (mirroring the existing input_token_details pattern), and uses setdefault to expose an aggregate cache_creation_input_tokens without clobbering a pre-existing scalar. Three new unit tests cover the flatten case, all-zeros case, and legacy-field preservation; existing tier tests are unchanged.
Security risks
None. The change is a pure dict-transformation with isinstance guards; no I/O, auth, crypto, or untrusted-input parsing surface is touched.
Level of scrutiny
Low. This is a self-contained parsing helper change behind isinstance(usage_model, dict) and isinstance(usage_model["cache_creation"], dict) guards, with isinstance(int) per-tier filtering that matches the function's existing defensive style. The production logic correctly handles the three relevant cases (flatten, all-zeros aggregate suppression, legacy-field preservation).
Other factors
The single finding is a nit about test_anthropic_cache_creation_legacy_field_not_overwritten — the legacy value (300) equals the tier sum (300), so the test would also pass under a regressed plain-assignment implementation. The production code is correct; this is purely a test-discrimination improvement worth picking up but not blocking. The maintainer explicitly requested review and there are no outstanding human-reviewer comments.
def test_anthropic_cache_creation_legacy_field_not_overwritten():
    """If both the legacy cache_creation_input_tokens (int) and the new cache_creation
    (dict) are present, the legacy value is preserved and the dict total is not added."""
    usage = {
        "input_tokens": 100,
        "output_tokens": 50,
        "cache_creation_input_tokens": 300,  # legacy field already present
        "cache_creation": {
            "ephemeral_1h_input_tokens": 200,
            "ephemeral_5m_input_tokens": 100,
        },
    }
    result = _parse_usage_model(usage)

    # setdefault must not overwrite the existing legacy value
    assert result["cache_creation_input_tokens"] == 300
    assert "cache_creation" not in result
🟡 The test_anthropic_cache_creation_legacy_field_not_overwritten test uses legacy cache_creation_input_tokens=300 and tier values that sum to 200+100=300 — the same value. Because the legacy value equals the tier sum, the assertion result["cache_creation_input_tokens"] == 300 would pass even if setdefault were regressed to a plain assignment (usage_model["cache_creation_input_tokens"] = total), defeating the test's stated purpose of guarding setdefault semantics. Use a legacy value that differs from the tier sum (e.g. cache_creation_input_tokens=999 with tiers summing to 300) so a regression to direct assignment would be caught.
What the bug is. The new test test_anthropic_cache_creation_legacy_field_not_overwritten (tests/unit/test_parse_usage_model.py:57-73) is intended to prove that the production code uses setdefault("cache_creation_input_tokens", total) — i.e. that a pre-existing legacy scalar value is preserved instead of being overwritten by the dict total. As constructed, however, the test cannot distinguish between the correct setdefault implementation and a buggy plain-assignment regression.
How it manifests. The fixture sets cache_creation_input_tokens=300 (legacy scalar) and a nested cache_creation dict whose tiers sum to 200+100=300. The only assertion on this field is assert result["cache_creation_input_tokens"] == 300. Since the legacy value (300) numerically equals the tier sum (300), both implementations produce the same final value:
- Correct (setdefault): key already exists with 300 → kept at 300 ✓
- Regressed (plain assignment): key overwritten with total=300 → ends at 300 ✓
Both branches satisfy the assertion, so the test is silent on which behavior the code actually has.
Why existing code doesn't prevent it. Nothing else in the test file pins cache_creation_input_tokens to a value different from the tier sum. The two sibling tests use a single non-overlapping field (test_anthropic_cache_creation_dict_flattened has no legacy field present; test_anthropic_cache_creation_all_zeros_no_aggregate asserts the key is absent). So no test in the suite would fail if setdefault were swapped for direct assignment.
Step-by-step proof. Mentally apply the regressed implementation usage_model["cache_creation_input_tokens"] = total to the fixture:
- Input dict contains cache_creation_input_tokens: 300 and cache_creation: {ephemeral_1h_input_tokens: 200, ephemeral_5m_input_tokens: 100}.
- The regressed branch pops cache_creation, computes total = 200 + 100 = 300.
- usage_model["cache_creation_input_tokens"] = 300 overwrites the legacy 300 with the new 300, so the value is unchanged.
- Final filter keeps all ints.
- result["cache_creation_input_tokens"] == 300, so the assertion passes despite the regression.
Impact. Pure test-quality issue, no production impact. The production code in langfuse/langchain/CallbackHandler.py correctly uses setdefault and the fix as shipped is correct. But the regression test guarding it has zero discriminating power — a future refactor (or AI-assisted edit) that accidentally drops setdefault for direct assignment would slip through CI.
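The step-by-step proof above can also be checked mechanically. The sketch below is illustrative only: both aggregate functions and the fixture helper are hypothetical stand-ins, not the code in CallbackHandler.py, and they model only the aggregate step. It compares a setdefault variant with a plain-assignment variant on the original fixture (legacy 300) and on a fixture whose legacy value differs from the tier sum (999).

```python
from typing import Any, Dict


def _tier_total(cache_creation: Dict[str, Any]) -> int:
    # Sum only the integer tier counts.
    return sum(v for v in cache_creation.values() if isinstance(v, int))


def aggregate_with_setdefault(usage: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for the correct behaviour: the legacy value wins."""
    result = dict(usage)
    total = _tier_total(result.pop("cache_creation"))
    result.setdefault("cache_creation_input_tokens", total)
    return result


def aggregate_with_assignment(usage: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical regression: the legacy value is overwritten by the tier sum."""
    result = dict(usage)
    total = _tier_total(result.pop("cache_creation"))
    result["cache_creation_input_tokens"] = total
    return result


def fixture(legacy: int) -> Dict[str, Any]:
    """Build the test fixture with a configurable legacy scalar."""
    return {
        "input_tokens": 100,
        "output_tokens": 50,
        "cache_creation_input_tokens": legacy,
        "cache_creation": {
            "ephemeral_1h_input_tokens": 200,
            "ephemeral_5m_input_tokens": 100,
        },
    }


key = "cache_creation_input_tokens"
# Legacy 300 equals the tier sum: both variants agree, so the test cannot tell them apart.
with_300 = (aggregate_with_setdefault(fixture(300))[key], aggregate_with_assignment(fixture(300))[key])
# Legacy 999 differs from the tier sum: the variants diverge, so a regression would be caught.
with_999 = (aggregate_with_setdefault(fixture(999))[key], aggregate_with_assignment(fixture(999))[key])
print(with_300, with_999)  # (300, 300) (999, 300)
```

Only the second fixture separates the two implementations, which is the property the regression test needs.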
How to fix. Change the fixture so the legacy value differs from the tier sum, e.g.:
    "cache_creation_input_tokens": 999,  # legacy field already present
    "cache_creation": {
        "ephemeral_1h_input_tokens": 200,
        "ephemeral_5m_input_tokens": 100,
    },
and assert result["cache_creation_input_tokens"] == 999. Under setdefault the legacy 999 wins; under plain assignment the value would become 300 and the assertion would fail, giving the test the discriminating power it claims to have.
The legacy value (300) equalled the tier sum (200+100=300), so the assertion would pass even if setdefault were replaced with a plain assignment. Use 999 to make the test discriminating.
Closes #1697
What changed
_parse_usage_model in langfuse/langchain/CallbackHandler.py now handles Anthropic's cache_creation field when it arrives as a nested dict (tiered prompt-caching API, langchain-anthropic ≥ 0.3.x).
Root cause
Anthropic's extended prompt-caching API includes a cache_creation key in the LLM output usage dict:
    {
        "input_tokens": 9454,
        "output_tokens": 380,
        "cache_read_input_tokens": 0,
        "cache_creation": {
            "ephemeral_1h_input_tokens": 500,
            "ephemeral_5m_input_tokens": 0,
        },
    }
v2.x (≤ 2.60.10): The old final filter (v is not None and not isinstance(v, str)) let the nested dict pass through into UpdateGenerationBody.usageDetails, which is typed as Union[Dict[str, int], ...]. Pydantic raised a ValidationError ("value is not a valid integer") that was swallowed by the ingestion queue, silently dropping the entire generation end() event: every trace had endTime=null and input=0 / output=0.
v4.x (HEAD before this PR): The isinstance(v, int) filter correctly prevented the crash, but the cache_creation dict was silently discarded, losing all cache-creation token data.
Fix
Before the final isinstance(v, int) filter, pop cache_creation and flatten each per-tier value into a cache_creation_{tier_key} entry (matching the pattern already used for input_token_details). If the total is non-zero, also set cache_creation_input_tokens via setdefault so the legacy scalar field is not overwritten if already present.
Tests added (tests/unit/test_parse_usage_model.py)
- test_anthropic_cache_creation_dict_flattened
- test_anthropic_cache_creation_all_zeros_no_aggregate
- test_anthropic_cache_creation_legacy_field_not_overwritten: setdefault preserves existing cache_creation_input_tokens value
Existing test_standard_tier_input_token_details and test_priority_tier_not_subtracted still pass unchanged.
Verification
All pass. No live-provider tests are needed since the fix is unit-testable.
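To make the root cause concrete, the two final filters quoted above can be compared on the sample usage dict. This is a sketch under stated assumptions: the filter conditions are taken from the description, but the function names old_final_filter and new_final_filter and the dict-comprehension form are hypothetical, not the library's actual code.

```python
from typing import Any, Dict

usage_model: Dict[str, Any] = {
    "input_tokens": 9454,
    "output_tokens": 380,
    "cache_read_input_tokens": 0,
    "cache_creation": {
        "ephemeral_1h_input_tokens": 500,
        "ephemeral_5m_input_tokens": 0,
    },
}


def old_final_filter(d: Dict[str, Any]) -> Dict[str, Any]:
    # v2.x condition as quoted: keeps anything that is neither None nor a str.
    return {k: v for k, v in d.items() if v is not None and not isinstance(v, str)}


def new_final_filter(d: Dict[str, Any]) -> Dict[str, Any]:
    # v4.x condition as quoted: keeps only int values.
    return {k: v for k, v in d.items() if isinstance(v, int)}


kept_v2 = old_final_filter(usage_model)
kept_v4 = new_final_filter(usage_model)

# Under the old condition the nested dict survives and would reach an int-typed field.
print(type(kept_v2["cache_creation"]).__name__)  # dict
# Under the new condition the nested dict is dropped along with its token counts.
print("cache_creation" in kept_v4)  # False
```

Neither filter alone preserves the per-tier counts, which is why the flattening has to run before the final filter.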
Greptile Summary
This PR fixes _parse_usage_model in langfuse/langchain/CallbackHandler.py to correctly handle the Anthropic extended prompt-caching API, where cache_creation arrives as a nested dict of per-tier token counts rather than a scalar integer.
- When cache_creation is a dict, the fix flattens each tier into a cache_creation_{tier_key} key and uses setdefault to set the legacy aggregate cache_creation_input_tokens only when the total is non-zero, preserving any pre-existing scalar value.
Confidence Score: 5/5
Safe to merge — the change is narrowly scoped to a single parsing helper, all existing tests still pass, and three new unit tests cover the key scenarios introduced by this fix.
The fix is a well-contained addition inside an already-guarded if isinstance(usage_model, dict) block, the logic is straightforward dict-flattening with no side effects on other fields, and the setdefault correctly prevents clobbering an existing scalar. The one theoretical gap — a non-integer tier value being silently skipped — mirrors how the rest of the function already handles unexpected types via the final isinstance(v, int) filter.
No files require special attention.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[_parse_usage_model called] --> B{has __dict__?}
    B -- yes --> C[usage equals usage.__dict__]
    B -- no --> D[usage_model equals usage.copy]
    C --> D
    D --> E{OpenAI format?}
    E -- yes --> F[return usage_model as-is]
    E -- no --> G[Apply conversion_list key renames]
    G --> H[Flatten input_token_details]
    H --> I[Flatten output_token_details]
    I --> J[Flatten Vertex AI token details]
    J --> N{cache_creation present AND is dict?}
    N -- no --> P
    N -- yes --> O[Pop cache_creation dict and flatten tier keys]
    O --> O2{total greater than 0?}
    O2 -- yes --> O3[setdefault cache_creation_input_tokens equals total]
    O2 -- no --> P
    O3 --> P[Final filter: keep only int values]
    P --> Q{usage_model empty?}
    Q -- yes --> R[return None]
    Q -- no --> S[return usage_model]
Reviews (1): Last reviewed commit: "fix(langchain): handle Anthropic cache_c..."