Skip to content

feat(datasets): support multimodal dataset items#1710

Open
wochinge wants to merge 11 commits into
mainfrom
feature/lfe-10289-python-sdk-changes
Open

feat(datasets): support multimodal dataset items#1710
wochinge wants to merge 11 commits into
mainfrom
feature/lfe-10289-python-sdk-changes

Conversation

@wochinge

@wochinge wochinge commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Adds generated API support for dataset item media references and trace-less media uploads.
  • Adds SDK support for creating dataset items with LangfuseMedia in input, expected_output, and metadata.
  • Adds optional get_dataset(..., resolve_media_references=True) hydration to replace backend-reported media reference JSONPaths with LangfuseMediaReference objects.
  • Adds focused unit coverage and an e2e test using a real JPEG fixture.

Linear

Major Decisions

  • Keeps get_dataset(...) as the only high-level SDK read surface for resolved dataset media references.
  • Uses backend-provided JSONPaths for read hydration and jsonpath-ng recursive discovery for create-path media replacement.
  • Splits generated API updates and handwritten SDK changes into separate commits to simplify review.

Review Focus

  • Dataset media upload behavior without trace/observation context.
  • JSONPath-based replacement for both create and read paths.
  • Public SDK surface: LangfuseMediaReference export and get_dataset(..., resolve_media_references=True).

Greptile Summary

This PR adds multimodal support to Langfuse dataset items, allowing LangfuseMedia objects to be embedded in input, expected_output, and metadata fields and uploaded trace-lessly, and adding a resolve_media_references=True option to get_dataset() that hydrates backend-provided JSONPath references into LangfuseMediaReference objects.

  • create_dataset_item now recursively discovers LangfuseMedia values via $..this`` (jsonpath-ng), uploads each one synchronously through the updated _upload_media_sync (which accepts `trace_id=None`), deduplicates by `media_id` within a single call, and replaces each value with its reference string before sending to the API.
  • get_dataset(resolve_media_references=True) passes include_media_references to the list endpoint, then walks backend-provided JSONPath annotations to replace reference strings with frozen LangfuseMediaReference dataclass instances that expose fetch_bytes(), fetch_base64(), and fetch_data_uri() helpers.
  • Generated API types (DatasetItemMediaReference, DatasetItemMediaReferenceField, DatasetItemMediaReferenceMedia) and the GetMediaUploadUrlRequest schema are updated to make trace_id and field optional, enabling trace-less uploads.

Confidence Score: 4/5

Safe to merge; all changed paths work correctly for the expected JSON-like data shapes, and the two main concerns are style/fragility rather than broken behavior.

The media-upload and hydration flows are logically sound and covered by both unit and E2E tests. The two notable findings are: _replace_json_path_value discards the return value of update() and relies implicitly on in-place mutation (works today, diverges from how _process_dataset_item_media handles the same operation); and fetch_bytes() makes a blocking HTTP call with no timeout. Neither produces wrong output in the current code paths, but both leave the door open for subtle failures on refactoring or slow network conditions.

langfuse/_client/client.py (_replace_json_path_value) and langfuse/media.py (LangfuseMediaReference.fetch_bytes) are the two spots that would benefit from a second look before shipping to customers at scale.

Sequence Diagram

sequenceDiagram
    participant User
    participant Langfuse SDK
    participant MediaManager
    participant Langfuse API
    participant Object Storage

    Note over User,Object Storage: create_dataset_item with LangfuseMedia

    User->>Langfuse SDK: create_dataset_item(input={"img": LangfuseMedia(...)})
    Langfuse SDK->>Langfuse SDK: _process_dataset_item_media()<br/>jsonpath-ng discovers LangfuseMedia nodes
    Langfuse SDK->>MediaManager: _upload_media_sync(media)
    MediaManager->>Langfuse API: media.get_upload_url(trace_id=None, field=None)
    Langfuse API-->>MediaManager: upload_url + media_id
    MediaManager->>Object Storage: PUT bytes
    Object Storage-->>MediaManager: 200 OK
    MediaManager->>Langfuse API: media.patch(media_id, upload_http_status=200)
    MediaManager-->>Langfuse SDK: done
    Langfuse SDK->>Langfuse API: dataset_items.create(input={"img": "@@@langfuseMedia:..."})
    Langfuse API-->>Langfuse SDK: DatasetItem
    Langfuse SDK-->>User: DatasetItem (reference strings)

    Note over User,Object Storage: get_dataset with resolve_media_references=True

    User->>Langfuse SDK: get_dataset(name, resolve_media_references=True)
    Langfuse SDK->>Langfuse API: dataset_items.list(includeMediaReferences=true)
    Langfuse API-->>Langfuse SDK: items with mediaReferences[] (JSONPaths + signed URLs)
    loop For each item
        Langfuse SDK->>Langfuse SDK: _hydrate_dataset_item_media_references()<br/>jsonpath-ng update per JSONPath to LangfuseMediaReference
    end
    Langfuse SDK-->>User: DatasetClient (LangfuseMediaReference in fields)
    User->>Langfuse SDK: item.input["img"].fetch_bytes()
    Langfuse SDK->>Object Storage: httpx.get(signed_url)
    Object Storage-->>Langfuse SDK: bytes
    Langfuse SDK-->>User: bytes
Loading
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
langfuse/_client/client.py:3418-3426
`_replace_json_path_value` silently discards the return value of `update()`, relying entirely on in-place mutation of `value`. This works today because API-deserialized data is always a mutable `dict` or `list`, but it diverges from the pattern used in `_process_dataset_item_media` (which correctly captures `data = match.full_path.update(data, ...)`). If `value` were ever an immutable type at a non-root path, the replacement would be silently lost without any log or exception.

```suggestion
        try:
            value = parse_jsonpath(json_path).update(value, replacement)
        except Exception as e:
            langfuse_logger.debug(
                f"Failed to hydrate dataset media reference at JSONPath {json_path}",
                exc_info=e,
            )

        return value
```

### Issue 2 of 3
langfuse/media.py:50-53
`httpx.get(self.url)` is called without a timeout, so it can block indefinitely if the signed URL is slow or the connection hangs. This is especially relevant in hot paths such as evaluation loops that call `fetch_bytes()` on each dataset item's resolved media.

```suggestion
    def fetch_bytes(self, timeout: float = 30.0) -> bytes:
        """Fetch the media content from the signed URL."""
        response = httpx.get(self.url, timeout=timeout)
        response.raise_for_status()
```

### Issue 3 of 3
langfuse/_task_manager/media_manager.py:233-260
**Silent return on incomplete media is undocumented**

`_upload_media_sync` silently returns `None` when any content field is `None`. The caller `_upload_dataset_item_media` then adds `media_id` to `uploaded_media_ids` and returns the reference string as if the upload succeeded — meaning the dataset item would store a reference to media that was never uploaded.

In practice this path is currently unreachable: the earlier check `if reference_string is None or media_id is None: raise ValueError(...)` in `_upload_dataset_item_media` ensures all content fields are populated before this method is called (because `_media_id` is derived from `_content_sha256_hash`, which requires `_content_bytes`). But the silent return offers no signal to future callers that skip that guard, and no explanation is left in a comment. A guarded `raise` or at minimum a comment explaining the invariant would make this safer.

Reviews (1): Last reviewed commit: "feat(datasets): support media references" | Re-trigger Greptile

Greptile also left 3 inline comments on this PR.

wochinge added 2 commits June 15, 2026 11:40
Adds dataset item media upload and optional get_dataset media hydration.
@github-actions

Copy link
Copy Markdown

@claude review

Comment thread langfuse/_client/client.py
Comment thread langfuse/media.py Outdated
Comment thread langfuse/_task_manager/media_manager.py
Comment thread langfuse/_client/client.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd791b64ee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread langfuse/_client/client.py Outdated
Comment thread langfuse/media.py Outdated
Comment thread langfuse/_client/client.py
Comment thread langfuse/_client/client.py Outdated

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actual (non generated changes)

Comment on lines +3347 to +3348
# Avoid jsonpath-ng here: dataset writes should keep working
# under python -OO where parser docstrings may be stripped.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in my opinion an okay compromise as the write path doesn't become a hard blocker for users and easy to maintain

Avoid hand rolling jsonpath parsing for the read path -> users can also opt out here so in my opinion fine

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actual (non generated changes)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actual (non generated changes)

Comment thread langfuse/media.py

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actual (non generated changes)

Comment thread pyproject.toml
"opentelemetry-api>=1.33.1,<2",
"opentelemetry-sdk>=1.33.1,<2",
"opentelemetry-exporter-otlp-proto-http>=1.33.1,<2",
"jsonpath-ng>=1.8.0,<2",

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apache 2.0 license, 731 stars, actively maintained: https://github.com/h2non/jsonpath-ng
Annoying thing is the issues about running optimized python but I think we addressed them enough: https://github.com/h2non/jsonpath-ng#special-note-about-ply-and-docstrings

Next runner up would be jsonpath-python (https://pypi.org/project/python-jsonpath/)

  • MIT license
  • only 69 stars

I'd avoid hand rolling the parsing. Especially since we have find references via the TS jsonpath library we don't know exactly which syntax might get in there and which cases of the spec we need to handle.

Comment thread langfuse/_client/client.py
Resolved LangfuseMediaReference items from get_dataset(resolve_media_
references=True) discarded the original @@@langfuseMedia:...@@@ string,
so feeding them back into create_dataset_item or run_experiment
serialized them as opaque dicts and orphaned the media. Persist the
reference string and emit it from both _process_dataset_item_media and
EventSerializer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread langfuse/_client/client.py
wochinge added a commit to langfuse/langfuse-js that referenced this pull request Jun 16, 2026
Mirror langfuse/langfuse-python#1710 to add media support to datasets:

- LangfuseMediaReference (core): a signed-URL media handle returned when
  resolving dataset media, with urlIsExpired / fetchBytes / fetchBase64 /
  fetchDataUri helpers for feeding media to LLM providers
- uploadMedia (core): reusable upload routine that works without a trace
  context; MediaService now delegates to it instead of duplicating the
  upload + backoff logic
- DatasetManager.createItem: uploads any LangfuseMedia found in input,
  expectedOutput, or metadata (deduped) and replaces it with a reference
  string before creating the item; createDatasetItem now routes here
- DatasetManager.get(resolveMediaReferences): requests includeMediaReferences
  and hydrates reference strings into LangfuseMediaReference objects using
  jsonpath-plus (eval-free, so it is safe under strict CSP / edge runtimes)
- re-export LangfuseMedia and LangfuseMediaReference from @langfuse/client

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
wochinge and others added 2 commits June 16, 2026 16:25
_process_dataset_item_media only recursed into list/dict, so a
LangfuseMedia held in a tuple, set, or frozenset slipped through to
fern's encoder and got silently base64-inlined instead of uploaded.
Widen the walker to those containers and rebuild them in place.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…name

The DatasetItemMediaReferenceField enum value changed from
expected_output to expectedOutput. Decouple hydration from the wire
value by mapping the enum member to the model attribute, so the rename
(and any future one) no longer silently skips expected-output media
references.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread langfuse/_client/client.py Outdated
Rebuilding tuples via type(data)(iterable) breaks namedtuple/NamedTuple
subclasses, which take positional field args rather than an iterable.
Keep list/set/frozenset handling and leave tuples untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread langfuse/_client/resource_manager.py
get_singleton_httpx_client silently returned the first-inserted
instance, so a LangfuseMediaReference fetched from one client could go
out through another client's transport (proxy/CA/mTLS). Mirror
get_client: warn and fall back to default httpx when multiple clients
exist, and let fetch_bytes/fetch_base64/fetch_data_uri take an explicit
httpx client to deterministically honor per-client settings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment on lines 62 to 73
or f"<Upload handling failed for LangfuseMedia of type {obj._content_type}>"
)

if (
isinstance(obj, LangfuseMediaReference)
and obj.reference_string is not None
):
return obj.reference_string

# Check if numpy is available and if the object is a numpy scalar
# If so, convert it to a Python scalar using the item() method
if np is not None and isinstance(obj, np.generic):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new LangfuseMediaReference opacity branch at serializer.py:65-69 only fires when an LMR is dispatched directly to default() (top-level or as a dict/list element). When an LMR is nested inside a user @dataclass or pydantic.BaseModel, the existing is_dataclass: return asdict(obj) branch at line 130 (and the model_dump() branch at 132-139) does CPython's recursive conversion that never re-enters default(), so the nested LMR (itself a frozen dataclass) is emitted as a flat dict exposing url, url_expiry, content_length, and reference_string as raw JSON fields — defeating the opacity this PR was explicitly added to provide. Fix: in the dataclass and BaseModel branches, iterate fields manually and route each value through self.default() so nested LMRs (and LangfuseMedia) hit the opacity branches.

Extended reasoning...

What the bug is

This PR adds a new opacity branch in EventSerializer._default_inner for LangfuseMediaReference (serializer.py:65-69), mirroring the existing branch for LangfuseMedia at 59-63. Both are meant to collapse the rich media object back to its @@@langfuseMedia:...@@@ reference string instead of letting CPython's default object serialization leak the signed URL and other transport-only fields.

That branch only fires when the LMR is dispatched directly to default() — i.e., top-level, or as a dict/list element (since the dict/list branches at lines 148-152 explicitly call self.default(v) on each value). But the dataclass branch (return asdict(obj), line 130) and the Pydantic BaseModel branch (return obj.model_dump(), line 139) do CPython-native recursive conversion that never calls back into default(). LangfuseMediaReference is itself a @dataclass(frozen=True), so when wrapped in a parent dataclass asdict walks into it and emits its fields as a flat dict.

Why existing safeguards don't catch it

  • The new unit test test_langfuse_media_reference_serializes_to_reference_string only verifies the top-level case.
  • The existing LangfuseMedia branch has the exact same blind spot but is partially masked by the trace-side flow: MediaManager._find_and_process_media traverses dicts/lists and replaces nested LangfuseMedia instances with reference-string equivalents before EventSerializer ever sees them. There's no equivalent traversal for LangfuseMediaReference.
  • Same shape exists on the write path via fern's jsonable_encoder (langfuse/api/core/jsonable_encoder.py), which also uses asdict / model_dump — so the same nested-LMR-in-user-dataclass leak reaches create_dataset_item API bodies and persists in stored dataset items, breaking the read-side round-trip guarantee that bug Bump langchain from 0.0.237 to 0.0.262 #5's reference_string field added for direct-LMR-in-dict.

Step-by-step proof

Verified empirically against current HEAD:

@dataclass
class Container:
    image_ref: LangfuseMediaReference
    note: str = 'hello'

lmr = LangfuseMediaReference(
    media_id='abc', content_type='image/png',
    url='https://signed.example.com/SECRET_TOKEN',
    url_expiry='2026-06-15T12:00:00Z', content_length=1234,
    reference_string='@@@langfuseMedia:type=image/png|id=abc|source=bytes@@@',
)
s = EventSerializer()

print(s.encode(lmr))
# "@@@langfuseMedia:type=image/png|id=abc|source=bytes@@@"        (opacity preserved)

print(s.encode({'image_ref': lmr}))
# {"image_ref": "@@@langfuseMedia:..."}                            (opacity preserved)

print(s.encode(Container(image_ref=lmr)))
# {"image_ref": {"media_id": "abc", "content_type": "image/png",
#                "url": "https://signed.example.com/SECRET_TOKEN",
#                "url_expiry": "...", "content_length": 1234,
#                "reference_string": "@@@..."}, "note": "hello"}   (URL LEAKED)

Same shape for pydantic.BaseModel: s.encode(PydanticContainer(image_ref=lmr)) emits the identical flat-dict structure with the signed URL inlined.

Impact

The signed url and url_expiry end up serialized into trace data and (via fern's encoder on the write path) into create_dataset_item API bodies / stored dataset items, even though this PR explicitly added the opacity branch to prevent that.

The round-trip case from bug #5's fix breaks specifically when an LMR is wrapped in a user dataclass/Pydantic model: the server's media-reference scanner still finds the @@@langfuseMedia:...@@@ pattern inside the persisted reference_string field on hydration, so get_dataset(resolve_media_references=True) emits an LMR at $['wrap']['image_ref']['reference_string'] but leaves the sibling url/url_expiry/content_length strings stale — the user observes a mixed/confusing shape where item.input['wrap']['image_ref'] is a dict (not an LMR) and ['url'] is a soon-to-expire string from the original write.

The trigger is plausible: users with @observe-decorated functions returning typed results (dataclasses or Pydantic models) wrapping LMRs obtained from get_dataset(resolve_media_references=True) — a natural pattern in ML/LLM application code that builds domain types around SDK outputs.

How to fix

In the dataclass branch (and analogously for the Pydantic branch), iterate fields manually and route each value back through self.default() so nested LMRs / LangfuseMedia hit the opacity branches:

if is_dataclass(obj):
    from dataclasses import fields
    return {f.name: self.default(getattr(obj, f.name)) for f in fields(obj)}

if isinstance(obj, BaseModel):
    obj.model_rebuild()
    if isinstance(raw := getattr(obj, "raw", None), BaseModel):
        raw.model_rebuild()
    return {k: self.default(v) for k, v in obj.model_dump().items()}

The Pydantic case needs a small pre-pass or field iteration since model_dump() itself recursively converts; alternatively the LMR substitution can be done before model_dump() is called. Either approach preserves opacity for nested LMRs without regressing the LMR-in-dict round-trip the PR already restored.

(The write path — fern's jsonable_encoder — has the same structural issue and is worth tracking separately; the serializer fix alone closes the trace-side leak.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant