Include media elements (images, video, audio) in content_html and markdown exports

Currently, images are the only media that is extracted, into its own field on the result struct (`result.images`), and even images are stripped from the `content_html` and `content_markdown` exports during document cleaning.

I created a PR (#4) with options to preserve all media elements in the HTML and markdown outputs:

1. **Images** - Adds the `img`, `figure`, `figcaption`, `picture`, and `source` tags to the HTML extraction output when `include_images` is enabled. Also preserves the original `content_html` during fallback extraction when it contains `<img>` tags.

2. **Video/Audio** - Adds `include_videos` and `include_audio` options (default: `false`) to control extraction of `<video>`, `<audio>`, `<source>`, and `<track>` elements. These were previously stripped during document cleaning.
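
To make the opt-in behaviour concrete, here is a minimal standalone sketch of how the flags could map to the set of tags that survive cleaning. It is not the code from the PR: `MediaOptions` and `preserved_media_tags` are hypothetical names, the split of `<source>` and `<track>` between the video and audio flags is an assumption, and defaulting `include_images` to `false` is also assumed (the description above only states defaults for the video and audio options).

```rust
/// Hypothetical options struct for illustration only; the field names mirror
/// the flags named in this issue, not necessarily the crate's real config.
#[derive(Debug, Clone, Copy, Default)]
pub struct MediaOptions {
    pub include_images: bool,
    pub include_videos: bool,
    pub include_audio: bool,
}

/// Returns the media tags a cleaner would keep for the given options,
/// sorted and de-duplicated. With every flag off, nothing is preserved.
pub fn preserved_media_tags(opts: &MediaOptions) -> Vec<&'static str> {
    let mut tags: Vec<&'static str> = Vec::new();
    if opts.include_images {
        tags.extend(["img", "figure", "figcaption", "picture", "source"]);
    }
    if opts.include_videos {
        // Assumption: <source> and <track> are kept alongside <video>.
        tags.extend(["video", "source", "track"]);
    }
    if opts.include_audio {
        // Assumption: <source> and <track> are kept alongside <audio>.
        tags.extend(["audio", "source", "track"]);
    }
    // <source> can be added by more than one flag, so collapse duplicates.
    tags.sort_unstable();
    tags.dedup();
    tags
}
```

Because every field defaults to `false` in this sketch, `MediaOptions::default()` preserves no media tags, which matches the explicit-flag design described here.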

I implemented it so that media is preserved only when the corresponding flag is set explicitly, and I am open to further discussion and modifications. I think it would go a long way toward supporting broader production readiness (#1).

---

P.S. Thank you for this epic project! I created [NAPI bindings](https://github.com/gorango/napi-rs-trafilatura) to enable use in the Node ecosystem.
