Include media elements (images, video, audio) in content_html and markdown exports #5

@gorango

Currently, images are extracted only into their own field on the result struct (result.images); they are stripped from the content_html and content_markdown exports during document cleaning.

I created a PR (#4) with options to preserve all media elements in the HTML and markdown outputs:

  1. Images - Adds img, figure, figcaption, picture, and source tags to the HTML extraction output when include_images is enabled. Also preserves the original content_html during fallback extraction when it contains <img> tags.

  2. Video/Audio - Adds include_videos and include_audio options (default: false) to control extraction of <video>, <audio>, <source>, and <track> elements. These were previously stripped during document cleaning.
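
To make the flag behavior above concrete, here is a minimal sketch in Rust of how the three options could gate which media tags survive cleaning. It is not the code from the PR: the MediaOptions struct and the is_media_tag_allowed function are hypothetical names used only for illustration, and the default of false for include_images is an assumption (the issue only states the default for include_videos and include_audio).

```rust
/// Hypothetical option flags. The field names follow the issue text;
/// the struct itself is illustrative and not the library's API.
/// Deriving Default makes every flag false (an assumption for include_images).
#[derive(Debug, Clone, Copy, Default)]
pub struct MediaOptions {
    pub include_images: bool,
    pub include_videos: bool,
    pub include_audio: bool,
}

/// Returns true when a media tag should be kept in the HTML output.
/// Non-media tags return false here; they would be handled by the
/// cleaner's other rules, which this sketch does not model.
pub fn is_media_tag_allowed(tag: &str, opts: &MediaOptions) -> bool {
    match tag {
        "img" | "figure" | "figcaption" | "picture" => opts.include_images,
        "video" => opts.include_videos,
        "audio" => opts.include_audio,
        // <source> can appear inside <picture>, <video>, and <audio>.
        "source" => opts.include_images || opts.include_videos || opts.include_audio,
        // <track> only appears inside <video> and <audio>.
        "track" => opts.include_videos || opts.include_audio,
        _ => false,
    }
}
```

With MediaOptions::default(), every media tag is rejected, which matches the opt-in design described in this issue. Setting only include_videos keeps video, source, and track while still dropping img and audio.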

I implemented it so that the new behavior is opt-in and only enabled with explicit flags. I'm open to further discussion and modifications. I think it would go a long way toward supporting broader production readiness (#1).


P.S. Thank you for this epic project! I created NAPI bindings to enable use in the Node ecosystem.
