Currently, images are extracted only to their own field in the result struct (`result.images`); they are stripped from the `content_html` and `content_markdown` exports during document cleaning.
I created a PR (#4) with options to preserve all media elements in the HTML and markdown outputs:
- Images - Add `img`, `figure`, `figcaption`, `picture`, and `source` tags to the HTML extraction output when `include_images` is enabled. Also preserve the original `content_html` during fallback extraction when it contains `<img>` tags.
- Video/Audio - Add `include_videos` and `include_audio` options (default: `false`) to control extraction of `<video>`, `<audio>`, `<source>`, and `<track>` elements. These were previously stripped during document cleaning.
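To make the intended behavior concrete, here is a rough sketch of how the flags could map to the set of tags that survive cleaning. This is illustrative only, not the code in the PR: the `MediaOptions` struct and the `preserved_media_tags` function are hypothetical names, and treating `<source>` as shared between `<picture>`, `<video>`, and `<audio>` is my assumption about how the overlap should be handled.

```rust
/// Hypothetical options struct for illustration; the field names mirror the
/// flags described above, but the real type in the PR may look different.
#[derive(Debug, Clone, Copy)]
pub struct MediaOptions {
    pub include_images: bool,
    pub include_videos: bool,
    pub include_audio: bool,
}

/// Hypothetical helper: returns the media tags kept during document cleaning
/// for the given options. With every flag off, nothing is preserved.
pub fn preserved_media_tags(opts: &MediaOptions) -> Vec<&'static str> {
    let mut tags: Vec<&'static str> = Vec::new();
    if opts.include_images {
        tags.extend_from_slice(&["img", "figure", "figcaption", "picture"]);
    }
    if opts.include_videos {
        tags.push("video");
    }
    if opts.include_audio {
        tags.push("audio");
    }
    // Assumption: <source> can be a child of <picture>, <video>, or <audio>,
    // so it is kept whenever any media flag is on.
    if opts.include_images || opts.include_videos || opts.include_audio {
        tags.push("source");
    }
    // <track> only applies to <video> and <audio>.
    if opts.include_videos || opts.include_audio {
        tags.push("track");
    }
    tags
}
```

With all three flags set to `false`, the helper returns an empty list, which matches the previous behavior of stripping every media element.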
I implemented it so that the new behavior is only enabled with explicit flags, and I am open to further discussion and modifications. I think it would go a long way toward supporting broader production readiness (#1).
P.S. Thank you for this epic project! I created NAPI bindings to enable use in the Node ecosystem.