Wraps the Rust package html2md to convert HTML to Markdown. Note: only Markdown 1.0 is supported (no GFM, tables, etc.).
I personally found this useful for RAG (Retrieval-Augmented Generation). If you're indexing actual webpages, LLMs are generally fine with handling deeply nested messes. But, in my experience, converting to Markdown provides the following benefits:
- Size reduction and cost efficiency, because Markdown is generally more compact than HTML
- Which in turn allows the embeddings to capture more meaningful semantic context
- Which also allows more logical chunking for the embeddings
Note: This is a GPL'd package as the underlying Html2Md Rust package is GPL. Be aware of what that implies for your use case.
If available in Hex, the package can be installed
by adding html2md_ex to your list of dependencies in mix.exs:

    def deps do
      [
        {:html2md_ex, "~> 0.1.0"}
      ]
    end

The parse_html function will return a Markdown string:

    Html2MdEx.parse_html(html)
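
For the RAG use case mentioned earlier, here is a minimal sketch of how the converted Markdown might be chunked before embedding. `MarkdownChunker` and `chunk_by_heading/1` are hypothetical names and not part of html2md_ex. The sketch assumes headings show up as ATX lines (`#` through `######` followed by a space); if the converter's output uses underlined headings for some levels, the heading check would need adjusting.

```elixir
defmodule MarkdownChunker do
  @moduledoc """
  Hypothetical helper (not part of html2md_ex): splits a Markdown string
  into chunks, starting a new chunk at every ATX heading line.
  """

  @doc "Returns a list of trimmed, non-empty chunks."
  def chunk_by_heading(markdown) when is_binary(markdown) do
    markdown
    |> String.split("\n")
    |> Enum.chunk_while([], &step/2, &finish/1)
    |> Enum.map(fn lines -> lines |> Enum.join("\n") |> String.trim() end)
    |> Enum.reject(&(&1 == ""))
  end

  # Emit the lines collected so far when a heading starts a new chunk.
  defp step(line, acc) do
    if heading?(line) and acc != [] do
      {:cont, Enum.reverse(acc), [line]}
    else
      {:cont, [line | acc]}
    end
  end

  # Flush whatever is left once the input is exhausted.
  defp finish([]), do: {:cont, []}
  defp finish(acc), do: {:cont, Enum.reverse(acc), []}

  # Assumption: a heading is 1 to 6 '#' characters followed by a space.
  defp heading?(line), do: String.match?(line, ~r/^[#]{1,6} /)
end
```

With the package installed, the two steps could be combined as `html |> Html2MdEx.parse_html() |> MarkdownChunker.chunk_by_heading()`, and each resulting chunk passed to whatever embedding step you use.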