fix: add remark plugin to render raw HTML as literal text #16505
base: master
Implemented a missing MDAST stage that neutralizes raw HTML, as major LLM WebUIs do, ensuring consistent and safe Markdown rendering.

Introduced 'remarkLiteralHtml', a plugin that converts raw HTML nodes in the Markdown AST into plain-text equivalents while preserving indentation and line breaks. This ensures consistent rendering and prevents unintended HTML execution, without altering valid Markdown structure.

Kept 'remarkRehype' in the pipeline, since it performs the required conversion from MDAST to HAST for KaTeX, syntax highlighting, and HTML serialization.

Refined the link-enhancement logic to skip unnecessary DOM rewrites. This fixes a subtle bug where extra paragraphs were injected after the first line because the full innerHTML was reconstructed, and it ensures links open in new tabs only when required.

Final pipeline: remarkGfm -> remarkMath -> remarkBreaks -> remarkLiteralHtml -> remarkRehype -> rehypeKatex -> rehypeHighlight -> rehypeStringify
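To make the new MDAST stage concrete, here is a minimal sketch of a transform that turns `html` nodes into literal text. It is not the code from this PR: the `MdNode` type, `literalizeHtml`, and `remarkLiteralHtmlSketch` are hypothetical names, the node shape is a simplified stand-in for the real `mdast` types, and the sketch only keeps the raw string intact (how indentation and line breaks survive rendering is not shown).

```typescript
// Simplified MDAST-like node shape (assumption; real code would use `mdast` types).
interface MdNode {
  type: string;
  value?: string;
  children?: MdNode[];
}

// Containers whose children must be block-level ("flow") nodes.
const FLOW_PARENTS = new Set(["root", "blockquote", "listItem"]);

// Hypothetical helper: rewrite every `html` node as literal text, in place.
function literalizeHtml(node: MdNode): MdNode {
  if (!node.children) return node;
  node.children = node.children.map((child) => {
    if (child.type === "html") {
      const text: MdNode = { type: "text", value: child.value ?? "" };
      // Block-level HTML is wrapped in a paragraph so the tree stays valid;
      // inline HTML is replaced by a bare text node.
      return FLOW_PARENTS.has(node.type)
        ? { type: "paragraph", children: [text] }
        : text;
    }
    // Other nodes (including `code`) are kept; only their children are visited.
    return literalizeHtml(child);
  });
  return node;
}

// A unified-style plugin is a function that returns a tree transformer.
function remarkLiteralHtmlSketch() {
  return (tree: MdNode): void => {
    literalizeHtml(tree);
  };
}
```

Because the stage runs before 'remarkRehype', the later stages only ever see text nodes where raw HTML used to be, and code blocks pass through unchanged.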
Test sheet (reasoning_content and final content)

This patch aligns the WebUI Markdown pipeline with the behavior of widely used LLM renderers (OpenAI ChatGPT, Hugging Face Spaces, Anthropic...) by making raw HTML safe without sacrificing formatting fidelity.

This patch doesn't just "sanitize HTML": it neutralizes raw XML-like output (e.g. `<think>`, `<tool>`, `<meta>`, `<response>`, `<step>`, `<node>`, `<data>`). These symbolic or structural tags, whether produced by LLMs or part of generic XML fragments, are displayed as plain text rather than parsed as DOM, which preserves structure while keeping the UI safe and consistent.
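As an illustration of why text nodes are displayed rather than parsed, the sketch below escapes the characters that would otherwise open a tag or an entity when text is serialized to HTML. `escapeTextForHtml` is a hypothetical, hand-written stand-in; it is not the serializer's actual escaping routine, whose exact rules may differ.

```typescript
// Characters that would otherwise start a tag or an entity in HTML text content.
const HTML_ESCAPES: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
};

// Hypothetical helper: make a text node's value safe to emit as HTML text.
function escapeTextForHtml(value: string): string {
  return value.replace(/[&<>]/g, (ch) => HTML_ESCAPES[ch] ?? ch);
}
```

With this kind of escaping, a literal `<think>` reaches the browser as `&lt;think&gt;`, so it is shown as text and never becomes a DOM element.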
To reproduce the issue now, you have to explicitly ask the model to output XML-like tags in the stream, which is already a bit of a hack, since LLMs naturally know they're emitting Markdown.
Test prompt: Write HTML with real blank lines and indentation inside a code block and then output the same HTML outside a code block, so we can compare the rendering.

https://chatgpt.com/share/68ea480b-5c3c-8012-9201-62cfb687dc67

And also on llama.cpp with this PR: conversation_6b69f066-c32c-4b0b-9f0d-92dad9c31764_tu_peux_crire_exacte.json

At this point, we're actually doing slightly better than some major LLM WebUIs, so that's a good sign 😄
Great stuff overall! Just left a few architectural remarks that need to be addressed.
fix: add remark plugin to render raw HTML as literal text
Implemented a missing MDAST stage that neutralizes raw HTML, as major LLM
WebUIs do, ensuring consistent and safe Markdown rendering.

Introduced 'remarkLiteralHtml', a plugin that converts raw HTML nodes in the
Markdown AST into plain-text equivalents while preserving indentation and
line breaks. This ensures consistent rendering and prevents unintended HTML
execution, without altering valid Markdown structure.

Kept 'remarkRehype' in the pipeline, since it performs the required conversion
from MDAST to HAST for KaTeX, syntax highlighting, and HTML serialization.

Refined the link-enhancement logic to skip unnecessary DOM rewrites. This
fixes a subtle bug where extra paragraphs were injected after the first
line because the full innerHTML was reconstructed, and it ensures links open
in new tabs only when required.

Final pipeline: remarkGfm -> remarkMath -> remarkBreaks -> remarkLiteralHtml
-> remarkRehype -> rehypeKatex -> rehypeHighlight -> rehypeStringify

Close #16417
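
The link-enhancement change described in the commit message can be pictured as computing a small attribute patch per link instead of rebuilding the markup. The sketch below is an assumption-heavy illustration, not the PR's code: `LinkAttrs` and `linkPatch` are hypothetical names, and treating only absolute http(s) links as needing a new tab is a guess at what "only when required" means.

```typescript
// Hypothetical view of the attributes that matter on an <a> element.
interface LinkAttrs {
  href: string;
  target?: string;
  rel?: string;
}

// Assumption: only absolute http(s) links should open in a new tab.
const EXTERNAL_LINK = /^https?:\/\//i;

// Returns the attributes that still need to be set, or null when the link
// needs no change, so the caller can skip the DOM write entirely.
function linkPatch(attrs: LinkAttrs): Partial<LinkAttrs> | null {
  if (!EXTERNAL_LINK.test(attrs.href)) return null;
  const patch: Partial<LinkAttrs> = {};
  if (attrs.target !== "_blank") patch.target = "_blank";
  if (attrs.rel !== "noopener noreferrer") patch.rel = "noopener noreferrer";
  return Object.keys(patch).length > 0 ? patch : null;
}
```

A caller would apply a non-null patch with per-attribute writes on the existing element, which leaves the surrounding nodes alone, in contrast to re-serializing the container through innerHTML.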