Skip to content

Latest commit

 

History

History
187 lines (138 loc) · 5.62 KB

README.md

File metadata and controls

187 lines (138 loc) · 5.62 KB

html-to-markdown

A modern, fully typed Python library for converting HTML to Markdown. This library is a completely rewritten fork of markdownify with a modernized codebase, strict type safety and support for Python 3.9+.

Features

  • Full type safety with strict MyPy adherence
  • Functional API design
  • Extensive test coverage
  • Configurable conversion options
  • CLI tool for easy conversions
  • Support for pre-configured BeautifulSoup instances
  • Strict semver versioning

Installation

pip install html-to-markdown

Quick Start

Convert HTML to Markdown with a single function call:

from html_to_markdown import convert_to_markdown

html = '''
<article>
    <h1>Welcome</h1>
    <p>This is a <strong>sample</strong> with a <a href="https://example.com">link</a>.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
    </ul>
</article>
'''

markdown = convert_to_markdown(html)
print(markdown)

Output:

# Welcome

This is a **sample** with a [link](https://example.com).

* Item 1
* Item 2

Working with BeautifulSoup

If you need more control over HTML parsing, you can pass a pre-configured BeautifulSoup instance:

from bs4 import BeautifulSoup
from html_to_markdown import convert_to_markdown

# Configure BeautifulSoup with your preferred parser
soup = BeautifulSoup(html, 'lxml')  # Note: lxml requires additional installation
markdown = convert_to_markdown(soup)

Advanced Usage

Customizing Conversion Options

The library offers extensive customization through various options:

from html_to_markdown import convert_to_markdown

html = '<div>Your content here...</div>'
markdown = convert_to_markdown(
    html,
    heading_style="atx",  # Use # style headers
    strong_em_symbol="*",  # Use * for bold/italic
    bullets="*+-",  # Define bullet point characters
    wrap=True,  # Enable text wrapping
    wrap_width=100,  # Set wrap width
    escape_asterisks=True,  # Escape * characters
    code_language="python"  # Default code block language
)

Configuration Options

Option Type Default Description
autolinks bool True Auto-convert URLs to Markdown links
bullets str '*+-' Characters to use for bullet points
code_language str '' Default language for code blocks
heading_style str 'underlined' Header style ('underlined', 'atx', 'atx_closed')
escape_asterisks bool True Escape * characters
escape_underscores bool True Escape _ characters
wrap bool False Enable text wrapping
wrap_width int 80 Text wrap width

For a complete list of options, see the Configuration section below.

CLI Usage

Convert HTML files directly from the command line:

# Convert a file
html_to_markdown input.html > output.md

# Process stdin
cat input.html | html_to_markdown > output.md

# Use custom options
html_to_markdown --heading-style atx --wrap --wrap-width 100 input.html > output.md

View all available options:

html_to_markdown --help

Migration from Markdownify

For existing projects using Markdownify, a compatibility layer is provided:

# Old code
from markdownify import markdownify as md

# New code - works the same way
from html_to_markdown import markdownify as md

The markdownify function is an alias for convert_to_markdown and provides identical functionality.

Configuration

Full list of configuration options:

  • autolinks: Convert valid URLs to Markdown links automatically
  • bullets: Characters to use for bullet points in lists
  • code_language: Default language for fenced code blocks
  • code_language_callback: Function to determine code block language
  • convert: List of HTML tags to convert (None = all supported tags)
  • default_title: Use default titles for elements like links
  • escape_asterisks: Escape * characters
  • escape_misc: Escape miscellaneous Markdown characters
  • escape_underscores: Escape _ characters
  • heading_style: Header style (underlined/atx/atx_closed)
  • keep_inline_images_in: Tags where inline images should be kept
  • newline_style: Style for handling newlines (spaces/backslash)
  • strip: Tags to remove from output
  • strong_em_symbol: Symbol for strong/emphasized text (* or _)
  • sub_symbol: Symbol for subscript text
  • sup_symbol: Symbol for superscript text
  • wrap: Enable text wrapping
  • wrap_width: Width for text wrapping
  • convert_as_inline: Treat content as inline elements

Contribution

This library is open to contribution. Feel free to open issues or submit PRs. Its better to discuss issues before submitting PRs to avoid disappointment.

Local Development

  1. Clone the repo
  2. Install the system dependencies
  3. Install the full dependencies with uv sync
  4. Install the pre-commit hooks with:
    pre-commit install && pre-commit install --hook-type commit-msg
  5. Make your changes and submit a PR

License

This library uses the MIT license.

Acknowledgments

Special thanks to the original markdownify project creators and contributors.