Use beautifulsoup for HTML-sanitization #1327

Open
phorward opened this issue Nov 18, 2024 · 0 comments
Labels
feature New feature or request refactoring Pull requests that refactor code but do not change its behavior. security For security related bugs

Comments

@phorward
Member

This small example shows how simple HTML sanitization could be with Beautiful Soup:

from bs4 import BeautifulSoup

html_content = """
<html>
  <body>
    <h1 class="title" onclick="alert('bad')">Title</h1>
    <script>alert('This is malicious');</script>
    <p id="para1" style="color: red;">This is a paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html_content, "html.parser")

# Remove <script> and <style> elements together with their contents
for tag in soup(["script", "style"]):
    tag.decompose()

# Sanitize attributes: keep only allowlisted ones per tag;
# tags missing from the mapping lose all of their attributes
allowed_attributes = {"p": ["id"], "h1": []}
for tag in soup.find_all(True):
    allowed = allowed_attributes.get(tag.name, [])
    tag.attrs = {key: value for key, value in tag.attrs.items() if key in allowed}

print(soup.prettify())

We should consider this as part of #631.
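
The example above keeps every tag and only strips attributes. As a rough sketch of how this might be packaged into a reusable helper, the variant below also unwraps tags that are not on the allowlist, keeping their text. The names `sanitize_html`, `ALLOWED_TAGS` and `DROP_WITH_CONTENT` are hypothetical and not part of the existing code base; the allowlist is only an illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical allowlist: tag name -> attribute names that may be kept.
ALLOWED_TAGS = {"h1": set(), "p": {"id"}}
# Tags that are removed together with their contents.
DROP_WITH_CONTENT = ["script", "style"]


def sanitize_html(html: str) -> str:
    """Return `html` reduced to allowlisted tags and attributes (sketch only)."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop dangerous elements including everything inside them.
    for tag in soup(DROP_WITH_CONTENT):
        tag.decompose()

    # find_all() returns a list, so the tree can be modified while iterating.
    for tag in soup.find_all(True):
        allowed = ALLOWED_TAGS.get(tag.name)
        if allowed is None:
            # Unknown tag: remove the tag itself but keep its children.
            tag.unwrap()
        else:
            tag.attrs = {k: v for k, v in tag.attrs.items() if k in allowed}

    return str(soup)


if __name__ == "__main__":
    print(sanitize_html(
        '<div><h1 class="title" onclick="alert(1)">Title</h1>'
        "<script>alert(2)</script>"
        '<p id="para1" style="color: red;">Text</p></div>'
    ))
```

If URL-carrying attributes such as `href` or `src` were ever added to the allowlist, their values would presumably need separate validation (for example, rejecting `javascript:` URLs), which this sketch does not attempt.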

@phorward phorward added feature New feature or request security For security related bugs refactoring Pull requests that refactor code but do not change its behavior. labels Nov 18, 2024
@ArneGudermann ArneGudermann self-assigned this Nov 20, 2024