[draft] import_page: support file scheme and use bs4 to workaround missing 'body' element #456

jirib · 2025-02-17T18:28:50Z

Support the file scheme and use bs4 to work around a missing 'body' element.

jirib · 2025-02-17T20:17:24Z

v8/import_page/import_page.py

+        document = doc_template.format(
+            title=title,
+            slug=slug,
+            content=node.prettify()


Should we leave the HTML as it is? If the selected node is an 'article', should I extract only the contents of the 'article'? (Most likely the website template already has an 'article' element.)

I tried adding functionality for this; see ab15bed.

I think we should remove the wrapper element (e.g. <article>) by default.
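To make the wrapper question concrete, here is a minimal sketch of how the selected node could be returned with or without its enclosing element. The helper name `extract_content` and its `keep_wrapper` parameter are hypothetical and are not taken from the PR; the fallback to the whole document when the selector matches nothing is likewise an assumption about how the missing 'body' case might be handled.

```python
from bs4 import BeautifulSoup


def extract_content(html, selector="body", keep_wrapper=False):
    """Return the HTML of the first node matching `selector`.

    Hypothetical helper, not the plugin's actual code.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: if nothing matches (e.g. the page has no <body>),
    # fall back to the whole parsed document.
    node = soup.select_one(selector) or soup
    if keep_wrapper:
        # Includes the matched element itself, e.g. <article>...</article>.
        return str(node)
    # decode_contents() renders only the children, dropping the wrapper.
    return node.decode_contents().strip()


page = "<article><p>Hello</p></article>"
print(extract_content(page, "article"))
print(extract_content(page, "article", keep_wrapper=True))
```

Dropping the wrapper by default avoids nesting one 'article' inside another when the site template already supplies that element.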

Kwpolska · 2025-02-23T22:26:25Z

v8/import_page/import_page.py

+        while args:
+            arg = args.pop(0)
+            if arg == "-s" and args:
+                selector = args.pop(0)
+            elif arg == "-e" and args:
+                extractor = args.pop(0)
+            else:
+                urls.append(arg)  # Assume it's a page URL


You don’t need to parse args yourself; you should use the built-in support in doit. See just about any command plugin for an example.
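As a rough illustration of the suggestion above, command plugins usually declare their options in a `cmd_options` list of dicts and receive the parsed values in `_execute(self, options, args)`. The sketch below is only an assumption about how `-s` and `-e` could be declared. The class is a plain stand-in so the snippet runs without Nikola installed (a real plugin would subclass `nikola.plugin_categories.Command`), and the help texts and the "body" fallback are illustrative.

```python
class CommandImportPage:
    """Stand-in for a Nikola Command plugin (hypothetical sketch)."""

    name = "import_page"

    # doit-style option declarations; doit parses sys.argv for the plugin.
    cmd_options = [
        {
            "name": "selector",
            "short": "s",
            "long": "selector",
            "type": str,
            "default": "",
            "help": "CSS selector of the element to import",
        },
        {
            "name": "extractor",
            "short": "e",
            "long": "extractor",
            "type": str,
            "default": "",
            "help": "Expression used to extract content from the node",
        },
    ]

    def _execute(self, options, args):
        # `options` maps option names to parsed values;
        # `args` holds the remaining positional arguments (the page URLs).
        selector = options.get("selector") or "body"  # assumed fallback
        extractor = options.get("extractor") or None
        return [(url, selector, extractor) for url in args]


cmd = CommandImportPage()
print(cmd._execute({"selector": "article", "extractor": ""},
                   ["https://example.com/page.html"]))
```

With this shape, the hand-written `while args:` loop and the module-level `sys.argv` handling become unnecessary.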

Kwpolska · 2025-02-23T22:26:31Z

v8/import_page/import_page.py

+args = sys.argv[1:]
+selector = None # 'body'
+extractor = None # 'lambda node: BeautifulSoup(node.decode_contents(), "html.parser").prettify()'
+urls = []


Is this used?

Kwpolska · 2025-02-23T22:26:58Z

v8/import_page/import_page.py

+
+doc_template = '''<!--
+.. title: {title}
+.. slug: {slug}


Consider also adding a date (defaulting to now is fine).
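A minimal sketch of what that could look like, assuming the rest of the template (which the hunk above does not show) closes the comment and then emits the content. The names `DOC_TEMPLATE` and `render_document` are hypothetical, and the date format is an assumption modeled on the usual `YYYY-MM-DD HH:MM:SS UTC` metadata style.

```python
from datetime import datetime, timezone

# Assumed template layout; only the title and slug lines appear in the diff.
DOC_TEMPLATE = '''<!--
.. title: {title}
.. slug: {slug}
.. date: {date}
-->

{content}
'''


def render_document(title, slug, content, date=None):
    """Fill the template, defaulting the date to the current UTC time."""
    if date is None:
        date = datetime.now(timezone.utc)
    return DOC_TEMPLATE.format(
        title=title,
        slug=slug,
        date=date.strftime("%Y-%m-%d %H:%M:%S UTC"),
        content=content,
    )


print(render_document("Example", "example", "<p>Hello</p>"))
```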

jirib added 2 commits February 17, 2025 19:27

support import from file, workaround with bs4 around page without body

f7757d1

move to v8 as it works on v8.3.1

1df98ec

felixfontein changed the title ~~Support file scheme and use bs4 to workaround missing 'body' element~~ import_page: support file scheme and use bs4 to workaround missing 'body' element Feb 17, 2025

Kwpolska reviewed Feb 17, 2025

jirib added 2 commits February 17, 2025 21:11

fix: html encoding remnant

a066672

fix: we want html not text

e8a5507

jirib commented Feb 17, 2025

introduce selector and extractor; fix requirements.txt

ab15bed

jirib force-pushed the support_file_scheme branch from 8c15716 to ab15bed Compare February 18, 2025 00:43

fix: repair remnant vars

82cda9e

jirib changed the title ~~import_page: support file scheme and use bs4 to workaround missing 'body' element~~ [draft] import_page: support file scheme and use bs4 to workaround missing 'body' element Feb 18, 2025

Kwpolska reviewed Feb 23, 2025
