Home
This header implements a lightweight HTML parser and DOM-like data structures with three main components:
- HtmlElement — node structure representing elements and text ("plain") nodes.
- HtmlDocument — thin wrapper around an HtmlElement root.
- HtmlParser — streaming parser that turns HTML text into an HtmlDocument.
The parser uses std::shared_ptr<HtmlElement> for node references and provides selection helpers (id/class/tag, and a lightweight SelectElement rule mechanism).
// parse a string into a document and query elements
HtmlParser parser;
auto doc = parser.Parse(html_string);
auto root = doc->GetRoot();
// find first element with id "main"
auto e = root->GetElementById("main");
// get all <li> elements under root
auto lis = root->GetElementByTagName("li");
for (auto &li : lis) {
// li is std::shared_ptr<HtmlElement>
std::string text = li->GetValue();
}
Type: class HtmlElement : public enable_shared_from_this<HtmlElement>
Purpose: Represents a DOM node. It stores tag name, attributes, text value, class list, children and a weak parent pointer. Many selection and traversal helpers are implemented as member functions.
- HtmlElement(): default constructor.
- HtmlElement(shared_ptr<HtmlElement> p): constructs a node and sets its parent to p.
Returns: attribute value or empty string if missing.
Example: std::string href = node->GetAttribute("href");
Sets or removes an attribute. If value is empty the attribute is erased. Special handling: when setting the "class" attribute the internal classlist cache is rebuilt from whitespace-separated tokens.
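The set/erase behavior and the class-list rebuild described above can be sketched as follows. This is a minimal illustration, not the header's code: `AttrStore` and `SplitClassTokens` are hypothetical names, and clearing the class list when the class attribute is erased is an assumption.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a class attribute value on whitespace.
std::vector<std::string> SplitClassTokens(const std::string& value) {
    std::vector<std::string> tokens;
    std::istringstream in(value);
    std::string token;
    while (in >> token) {  // operator>> skips runs of whitespace
        tokens.push_back(token);
    }
    return tokens;
}

// Hypothetical stand-in for the attribute-related part of HtmlElement.
struct AttrStore {
    std::map<std::string, std::string> attributes;
    std::vector<std::string> classlist;

    void SetAttribute(const std::string& name, const std::string& value) {
        if (value.empty()) {
            attributes.erase(name);      // empty value removes the attribute
        } else {
            attributes[name] = value;
        }
        if (name == "class") {
            // Rebuild the cached tokens; an empty value clears them (assumption).
            classlist = SplitClassTokens(value);
        }
    }
};
```

With this sketch, `SetAttribute("class", "highlight  bold")` leaves two tokens in `classlist`, and `SetAttribute("href", "")` removes `href` from the map.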
Returns a copy of the attribute std::map<std::string,std::string>.
Returns: shared_ptr<HtmlElement> — the first element in this subtree whose id attribute equals id, or an empty shared_ptr if none found.
Behavior: Performs a depth-first traversal over children. Stops at the first match and returns it.
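A depth-first lookup of this kind might look like the sketch below. `Node` and `FindById` are hypothetical stand-ins for illustration; whether the real method also tests the starting element itself is not stated, so this version only inspects descendants.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for HtmlElement with only the fields this sketch needs.
struct Node {
    std::map<std::string, std::string> attributes;
    std::vector<std::shared_ptr<Node>> children;
};

// Depth-first, pre-order search over descendants.
// Returns an empty shared_ptr when nothing matches.
std::shared_ptr<Node> FindById(const std::shared_ptr<Node>& node,
                               const std::string& id) {
    if (!node) return nullptr;
    for (const auto& child : node->children) {
        if (!child) continue;
        auto it = child->attributes.find("id");
        if (it != child->attributes.end() && it->second == id) {
            return child;  // stop at the first match
        }
        if (auto found = FindById(child, id)) {
            return found;  // a deeper match under this child wins over later siblings
        }
    }
    return nullptr;
}
```

Because the search descends into each child before moving to the next sibling, a match nested under the first child is returned ahead of a match in a later sibling.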
Returns a std::vector<shared_ptr<HtmlElement>> containing all elements in this subtree with the given id (the implementation collects results via a helper).
Returns a vector of elements with the CSS class name. (Uses a recursive helper to collect matches.)
Returns the parsed list of class tokens as std::vector<std::string>. This list is kept in sync when SetAttribute("class", ...) is used or when class-manipulation helpers below are called.
Returns true if the element's class list contains cls.
Convenience helpers to manipulate the element's class list and keep the class attribute string synchronized via UpdateClassAttribute().
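One way to keep the token list and the attribute string in step is sketched below. Only `UpdateClassAttribute` is named in this document; `ClassState`, `AddClass`, `RemoveClass` and the `class_attribute` member are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the class-related state of an element.
struct ClassState {
    std::vector<std::string> classlist;
    std::string class_attribute;  // mirrors the "class" attribute string

    // Rebuild the attribute string from the token list, joined by single spaces.
    void UpdateClassAttribute() {
        class_attribute.clear();
        for (const auto& cls : classlist) {
            if (!class_attribute.empty()) class_attribute += ' ';
            class_attribute += cls;
        }
    }

    bool HasClass(const std::string& cls) const {
        return std::find(classlist.begin(), classlist.end(), cls) != classlist.end();
    }

    void AddClass(const std::string& cls) {
        if (cls.empty() || HasClass(cls)) return;  // ignore empty and duplicate tokens
        classlist.push_back(cls);
        UpdateClassAttribute();
    }

    void RemoveClass(const std::string& cls) {
        classlist.erase(std::remove(classlist.begin(), classlist.end(), cls),
                        classlist.end());
        UpdateClassAttribute();
    }
};
```

Every mutation ends with a call to `UpdateClassAttribute()`, so the string form never drifts from the list.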
Public wrapper that returns a vector of all descendant elements whose tag name matches name (case-insensitive via _stricmp in the implementation). Delegates to the private recursive helper.
Purpose: A lightweight XPath-like selector. See below under HtmlDocument Methods for more information.
Notes: This is a bespoke, simplified selector engine — consult the implementation before relying on complex XPath features.
Returns a shared_ptr<HtmlElement> to the parent or empty if none.
Return the next/previous sibling element or nullptr if not present. They identify the node in the parent's child vector and return the adjacent element.
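The sibling lookup can be pictured with the sketch below, which assumes a weak parent pointer and a child vector as described for HtmlElement. `Node`, `Sibling` and `AppendChild` are hypothetical names; the real accessors may be structured differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for HtmlElement: weak parent pointer plus child vector.
struct Node : std::enable_shared_from_this<Node> {
    std::weak_ptr<Node> parent;
    std::vector<std::shared_ptr<Node>> children;

    // offset = +1 gives the next sibling, -1 the previous one.
    std::shared_ptr<Node> Sibling(int offset) {
        auto p = parent.lock();
        if (!p) return nullptr;  // no parent, so no siblings
        auto self = shared_from_this();
        auto it = std::find(p->children.begin(), p->children.end(), self);
        if (it == p->children.end()) return nullptr;
        std::ptrdiff_t index = (it - p->children.begin()) + offset;
        if (index < 0 || index >= static_cast<std::ptrdiff_t>(p->children.size())) {
            return nullptr;      // ran off either end of the child vector
        }
        return p->children[static_cast<std::size_t>(index)];
    }
};

// Hypothetical helper that links a child to its parent.
void AppendChild(const std::shared_ptr<Node>& parent,
                 const std::shared_ptr<Node>& child) {
    child->parent = parent;
    parent->children.push_back(child);
}
```

The lookup is linear in the number of siblings, which is usually acceptable for small documents.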
Returns a copy of the internal children vector: std::vector<shared_ptr<HtmlElement>>.
Replace the element's inner text. If there are no child nodes it creates a single text ("plain") child node and sets its value; otherwise it writes the value into the first child.
Replace current children with the children of the tempRoot element (used when parsing new HTML for insertion).
Returns the element's value member. If value is empty and the element has a single "plain" child, returns that child's value.
Returns the element's tag name (a reference to the internal string).
Produce a text-only representation of the subtree (strips markup). Special tags (script/style, etc.) are skipped. This is used by text() to extract readable text.
Produce HTML serialization for the element (inner/outer). HtmlStylize is the recursive serializer; InnerHTML returns concatenation of serialized children or the element value when no children exist; OuterHTML wraps with start/end tags and attributes.
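The inner/outer relationship can be sketched as two mutually recursive functions. This is a simplified illustration under stated assumptions: `Node` is a hypothetical stand-in, text nodes are identified by the name "plain", and no escaping, indentation or self-closing handling is attempted.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for HtmlElement.
struct Node {
    std::string name;   // "plain" marks a text node
    std::string value;
    std::map<std::string, std::string> attributes;
    std::vector<std::shared_ptr<Node>> children;
};

std::string OuterHTML(const Node& node);

// Concatenation of serialized children, or the element value when there are none.
std::string InnerHTML(const Node& node) {
    if (node.children.empty()) return node.value;
    std::string out;
    for (const auto& child : node.children) {
        out += OuterHTML(*child);
    }
    return out;
}

// Start tag with attributes, inner content, then end tag.
std::string OuterHTML(const Node& node) {
    if (node.name == "plain") return node.value;  // text nodes have no tags
    std::string out = "<" + node.name;
    for (const auto& attr : node.attributes) {
        out += " " + attr.first + "=\"" + attr.second + "\"";
    }
    out += ">" + InnerHTML(node) + "</" + node.name + ">";
    return out;
}
```

Because the attributes live in a `std::map`, they serialize in sorted key order rather than source order.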
Tag name comparisons use _stricmp inside the header implementation (case-insensitive).
Recursive collector used by the public GetElementsByClassName.
Builds an XPath using EscapeForXPath and delegates to SelectElement to collect matches (i.e. uses the selector engine).
Recursive collector used by the public wrapper.
Append all descendant elements into result (depth-first).
Parses an attribute string into the element's attribute map. Handles quoted values, whitespace separation and builds classlist when the class attribute is present.
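As a rough illustration of that kind of parsing, the sketch below reads name/value pairs from the text between a tag name and the closing `>`. `ParseAttributes` is a hypothetical free function; storing an empty value for bare attributes and accepting unquoted values are assumptions, not confirmed behavior of the header.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical sketch: parse `name="value" name='value' name=value name`.
std::map<std::string, std::string> ParseAttributes(const std::string& text) {
    std::map<std::string, std::string> attrs;
    const std::size_t n = text.size();
    std::size_t i = 0;
    auto is_space = [](char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    };
    while (i < n) {
        while (i < n && is_space(text[i])) ++i;            // skip separators
        std::size_t start = i;
        while (i < n && !is_space(text[i]) && text[i] != '=') ++i;
        std::string name = text.substr(start, i - start);
        if (name.empty()) {                                 // stray '=' or end of input
            if (i < n) ++i;
            continue;
        }
        while (i < n && is_space(text[i])) ++i;
        std::string value;
        if (i < n && text[i] == '=') {
            ++i;
            while (i < n && is_space(text[i])) ++i;
            if (i < n && (text[i] == '"' || text[i] == '\'')) {
                const char quote = text[i++];               // quoted value
                start = i;
                while (i < n && text[i] != quote) ++i;
                value = text.substr(start, i - start);
                if (i < n) ++i;                             // consume closing quote
            } else {
                start = i;                                  // unquoted value
                while (i < n && !is_space(text[i])) ++i;
                value = text.substr(start, i - start);
            }
        }
        attrs[name] = value;
    }
    return attrs;
}
```

The class list can then be rebuilt from `attrs["class"]` by splitting on whitespace.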
Utility to avoid duplicate pointers in result vectors (compares shared_ptr identity).
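Comparing by `shared_ptr` identity means two distinct nodes with equal contents are both kept, while the same node reached twice is kept once. A minimal sketch, with `AppendUnique` as a hypothetical name:

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical helper: append item only if the same pointer is not already present.
template <typename T>
void AppendUnique(std::vector<std::shared_ptr<T>>& result,
                  const std::shared_ptr<T>& item) {
    // operator== on shared_ptr compares the stored pointers, not the pointees.
    if (std::find(result.begin(), result.end(), item) == result.end()) {
        result.push_back(item);
    }
}
```

This linear scan is simple but quadratic over a whole result set; it is a sketch of the idea rather than a tuned implementation.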
Purpose: Thin wrapper holding the document root. Mostly delegates to the root element for querying and serialization.
Constructs document wrapper around root.
Returns a shared_ptr<HtmlElement> to the document root.
Delegates to analogous methods on the root element.
SelectElement rules follow a strict structure, but one that should rarely, if ever, keep you from accomplishing what is needed.
1. All rules must start with / or //
2. The second token in the rule should be a tagName or * (all tags) Example: //DIV
3. The next part, if provided, must be enclosed in brackets [].
4. Within brackets the contents will look something like @id = 'item' or contains(@id,"it") or text(contains,"Jacket")
5. Negation (!) is not currently supported, but there are plans to add it.
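The first three rules above can be sketched as a small pre-check that splits a rule into its axis, tag and bracketed predicate. `ParsedRule` and `SplitRule` are hypothetical names used only for this illustration; the actual SelectElement parser is not reproduced here and may accept or reject different inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result of splitting a rule string.
struct ParsedRule {
    bool recursive = false;  // true for "//", false for "/"
    std::string tag;         // tag name or "*"
    std::string predicate;   // text between '[' and ']', empty if absent
};

// Returns false when the rule breaks one of the structural rules.
bool SplitRule(const std::string& rule, ParsedRule& out) {
    if (rule.empty() || rule[0] != '/') return false;   // rule 1: starts with / or //
    std::size_t i = 1;
    out.recursive = (i < rule.size() && rule[i] == '/');
    if (out.recursive) ++i;

    const std::size_t bracket = rule.find('[', i);
    out.tag = rule.substr(
        i, bracket == std::string::npos ? std::string::npos : bracket - i);
    if (out.tag.empty()) return false;                   // rule 2: tag name or *

    out.predicate.clear();
    if (bracket != std::string::npos) {
        if (rule.back() != ']') return false;            // rule 3: enclosed in brackets
        out.predicate = rule.substr(bracket + 1, rule.size() - bracket - 2);
    }
    return true;
}
```

For example, `//div[@id='item']` splits into a recursive search for tag `div` with predicate `@id='item'`, while `div` alone is rejected because it lacks the leading slash.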
| Rule Syntax | Description | Example | Matches |
|---|---|---|---|
| `/tag` | Selects direct child elements with a given tag name, starting at the document root level (this applies to all rules starting with "/"). More useful for XML. | `/div` | All `<div>` elements that are children. |
| `//tag` | Selects all descendants with a given tag name via recursive search (this applies to all rules starting with "//"). Normally you would use "//" over "/". | `//span` | All `<span>` elements anywhere in descendants. |
| `/*` | Wildcard: matches any tag name. | `/*` | All immediate children from current context. |
| `//*` | Wildcard: matches any tag name in descendants. | `//*` | All descendants from current context. |
| `//tag[@attr='value']` | Matches elements with a specific attribute equal to value. | `/a[@href='index.html']` | `<a href="index.html">`. |
| `//tag[@class='value']` | Matches elements if a class in the class list equals the specific value. | `/div[@class='highlight']` | `<div class="highlight bold">`. |
| `//tag[contains(@attr,'substring')]` | Matches if attribute contains substring. | `/div[contains(@class,'login')]` | `<div class="mylogin bold">`. |
| `//tag[starts-with(@attr,'prefix')]` | Matches if attribute starts with prefix. | `/div[starts-with(@class,'bo')]` | `<div class="mylogin bold">`. |
| `//tag[ends-with(@attr,'suffix')]` | Matches if attribute ends with suffix. | `/div[ends-with(@class,'ld')]` | `<div class="mylogin bold">`. |
Return document-level serialized HTML and text. Note: There is a bug in the header: InnerHTML() in the header calls itself recursively (returns InnerHTML()) which produces infinite recursion. It should call root_->InnerHTML() or root_->HtmlStylize(...). Fix this if you use InnerHTML().
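The shape of the fix can be shown with stand-in types. `Element` and `Document` below are hypothetical simplifications, not the header's classes; only the delegation through `root_` mirrors the correction described above, and the null check is an added precaution.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for HtmlElement.
struct Element {
    std::string html;
    std::string InnerHTML() const { return html; }
};

// Hypothetical stand-in for HtmlDocument.
class Document {
public:
    explicit Document(std::shared_ptr<Element> root) : root_(std::move(root)) {}

    // Buggy form: `return InnerHTML();` would call this same member forever.
    // Fixed form: delegate to the root element instead.
    std::string InnerHTML() const {
        return root_ ? root_->InnerHTML() : std::string();
    }

private:
    std::shared_ptr<Element> root_;
};
```

The unqualified call resolves to the document's own member, which is why the original never reaches the root element.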
Purpose: Stream-based parser that tokenizes HTML and constructs an HtmlDocument.
Constructor initializes a set of known self-closing tags (br, hr, img, ...).
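A lookup set of this kind might be sketched as follows. `SelfClosingTags` is a hypothetical class; it lists only the three tags named above (the real set contains more), and the case-insensitive lookup is an assumption based on how tag names are compared elsewhere.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <set>
#include <string>

// Hypothetical sketch of the self-closing tag table.
class SelfClosingTags {
public:
    // Only the tags named in this document; the actual list is longer.
    SelfClosingTags() : tags_{"br", "hr", "img"} {}

    // Takes the name by value so it can be lower-cased before the lookup.
    bool Contains(std::string name) const {
        std::transform(name.begin(), name.end(), name.begin(),
                       [](unsigned char c) {
                           return static_cast<char>(std::tolower(c));
                       });
        return tags_.count(name) != 0;
    }

private:
    std::set<std::string> tags_;
};
```

When a tag is in the set, a parser can close the element immediately instead of looking for a matching end tag.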
Convenience wrapper that calls Parse(const char* data, size_t len).
Parses the input buffer and returns shared_ptr<HtmlDocument>. The parser iterates input and on encountering a '<' it calls ParseElement. The return value is a document wrapper around an internal root element.
Internal recursive routine that reads a tag, its attributes, text nodes and child elements. It supports:
- comments <!-- ... -->
- processing instructions <? ... ?>
- self-closing tags (based on the initialized set)
- special handling for <script>, <style> and <noscript> to treat their contents as raw text
Helpers that advance the parse index until a substring or character is encountered. Used for skipping comments or finding closing tags.
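Index-advancing helpers of this sort can be sketched on top of `std::string::find`. The names `SkipUntil` and `SkipPast` and their signatures are hypothetical; returning the input length when nothing is found is an assumption made so that callers can treat "not found" as end of input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the index of the next `target` at or after `index`,
// or input.size() if it does not occur.
std::size_t SkipUntil(const std::string& input, std::size_t index, char target) {
    const std::size_t pos = input.find(target, index);
    return pos == std::string::npos ? input.size() : pos;
}

// Returns the index just past the next `marker` at or after `index`,
// or input.size() if it does not occur.
std::size_t SkipPast(const std::string& input, std::size_t index,
                     const std::string& marker) {
    const std::size_t pos = input.find(marker, index);
    return pos == std::string::npos ? input.size() : pos + marker.size();
}
```

Skipping a comment then amounts to `index = SkipPast(input, index, "-->")`, after which parsing resumes at the character following the comment.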
Basic parsing:
HtmlParser parser;
auto doc = parser.Parse(htmlString);
auto root = doc->GetRoot();
auto links = root->GetElementByTagName("a");
for (auto &ln : links) {
std::string href = ln->GetAttribute("href");
}
Modify inner HTML/text:
auto node = root->GetElementById("content");
HtmlParser parser2;
auto newDoc = parser2.Parse("<div>new content</div>");
node->SetInnerHTML(newDoc->GetRoot());
- HtmlDocument::InnerHTML bug: as noted above, the implementation recursively calls itself. Replace `return InnerHTML();` with `return root_->InnerHTML();`.
- Thread-safety: The parser and DOM are not thread-safe. shared_ptr is used for convenience, but concurrent modifications require synchronization.
- Encoding: The parser treats input as raw bytes and uses `std::string`; it does not perform character-set conversions. Use UTF-8 input consistently.
- Selector language: `SelectElement` implements a custom selector parser that is not full XPath but covers the most common needs.
Signature: static std::wstring toLowerW(const std::wstring& str)
Returns a lower-case copy of the provided wide string. Uses std::transform with ::tolower.
Signature: static std::string toLower(const std::string& str)
Returns a lower-case copy of the provided narrow string. Useful for case-insensitive comparisons of tag names and attributes.
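A portable way to write such a helper is sketched below, together with a comparison built on it as an alternative to the platform-specific `_stricmp`. `ToLowerAscii` and `EqualsIgnoreCase` are hypothetical names; the cast to `unsigned char` avoids undefined behavior when `std::tolower` receives a negative `char`.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Returns a lower-case copy; bytes outside the current locale's letters are unchanged.
std::string ToLowerAscii(const std::string& str) {
    std::string out = str;
    std::transform(out.begin(), out.end(), out.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    return out;
}

// Case-insensitive equality, suitable for tag and attribute names.
bool EqualsIgnoreCase(const std::string& a, const std::string& b) {
    return ToLowerAscii(a) == ToLowerAscii(b);
}
```

This works byte by byte, so it is only meaningful for ASCII names; multi-byte UTF-8 sequences pass through untouched in the default "C" locale.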
Signature: inline std::string EscapeForXPath(const std::string& value)
Returns a string suitable for embedding inside an XPath single-quoted literal. If the string contains no single quotes, it is returned unchanged; otherwise this function builds a concat(...) expression that preserves internal single quotes.
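The concat-building step might be implemented along the lines of the sketch below. `EscapeForXPathSketch` is a hypothetical re-creation based only on the description above, not the header's code; the exact spacing and the handling of leading or trailing quotes are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical re-creation of the escaping described above.
std::string EscapeForXPathSketch(const std::string& value) {
    if (value.find('\'') == std::string::npos) {
        return value;  // no single quotes: returned unchanged
    }
    std::string out = "concat(";
    bool first = true;
    auto append = [&](const std::string& piece) {
        if (!first) out += ", ";
        out += piece;
        first = false;
    };
    std::size_t start = 0;
    while (start <= value.size()) {
        const std::size_t quote = value.find('\'', start);
        const std::string chunk = value.substr(
            start, quote == std::string::npos ? std::string::npos : quote - start);
        if (!chunk.empty()) append("'" + chunk + "'");  // quote-free run
        if (quote == std::string::npos) break;
        append("\"'\"");                                // the single quote itself
        start = quote + 1;
    }
    out += ")";
    return out;
}
```

Each quote-free run is wrapped in single quotes and each single quote is emitted as a double-quoted literal, so `O'Reilly` becomes `concat('O', "'", 'Reilly')`.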
// Example:
EscapeForXPath("O'Reilly") // returns a concat('O', "'", 'Reilly') style expression
// HtmlElement (selected)
shared_ptr<HtmlElement> GetElementById(const std::string& id);
std::vector<shared_ptr<HtmlElement>> GetElementsById(const std::string& id);
std::vector<shared_ptr<HtmlElement>> GetElementsByClassName(const std::string& name);
std::vector<shared_ptr<HtmlElement>> GetElementByTagName(const std::string& name);
int SetInnerText(std::string text);
int SetInnerHTML(std::shared_ptr<HtmlElement> tempRoot);
std::string InnerHTML();
std::string OuterHTML();
std::string text();
// HtmlDocument
shared_ptr<HtmlElement> GetRoot();
std::vector<shared_ptr<HtmlElement>> GetElementByTagName(const std::string& name);
// HtmlParser
shared_ptr<HtmlDocument> Parse(const std::string& data);