Home

htmlParser — API Documentation

Overview

This header implements a lightweight HTML parser and DOM-like data structures with three main components:

HtmlElement — node structure representing elements and text ("plain") nodes.
HtmlDocument — thin wrapper around an HtmlElement root.
HtmlParser — streaming parser that turns HTML text into an HtmlDocument.

The parser uses std::shared_ptr<HtmlElement> for node references and provides selection helpers (id/class/tag, and a lightweight SelectElement rule mechanism).

Usage Example

// Parse a string into a document and query elements.
HtmlParser parser;
auto doc = parser.Parse(html_string);
auto root = doc->GetRoot();
// Find the first element with id "main" (an empty shared_ptr if none exists).
auto e = root->GetElementById("main");
// Get all <li> elements under root.
auto lis = root->GetElementByTagName("li");
for (auto &li : lis) {
    // li is std::shared_ptr<HtmlElement>
    std::string text = li->GetValue();
}

Class: `HtmlElement`

Type: class HtmlElement : public enable_shared_from_this<HtmlElement>

Purpose: Represents a DOM node. It stores tag name, attributes, text value, class list, children and a weak parent pointer. Many selection and traversal helpers are implemented as member functions.

Constructors

HtmlElement() — default
HtmlElement(shared_ptr<HtmlElement> p) — constructs a node and sets its parent to p.

Public traversal & query methods

`GetAttribute(const std::string& k)`

Returns: attribute value or empty string if missing.

Example: std::string href = node->GetAttribute("href");

`SetAttribute(const std::string& name, const std::string& value)`

Sets or removes an attribute. If value is empty the attribute is erased. Special handling: when setting the "class" attribute the internal classlist cache is rebuilt from whitespace-separated tokens.
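The class-list rebuild described above amounts to splitting the attribute value on runs of whitespace. The following is a minimal sketch of that step only; `SplitClassTokens` is a hypothetical name used for illustration and is not taken from the header.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the header): splits a class attribute
// value into tokens, treating any run of whitespace as one separator.
std::vector<std::string> SplitClassTokens(const std::string& value) {
    std::vector<std::string> tokens;
    std::istringstream stream(value);
    std::string token;
    while (stream >> token) {  // operator>> skips leading whitespace
        tokens.push_back(token);
    }
    return tokens;
}
```

With this sketch, a value such as "mylogin bold" yields the two tokens "mylogin" and "bold", and an empty value yields an empty list.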

`GetAttributes()`

Returns a copy of the attribute std::map<std::string,std::string>.

`GetElementById(const std::string& id)`

Returns: shared_ptr<HtmlElement> — the first element in this subtree whose id attribute equals id, or an empty shared_ptr if none found.

Behavior: Performs a depth-first traversal over children. Stops at the first match and returns it.
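The depth-first, stop-at-first-match behavior can be sketched on a stand-in node type. `IdNode` and `FindFirstById` are hypothetical names, and the sketch assumes that only descendants are tested, not the starting node itself; check the header for the exact behavior.

```cpp
#include <memory>
#include <string>
#include <vector>

// Minimal stand-in for HtmlElement with only the fields this sketch needs.
struct IdNode {
    std::string id;
    std::vector<std::shared_ptr<IdNode>> children;
};

// Pre-order depth-first search: test a child, then descend into it before
// moving on to its next sibling. Returns an empty pointer when nothing matches.
std::shared_ptr<IdNode> FindFirstById(const std::shared_ptr<IdNode>& node,
                                      const std::string& id) {
    if (!node) return nullptr;
    for (const auto& child : node->children) {
        if (!child) continue;
        if (child->id == id) return child;
        if (auto hit = FindFirstById(child, id)) return hit;
    }
    return nullptr;
}
```

Because the search descends before visiting siblings, a match nested inside an earlier child is returned ahead of a match in a later sibling.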

`GetElementsById(const std::string& id)`

Returns a std::vector<shared_ptr<HtmlElement>> containing all elements in this subtree with the given id (the implementation collects results via a helper).

`GetElementsByClassName(const std::string& name)`

Returns a vector of elements with the CSS class name. (Uses a recursive helper to collect matches.)

`GetClassList() const`

Returns the parsed list of class tokens as std::vector<std::string>. This list is kept in sync when SetAttribute("class", ...) is used or when class-manipulation helpers below are called.

`HasClass(const std::string& cls)`

Returns true if the element's class list contains cls.

`AddClass/RemoveClass/ToggleClass/ClearClasses()`

Convenience helpers to manipulate the element's class list and keep the class attribute string synchronized via UpdateClassAttribute().
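To illustrate how a class list and its attribute string can be kept in sync, here is a small sketch. `ClassListSketch` and its members are stand-ins that mirror the documented method names; the bodies are assumptions, not the header's code.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Stand-in for the class-related state of an element.
struct ClassListSketch {
    std::vector<std::string> classes;  // parsed class tokens
    std::string class_attribute;       // serialized "class" attribute value

    bool HasClass(const std::string& cls) const {
        return std::find(classes.begin(), classes.end(), cls) != classes.end();
    }
    void AddClass(const std::string& cls) {
        if (!cls.empty() && !HasClass(cls)) classes.push_back(cls);
        UpdateClassAttribute();
    }
    void RemoveClass(const std::string& cls) {
        classes.erase(std::remove(classes.begin(), classes.end(), cls),
                      classes.end());
        UpdateClassAttribute();
    }
    void ToggleClass(const std::string& cls) {
        if (HasClass(cls)) RemoveClass(cls); else AddClass(cls);
    }
    void ClearClasses() {
        classes.clear();
        UpdateClassAttribute();
    }
    // Rebuilds the attribute string as space-separated tokens.
    void UpdateClassAttribute() {
        class_attribute.clear();
        for (const auto& cls : classes) {
            if (!class_attribute.empty()) class_attribute += ' ';
            class_attribute += cls;
        }
    }
};
```

Every mutating helper ends by rebuilding the attribute string, so the two representations cannot drift apart.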

`GetElementByTagName(const std::string& name)`

Public wrapper that returns a vector of all descendant elements whose tag name matches name (case-insensitive via _stricmp in the implementation). Delegates to the private recursive helper.

`SelectElement(const std::string& rule, std::vector<shared_ptr<HtmlElement>>& result)`

Purpose: A lightweight XPath-like selector. The rule syntax is described under SelectElement in the HtmlDocument class section below.

Notes: This is a bespoke, simplified selector engine — consult the implementation before relying on complex XPath features.

`GetParent()`

Returns a shared_ptr<HtmlElement> to the parent or empty if none.

`GetSiblingNext(), GetSiblingPrev()`

Return the next/previous sibling element or nullptr if not present. They identify the node in the parent's child vector and return the adjacent element.
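The sibling lookup described above can be sketched as a search in the parent's child vector followed by a step to the neighbouring slot. `SiblingNode`, `NextSibling` and `PrevSibling` are hypothetical names for illustration.

```cpp
#include <algorithm>
#include <iterator>
#include <memory>
#include <vector>

// Minimal stand-in for HtmlElement: a weak parent link plus owned children.
struct SiblingNode {
    std::weak_ptr<SiblingNode> parent;
    std::vector<std::shared_ptr<SiblingNode>> children;
};

// Returns the element after `node` in its parent's child vector, or nullptr.
std::shared_ptr<SiblingNode> NextSibling(const std::shared_ptr<SiblingNode>& node) {
    if (!node) return nullptr;
    auto parent = node->parent.lock();
    if (!parent) return nullptr;  // no parent, so no siblings
    const auto& kids = parent->children;
    auto it = std::find(kids.begin(), kids.end(), node);
    if (it == kids.end() || std::next(it) == kids.end()) return nullptr;
    return *std::next(it);
}

// Returns the element before `node` in its parent's child vector, or nullptr.
std::shared_ptr<SiblingNode> PrevSibling(const std::shared_ptr<SiblingNode>& node) {
    if (!node) return nullptr;
    auto parent = node->parent.lock();
    if (!parent) return nullptr;
    const auto& kids = parent->children;
    auto it = std::find(kids.begin(), kids.end(), node);
    if (it == kids.end() || it == kids.begin()) return nullptr;
    return *std::prev(it);
}
```

The weak parent pointer avoids a reference cycle between parent and child; it has to be locked before use.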

`GetChildren()`

Returns a copy of the internal children vector: std::vector<shared_ptr<HtmlElement>>.

`SetInnerText(std::string text)`

Replaces the element's inner text. If the element has no child nodes, it creates a single text ("plain") child node and sets its value; otherwise it writes the value into the first child.

`SetInnerHTML(std::shared_ptr<HtmlElement> tempRoot)`

Replaces the current children with the children of the tempRoot element (used when parsing new HTML for insertion).

`GetValue()`

Returns the element's value member. If value is empty and the element has a single "plain" child, returns that child's value.
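The fallback rule above can be sketched as follows. `ValueNode` and `GetValueSketch` are hypothetical, and marking a text node by the name "plain" is an assumption of this sketch; the header may identify text nodes differently.

```cpp
#include <memory>
#include <string>
#include <vector>

// Minimal stand-in for HtmlElement. Using the name "plain" to mark a text
// node is an assumption made for this sketch.
struct ValueNode {
    std::string name;
    std::string value;
    std::vector<std::shared_ptr<ValueNode>> children;
};

// Returns the node's own value, or the value of its only child when that
// child is a text node and the node's own value is empty.
std::string GetValueSketch(const ValueNode& node) {
    if (!node.value.empty()) return node.value;
    if (node.children.size() == 1 && node.children[0] &&
        node.children[0]->name == "plain") {
        return node.children[0]->value;
    }
    return node.value;  // still empty at this point
}
```

This is why calling GetValue() on an element such as <li>Jacket</li> can return the text even though the text lives in a child node.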

`GetName()`

Returns the element's tag name (a reference to the internal string).

`text(), PlainStylize(std::string&)`

Produce a text-only representation of the subtree (strips markup). Special tags (script/style, etc.) are skipped. This is used by text() to extract readable text.

`InnerHTML(), OuterHTML(), HtmlStylize(std::string&)`

Produce HTML serialization for the element (inner/outer). HtmlStylize is the recursive serializer; InnerHTML returns concatenation of serialized children or the element value when no children exist; OuterHTML wraps with start/end tags and attributes.
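The relationship between inner and outer serialization can be sketched with two mutually recursive functions. Everything here is a stand-in: `HtmlSketchNode` treats an empty name as a text node, attributes are written without escaping, and self-closing tags are not handled, all of which may differ from the header.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Stand-in node; an empty name marks a text node in this sketch.
struct HtmlSketchNode {
    std::string name;
    std::string value;
    std::map<std::string, std::string> attributes;
    std::vector<std::shared_ptr<HtmlSketchNode>> children;
};

std::string OuterHtmlSketch(const HtmlSketchNode& node);

// Concatenation of the serialized children, or the value when childless.
std::string InnerHtmlSketch(const HtmlSketchNode& node) {
    if (node.children.empty()) return node.value;
    std::string out;
    for (const auto& child : node.children) {
        if (child) out += OuterHtmlSketch(*child);
    }
    return out;
}

// Start tag with attributes, then the inner HTML, then the end tag.
std::string OuterHtmlSketch(const HtmlSketchNode& node) {
    if (node.name.empty()) return node.value;  // text node: emit text only
    std::string out = "<" + node.name;
    for (const auto& attr : node.attributes) {
        out += " " + attr.first + "=\"" + attr.second + "\"";
    }
    out += ">";
    out += InnerHtmlSketch(node);
    out += "</" + node.name + ">";
    return out;
}
```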

Notes about case sensitivity

Tag name comparisons use _stricmp inside the header implementation (case-insensitive).

Private helpers (selected)

`GetElementsByClassName(const std::string& cls, std::vector<shared_ptr<HtmlElement>>& result)`

Recursive collector used by the public GetElementsByClassName.

`GetElementsById(const std::string& id, std::vector<shared_ptr<HtmlElement>>& result)`

Builds an XPath using EscapeForXPath and delegates to SelectElement to collect matches (i.e. uses the selector engine).

`GetElementByTagName(const std::string& name, std::vector<shared_ptr<HtmlElement>>& result)`

Recursive collector used by the public wrapper.

`GetAllElement(std::vector<shared_ptr<HtmlElement>>& result)`

Appends all descendant elements to result (depth-first).

`Parse(const std::string& attr)`

Parses an attribute string into the element's attribute map. Handles quoted values, whitespace separation and builds classlist when the class attribute is present.

`InsertIfNotExists(...)`

Utility to avoid duplicate pointers in result vectors (compares shared_ptr identity).
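Identity-based de-duplication can be sketched as below. The real helper's signature is elided in this documentation, so `InsertIfNotExistsSketch` and its parameters are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Appends `item` unless the same pointer is already present. Comparison is
// by pointer identity, so two distinct nodes with equal content are both kept.
// Returns true when the item was inserted.
template <typename T>
bool InsertIfNotExistsSketch(std::vector<std::shared_ptr<T>>& result,
                             const std::shared_ptr<T>& item) {
    if (std::find(result.begin(), result.end(), item) != result.end()) {
        return false;
    }
    result.push_back(item);
    return true;
}
```

A linear scan is adequate for small result sets; for large ones a std::unordered_set of raw pointers alongside the vector would avoid quadratic cost.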

Class: `HtmlDocument`

Purpose: Thin wrapper holding the document root. Mostly delegates to the root element for querying and serialization.

`HtmlDocument(shared_ptr<HtmlElement>& root)`

Constructs document wrapper around root.

`GetRoot()`

Returns a shared_ptr<HtmlElement> to the document root.

`GetElementById/ GetElementsById / GetElementsByClassName / GetElementByTagName`

Delegates to analogous methods on the root element.

`SelectElement(const std::string& rule, std::vector<shared_ptr<HtmlElement>>& result)`

The rule syntax is strict, but it should rarely, if ever, keep you from accomplishing what is needed.

1. All rules must start with / or //.

2. The second token in the rule must be a tag name or * (all tags). Example: //DIV

3. The next part, if provided, must be enclosed in brackets [].

4. Within the brackets, the contents look something like @id = 'item' or contains(@id,"it") or text(contains,"Jacket").

5. Negation (!) is not currently supported, but there are plans to add it.

Rule Syntax	Description	Example	Matches
/tag	Selects direct child elements with the given tag name, starting at the document root level; the same applies to every rule that starts with "/". More useful for XML.	/div	All <div> elements that are direct children.
//tag	Selects all descendants with the given tag name (recursive search); the same applies to every rule that starts with "//". Normally preferred over "/".	//span	All <span> elements anywhere among the descendants.
/*	Wildcard: matches any tag name among direct children.	/*	All immediate children of the current context.
//*	Wildcard: matches any tag name among descendants.	//*	All descendants of the current context.
//tag[@attr='value']	Matches elements whose attribute equals value.	//a[@href='index.html']	<a href="index.html">
//tag[@class='value']	Matches elements whose class list contains a class equal to value.	//div[@class='highlight']	<div class="highlight bold">
//tag[contains(@attr,'substring')]	Matches if the attribute contains substring.	//div[contains(@class,'login')]	<div class="mylogin bold">
//tag[starts-with(@attr,'prefix')]	Matches if the attribute starts with prefix.	//div[starts-with(@class,'bo')]	<div class="mylogin bold">
//tag[ends-with(@attr,'suffix')]	Matches if the attribute ends with suffix.	//div[ends-with(@class,'ld')]	<div class="mylogin bold">
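The structural requirements listed above (leading / or //, then a tag name or *, then an optional bracketed part) can be checked with a few string operations. This is a hedged sketch of one way to split a rule; `ParsedRule` and `ParseRuleSketch` are hypothetical and do not reflect how SelectElement is implemented.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result of splitting a rule into its three parts.
struct ParsedRule {
    bool recursive = false;  // true for "//", false for "/"
    std::string tag;         // tag name or "*"
    std::string predicate;   // text between [ and ], empty if absent
};

// Returns false when the rule breaks one of the structural requirements.
bool ParseRuleSketch(const std::string& rule, ParsedRule& out) {
    out = ParsedRule{};
    if (rule.empty() || rule[0] != '/') return false;  // must start with /
    std::size_t pos = 1;
    if (pos < rule.size() && rule[pos] == '/') {        // "//" means recursive
        out.recursive = true;
        ++pos;
    }
    const std::size_t open = rule.find('[', pos);
    out.tag = rule.substr(
        pos, open == std::string::npos ? std::string::npos : open - pos);
    if (out.tag.empty()) return false;                  // tag or * is required
    if (open == std::string::npos) return true;         // no bracketed part
    if (rule.back() != ']') return false;               // brackets must close
    out.predicate = rule.substr(open + 1, rule.size() - open - 2);
    return true;
}
```

For example, //div[@class='highlight'] splits into a recursive search, the tag div, and the predicate @class='highlight'. Interpreting the predicate is a separate step that this sketch does not attempt.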

`OuterHTML(), InnerHTML(), text()`

Return the document-level serialized HTML and text. Note: the header contains a bug. HtmlDocument::InnerHTML() calls itself (return InnerHTML();), which causes infinite recursion. It should call root_->InnerHTML() or root_->HtmlStylize(...) instead. Fix this before using InnerHTML().
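The shape of the fix is a plain delegation to the root. The sketch below uses stand-in types (`RootSketch`, `DocumentSketch`) rather than the real classes; only the member name root_ comes from the note above, and the empty-root guard is an addition of this sketch.

```cpp
#include <memory>
#include <string>
#include <utility>

// Stand-in for the root element; returns a fixed string for demonstration.
struct RootSketch {
    std::string InnerHTML() const { return "<p>hi</p>"; }
};

// Stand-in for the document wrapper.
class DocumentSketch {
public:
    explicit DocumentSketch(std::shared_ptr<RootSketch> root)
        : root_(std::move(root)) {}

    // The buggy form `return InnerHTML();` would call this same function
    // forever. Delegating to the root element terminates.
    std::string InnerHTML() const {
        return root_ ? root_->InnerHTML() : std::string();
    }

private:
    std::shared_ptr<RootSketch> root_;
};
```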

Class: `HtmlParser`

Purpose: Stream-based parser that tokenizes HTML and constructs an HtmlDocument.

`HtmlParser()`

Constructor initializes a set of known self-closing tags (br, hr, img, ...).

`Parse(const std::string& data)`

Convenience wrapper that calls Parse(const char* data, size_t len).

`Parse(const char* data, size_t len)`

Parses the input buffer and returns shared_ptr<HtmlDocument>. The parser iterates input and on encountering a '<' it calls ParseElement. The return value is a document wrapper around an internal root element.

`ParseElement(size_t index, shared_ptr<HtmlElement>& element)`

Internal recursive routine that reads a tag, its attributes, text nodes and child elements. It supports:

comments
processing instructions <? ... ?>
self-closing tags (based on the initialized set)
special handling for <script>, <style> and <noscript> to treat their contents as raw text

`SkipUntil(size_t index, const char* data)` and `SkipUntil(size_t index, char data)`

Helpers that advance the parse index until a substring or character is encountered. Used for skipping comments or finding closing tags.
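An index-advancing helper of this kind can be sketched with std::string::find. `SkipUntilSketch` is hypothetical: it takes the input explicitly, and it stops at the start of the marker. Whether the real helpers stop at or past the match is not stated here.

```cpp
#include <cstddef>
#include <string>

// Returns the index of the first occurrence of `marker` at or after `index`,
// or input.size() when the marker does not occur.
std::size_t SkipUntilSketch(const std::string& input, std::size_t index,
                            const std::string& marker) {
    const std::size_t found = input.find(marker, index);
    return found == std::string::npos ? input.size() : found;
}

// Single-character overload with the same contract.
std::size_t SkipUntilSketch(const std::string& input, std::size_t index,
                            char marker) {
    const std::size_t found = input.find(marker, index);
    return found == std::string::npos ? input.size() : found;
}
```

Skipping a comment then becomes a search for "-->" starting just after the opening "<!--".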

Examples and recommended usage patterns

Basic parsing:

HtmlParser parser;
auto doc = parser.Parse(htmlString);
auto root = doc->GetRoot();
// The documented method name is GetElementByTagName (singular "Element").
auto links = root->GetElementByTagName("a");
for (auto &ln : links) {
    std::string href = ln->GetAttribute("href");
}

Modify inner HTML/text

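auto node = root->GetElementById("content");
if (node) {  // GetElementById returns an empty shared_ptr when nothing matches
    HtmlParser parser2;
    auto newDoc = parser2.Parse("<div>new content</div>");
    // Replaces node's children with the children of the parsed root.
    node->SetInnerHTML(newDoc->GetRoot());
}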

Known issues & suggestions

HtmlDocument::InnerHTML bug: as noted above, the implementation recursively calls itself. Replace return InnerHTML(); with return root_->InnerHTML();.
Thread-safety: The parser and DOM are not thread-safe — shared_ptr is used for convenience but concurrent modifications will require synchronization.
Encoding: The parser treats input as raw bytes and uses std::string — it does not perform character-set conversions. Use UTF-8 input consistently.
Selector language: SelectElement implements a custom selector parser that is not full XPath but covers the most common needs.

Global helper functions

`toLowerW(const std::wstring&)`

Signature: static std::wstring toLowerW(const std::wstring& str)

Returns a lower-case copy of the provided wide string. Uses std::transform with ::tolower.

`toLower(const std::string&)`

Signature: static std::string toLower(const std::string& str)

Returns a lower-case copy of the provided narrow string. Useful for case-insensitive comparisons of tag names and attributes.
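A lower-casing helper along these lines can be written with std::transform. `ToLowerSketch` and `EqualsIgnoreCase` are hypothetical names; the cast through unsigned char is a precaution of this sketch, since passing a negative char to std::tolower is undefined behavior.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns a lower-case copy of `str` (ASCII-oriented, locale-dependent).
std::string ToLowerSketch(const std::string& str) {
    std::string out = str;
    std::transform(out.begin(), out.end(), out.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return out;
}

// Case-insensitive comparison built on the helper above.
bool EqualsIgnoreCase(const std::string& a, const std::string& b) {
    return ToLowerSketch(a) == ToLowerSketch(b);
}
```

This only folds single-byte characters, so multi-byte UTF-8 sequences pass through unchanged.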

`EscapeForXPath(const std::string&)`

Signature: inline std::string EscapeForXPath(const std::string& value)

Returns a string suitable for embedding inside an XPath single-quoted literal. If the string contains no single quotes, it is returned unchanged; otherwise this function builds a concat(...) expression that preserves internal single quotes.

// Example:
EscapeForXPath("O'Reilly")  // returns: concat('O', "'", 'Reilly') style expression
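The concat(...) construction can be sketched as follows. `EscapeForXPathSketch` is a hypothetical re-implementation written from the description above, not the header's code; in particular, the exact spacing and quoting of the real output may differ.

```cpp
#include <cstddef>
#include <string>

// Returns `value` unchanged when it has no single quote. Otherwise builds a
// concat(...) expression: runs without quotes become single-quoted pieces and
// each single quote becomes the double-quoted piece "'".
std::string EscapeForXPathSketch(const std::string& value) {
    if (value.find('\'') == std::string::npos) return value;

    std::string out = "concat(";
    bool first = true;
    auto append = [&](const std::string& piece) {
        if (!first) out += ", ";
        out += piece;
        first = false;
    };

    std::size_t start = 0;
    while (start <= value.size()) {
        const std::size_t quote = value.find('\'', start);
        const std::size_t end =
            (quote == std::string::npos) ? value.size() : quote;
        if (end > start) append("'" + value.substr(start, end - start) + "'");
        if (quote == std::string::npos) break;
        append("\"'\"");  // the quote itself, wrapped in double quotes
        start = quote + 1;
    }
    out += ")";
    return out;
}
```

For the input O'Reilly this sketch produces concat('O', "'", 'Reilly'). A value consisting of a single quote alone yields a one-argument concat, which strict XPath rejects; handling that edge case is left out here.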

Appendix: quick reference (selected signatures)

// HtmlElement (selected)
shared_ptr<HtmlElement> GetElementById(const std::string& id);
std::vector<shared_ptr<HtmlElement>> GetElementsById(const std::string& id);
std::vector<shared_ptr<HtmlElement>> GetElementsByClassName(const std::string& name);
std::vector<shared_ptr<HtmlElement>> GetElementByTagName(const std::string& name);
int SetInnerText(std::string text);
int SetInnerHTML(std::shared_ptr<HtmlElement> tempRoot);
std::string InnerHTML();
std::string OuterHTML();
std::string text();
// HtmlDocument
shared_ptr<HtmlElement> GetRoot();
std::vector<shared_ptr<HtmlElement>> GetElementByTagName(const std::string& name);
// HtmlParser
shared_ptr<HtmlDocument> Parse(const std::string& data);
