jtaylorme edited this page Aug 13, 2025 · 5 revisions

htmlParser — API Documentation

 

Overview

This header implements a lightweight HTML parser and DOM-like data structures with three main components:

  • HtmlElement — node structure representing elements and text ("plain") nodes.
  • HtmlDocument — thin wrapper around an HtmlElement root.
  • HtmlParser — streaming parser that turns HTML text into an HtmlDocument.

The parser uses std::shared_ptr<HtmlElement> for node references and provides selection helpers (id/class/tag, and a lightweight SelectElement rule mechanism).

Usage Example

// parse a string into a document and query elements
HtmlParser parser;
auto doc = parser.Parse(html_string);
auto root = doc->GetRoot();

// find first element with id "main"
auto e = root->GetElementById("main");

// get all <li> elements under root
auto lis = root->GetElementByTagName("li");
for (auto &li : lis) {
    // li is std::shared_ptr<HtmlElement>
    std::string text = li->GetValue();
}


Class: HtmlElement

Type: class HtmlElement : public enable_shared_from_this<HtmlElement>

Purpose: Represents a DOM node. It stores tag name, attributes, text value, class list, children and a weak parent pointer. Many selection and traversal helpers are implemented as member functions.

Constructors

  • HtmlElement() — default
  • HtmlElement(shared_ptr<HtmlElement> p) — constructs a node and sets its parent to p.

Public traversal & query methods

GetAttribute(const std::string& k)

Returns: attribute value or empty string if missing.

Example: std::string href = node->GetAttribute("href");

SetAttribute(const std::string& name, const std::string& value)

Sets or removes an attribute. If value is empty the attribute is erased. Special handling: when setting the "class" attribute the internal classlist cache is rebuilt from whitespace-separated tokens.
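
The class-list rebuild described above can be pictured with a small standalone sketch. SplitClassTokens is a hypothetical helper written for illustration, not a function from the header; it only shows one way to split a class attribute value on whitespace.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the header): split a class attribute
// value into tokens, treating any run of whitespace as a separator.
std::vector<std::string> SplitClassTokens(const std::string& value) {
    std::vector<std::string> tokens;
    std::istringstream in(value);
    std::string token;
    while (in >> token) {  // operator>> skips leading whitespace
        tokens.push_back(token);
    }
    return tokens;
}
```

With this sketch, a value such as "highlight   bold" yields the two tokens highlight and bold, and an empty value yields an empty list.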

GetAttributes()

Returns a copy of the attribute std::map<std::string,std::string>.

GetElementById(const std::string& id)

Returns: shared_ptr<HtmlElement> — the first element in this subtree whose id attribute equals id, or an empty shared_ptr if none found.

Behavior: Performs a depth-first traversal over children. Stops at the first match and returns it.
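
A minimal sketch of a depth-first, first-match search follows. SketchNode and FindFirstById are hypothetical stand-ins and not the header's types; the sketch assumes only descendants are tested, not the starting node itself, which may differ from the real implementation.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for HtmlElement with just enough state for the search.
struct SketchNode {
    std::map<std::string, std::string> attributes;
    std::vector<std::shared_ptr<SketchNode>> children;
};

// Depth-first, pre-order search over the descendants of `node`.
// Returns an empty shared_ptr when no descendant carries the id.
std::shared_ptr<SketchNode> FindFirstById(const std::shared_ptr<SketchNode>& node,
                                          const std::string& id) {
    if (!node) return nullptr;
    for (const auto& child : node->children) {
        if (!child) continue;
        auto it = child->attributes.find("id");
        if (it != child->attributes.end() && it->second == id) {
            return child;  // stop at the first match
        }
        if (auto hit = FindFirstById(child, id)) return hit;
    }
    return nullptr;
}
```

Because the search is pre-order, a match nested inside an earlier sibling wins over a match in a later sibling.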

GetElementsById(const std::string& id)

Returns a std::vector<shared_ptr<HtmlElement>> containing all elements in this subtree with the given id (the implementation collects results via a helper).

GetElementsByClassName(const std::string& name)

Returns a vector of elements with the CSS class name. (Uses a recursive helper to collect matches.)

GetClassList() const

Returns the parsed list of class tokens as std::vector<std::string>. This list is kept in sync when SetAttribute("class", ...) is used or when class-manipulation helpers below are called.

HasClass(const std::string& cls)

Returns true if the element's class list contains cls.

AddClass/RemoveClass/ToggleClass/ClearClasses()

Convenience helpers to manipulate the element's class list and keep the class attribute string synchronized via UpdateClassAttribute().
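
To illustrate how a token list and the class attribute string can be kept in sync, here is a standalone sketch. SketchClassState and its member names are assumptions made for this example; only the names HasClass, ToggleClass and UpdateClassAttribute are borrowed from the text above, and the bodies are not taken from the header.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in holding only the class-related state of an element.
struct SketchClassState {
    std::vector<std::string> classlist;  // parsed class tokens
    std::string classAttribute;          // value of the "class" attribute

    // Rebuild the attribute string from the token list (space-separated).
    void UpdateClassAttribute() {
        classAttribute.clear();
        for (const auto& token : classlist) {
            if (!classAttribute.empty()) classAttribute += ' ';
            classAttribute += token;
        }
    }

    bool HasClass(const std::string& cls) const {
        return std::find(classlist.begin(), classlist.end(), cls) != classlist.end();
    }

    // Add the class when absent, remove it when present, then resync the string.
    void ToggleClass(const std::string& cls) {
        auto it = std::find(classlist.begin(), classlist.end(), cls);
        if (it == classlist.end()) {
            classlist.push_back(cls);
        } else {
            classlist.erase(it);
        }
        UpdateClassAttribute();
    }
};
```

Every mutation ends with a call to UpdateClassAttribute, so readers of the attribute string never see a stale value.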

GetElementByTagName(const std::string& name)

Public wrapper that returns a vector of all descendant elements whose tag name matches name (case-insensitive via _stricmp in the implementation). Delegates to the private recursive helper.

SelectElement(const std::string& rule, std::vector<shared_ptr<HtmlElement>>& result)

Purpose: A lightweight XPath-like selector. The rule syntax is described under SelectElement in the HtmlDocument section below.

Notes: This is a bespoke, simplified selector engine — consult the implementation before relying on complex XPath features.

GetParent()

Returns a shared_ptr<HtmlElement> to the parent or empty if none.

GetSiblingNext(), GetSiblingPrev()

Return the next/previous sibling element or nullptr if not present. They identify the node in the parent's child vector and return the adjacent element.
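
The sibling lookup described above can be sketched as follows. SketchSiblingNode, NextSibling and PrevSibling are hypothetical names used for illustration only; the sketch assumes the parent is held as a weak pointer, as stated in the HtmlElement overview.

```cpp
#include <algorithm>
#include <iterator>
#include <memory>
#include <vector>

// Hypothetical stand-in: a weak parent pointer plus an ordered child vector.
struct SketchSiblingNode {
    std::weak_ptr<SketchSiblingNode> parent;
    std::vector<std::shared_ptr<SketchSiblingNode>> children;
};

// Locate `node` in its parent's child vector and return the following entry.
std::shared_ptr<SketchSiblingNode> NextSibling(const std::shared_ptr<SketchSiblingNode>& node) {
    if (!node) return nullptr;
    auto parent = node->parent.lock();
    if (!parent) return nullptr;  // root node or expired parent
    const auto& siblings = parent->children;
    auto it = std::find(siblings.begin(), siblings.end(), node);
    if (it == siblings.end() || std::next(it) == siblings.end()) return nullptr;
    return *std::next(it);
}

// Same lookup, returning the preceding entry instead.
std::shared_ptr<SketchSiblingNode> PrevSibling(const std::shared_ptr<SketchSiblingNode>& node) {
    if (!node) return nullptr;
    auto parent = node->parent.lock();
    if (!parent) return nullptr;
    const auto& siblings = parent->children;
    auto it = std::find(siblings.begin(), siblings.end(), node);
    if (it == siblings.end() || it == siblings.begin()) return nullptr;
    return *std::prev(it);
}
```

Each call costs a linear scan of the parent's children, so walking a long sibling chain this way is quadratic; iterating over GetChildren() once is cheaper for that case.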

GetChildren()

Returns a copy of the internal children vector: std::vector<shared_ptr<HtmlElement>>.

SetInnerText(std::string text)

Replace the element's inner text. If there are no child nodes it creates a single text ("plain") child node and sets its value; otherwise it writes the value into the first child.

SetInnerHTML(std::shared_ptr<HtmlElement> tempRoot)

Replace current children with the children of the tempRoot element (used when parsing new HTML for insertion).

GetValue()

Returns the element's value member. If value is empty and the element has a single "plain" child, returns that child's value.

GetName()

Returns the element's tag name (a reference to the internal string).

text(), PlainStylize(std::string&)

Produce a text-only representation of the subtree (strips markup). Special tags (script/style, etc.) are skipped. This is used by text() to extract readable text.
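
A rough sketch of text extraction that skips script and style subtrees is shown below. SketchTextNode and CollectText are hypothetical; the sketch assumes text nodes are named "plain" and that tag names are already lower-case, neither of which is confirmed by the header.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in node: "plain" marks a text node in this sketch.
struct SketchTextNode {
    std::string name;   // tag name, assumed lower-case here
    std::string value;  // text content for plain nodes
    std::vector<std::shared_ptr<SketchTextNode>> children;
};

// Append the readable text of the subtree rooted at `node` to `out`.
void CollectText(const SketchTextNode& node, std::string& out) {
    if (node.name == "script" || node.name == "style") return;  // skipped subtrees
    if (node.name == "plain") out += node.value;
    for (const auto& child : node.children) {
        if (child) CollectText(*child, out);
    }
}
```

The output parameter mirrors the PlainStylize(std::string&) signature, which lets the recursion append into one buffer instead of concatenating temporaries.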

InnerHTML(), OuterHTML(), HtmlStylize(std::string&)

Produce HTML serialization for the element (inner/outer). HtmlStylize is the recursive serializer; InnerHTML returns concatenation of serialized children or the element value when no children exist; OuterHTML wraps with start/end tags and attributes.

Notes about case sensitivity

Tag name comparisons use _stricmp inside the header implementation (case-insensitive).


Private helpers (selected)

GetElementsByClassName(const std::string& cls, std::vector<shared_ptr<HtmlElement>>& result)

Recursive collector used by the public GetElementsByClassName.

GetElementsById(const std::string& id, std::vector<shared_ptr<HtmlElement>>& result)

Builds an XPath using EscapeForXPath and delegates to SelectElement to collect matches (i.e. uses the selector engine).

GetElementByTagName(const std::string& name, std::vector<shared_ptr<HtmlElement>>& result)

Recursive collector used by the public wrapper.

GetAllElement(std::vector<shared_ptr<HtmlElement>>& result)

Append all descendant elements into result (depth-first).

Parse(const std::string& attr)

Parses an attribute string into the element's attribute map. Handles quoted values, whitespace separation and builds classlist when the class attribute is present.
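
For orientation, here is a simplified attribute-string parser. ParseAttributesSketch is a hypothetical free function, not the header's private Parse; it assumes no whitespace around '=' and stores valueless attributes with an empty value, which may not match the real behavior.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical sketch: parse `name="value" name='value' name=value name`.
std::map<std::string, std::string> ParseAttributesSketch(const std::string& attr) {
    std::map<std::string, std::string> out;
    const std::size_t n = attr.size();
    std::size_t i = 0;
    auto isSpace = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
    while (i < n) {
        while (i < n && isSpace(attr[i])) ++i;  // skip separators
        const std::size_t nameStart = i;
        while (i < n && !isSpace(attr[i]) && attr[i] != '=') ++i;
        const std::string name = attr.substr(nameStart, i - nameStart);
        if (name.empty()) {
            if (i < n) ++i;  // stray '=': step over it to guarantee progress
            continue;
        }
        std::string value;
        if (i < n && attr[i] == '=') {
            ++i;
            if (i < n && (attr[i] == '"' || attr[i] == '\'')) {
                const char quote = attr[i++];
                const std::size_t valueStart = i;
                while (i < n && attr[i] != quote) ++i;
                value = attr.substr(valueStart, i - valueStart);
                if (i < n) ++i;  // consume the closing quote
            } else {
                const std::size_t valueStart = i;
                while (i < n && !isSpace(attr[i])) ++i;  // unquoted value
                value = attr.substr(valueStart, i - valueStart);
            }
        }
        out[name] = value;
    }
    return out;
}
```

A caller could then feed the "class" entry through a whitespace tokenizer to build the class list.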

InsertIfNotExists(...)

Utility to avoid duplicate pointers in result vectors (compares shared_ptr identity).
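
Identity-based de-duplication can be sketched like this. InsertIfAbsent is a hypothetical name and signature chosen for the example; the header's InsertIfNotExists parameters are elided in this page and are not reproduced here.

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Hypothetical sketch: append `item` only if the same pointer is not already
// in `result`. shared_ptr equality compares the stored pointers, not the
// pointed-to values. Returns true when the item was appended.
template <typename T>
bool InsertIfAbsent(std::vector<std::shared_ptr<T>>& result,
                    const std::shared_ptr<T>& item) {
    if (std::find(result.begin(), result.end(), item) != result.end()) {
        return false;
    }
    result.push_back(item);
    return true;
}
```

Two distinct nodes with identical content are both kept, because only pointer identity is compared.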


Class: HtmlDocument

Purpose: Thin wrapper holding the document root. Mostly delegates to the root element for querying and serialization.

HtmlDocument(shared_ptr<HtmlElement>& root)

Constructs document wrapper around root.

GetRoot()

Returns a shared_ptr<HtmlElement> to the document root.

GetElementById/ GetElementsById / GetElementsByClassName / GetElementByTagName

Delegates to analogous methods on the root element.

SelectElement(const std::string& rule, std::vector<shared_ptr<HtmlElement>>& result)

The rule structure is strict, but it should rarely, if ever, keep you from accomplishing what is needed.

1. All rules must start with / or //.

2. The second token in the rule should be a tag name or * (all tags). Example: //DIV

3. The next part, if provided, must be enclosed in brackets [].

4. Within the brackets, the contents look something like @id = 'item' or contains(@id,"it") or text(contains,"Jacket").

5. Negation (!) is not currently supported, but there are plans to add it.

 

| Rule Syntax | Description | Example | Matches |
|---|---|---|---|
| /tag | Selects direct child elements with a given tag name, starting at the document root level (keep this in mind for all rules starting with "/"). More useful for XML. | /div | All <div> elements that are children. |
| //tag | Selects all descendants with a given tag name via a recursive search (keep this in mind for all rules starting with "//"). Normally you would use "//" over "/". | //span | All <span> elements anywhere in descendants. |
| /* | Wildcard: matches any tag name. | /* | All immediate children from current context. |
| //* | Wildcard: matches any tag name in descendants. | //* | All descendants from current context. |
| //tag[@attr='value'] | Matches elements with a specific attribute equal to value. | /a[@href='index.html'] | <a href="index.html">. |
| //tag[@class='value'] | Matches elements if a class in the class list equals the specific value. | /div[@class='highlight'] | <div class="highlight bold">. |
| //tag[contains(@attr,'substring')] | Matches if the attribute contains substring. | /div[contains(@class,'login')] | <div class="mylogin bold">. |
| //tag[starts-with(@attr,'prefix')] | Matches if the attribute starts with prefix. | /div[starts-with(@class,'bo')] | <div class="mylogin bold">. |
| //tag[ends-with(@attr,'suffix')] | Matches if the attribute ends with suffix. | /div[ends-with(@class,'ld')] | <div class="mylogin bold">. |
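
The class examples above suggest that class predicates are tested against each class token rather than against the whole attribute string (starts-with 'bo' matches "mylogin bold"). The sketch below illustrates that reading; it is an assumption drawn from the examples, and StartsWith, EndsWith, Contains and AnyClassMatches are hypothetical helpers, not functions from the header.

```cpp
#include <string>
#include <vector>

bool StartsWith(const std::string& s, const std::string& prefix) {
    return s.size() >= prefix.size() && s.compare(0, prefix.size(), prefix) == 0;
}

bool EndsWith(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

bool Contains(const std::string& s, const std::string& needle) {
    return s.find(needle) != std::string::npos;
}

// Apply a string predicate to each class token; true when any token matches.
template <typename Pred>
bool AnyClassMatches(const std::vector<std::string>& classlist,
                     const std::string& arg, Pred pred) {
    for (const auto& token : classlist) {
        if (pred(token, arg)) return true;
    }
    return false;
}
```

Under this reading, the class list {"mylogin", "bold"} satisfies all three predicates from the table, whereas StartsWith applied to the raw string "mylogin bold" with prefix "bo" would fail. Check the implementation before relying on either behavior.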

OuterHTML(), InnerHTML(), text()

Return document-level serialized HTML and text. Note: the header contains a bug. HtmlDocument::InnerHTML() calls itself (it returns InnerHTML()), which produces infinite recursion. It should call root_->InnerHTML() or root_->HtmlStylize(...) instead. Fix this before you use InnerHTML().
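
The shape of the fix can be shown with stand-in types. SketchElement and SketchDocument are hypothetical and deliberately tiny; only the member name root_ and the forwarding call come from the note above, and the null check is an extra precaution added for this sketch.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for HtmlElement.
struct SketchElement {
    std::string html;
    std::string InnerHTML() const { return html; }
};

// Hypothetical stand-in for HtmlDocument.
class SketchDocument {
public:
    explicit SketchDocument(std::shared_ptr<SketchElement> root)
        : root_(std::move(root)) {}

    // Buggy form: `return InnerHTML();` calls this same function forever.
    // Corrected form: forward to the root element.
    std::string InnerHTML() const {
        return root_ ? root_->InnerHTML() : std::string();
    }

private:
    std::shared_ptr<SketchElement> root_;
};
```

The corrected member forwards exactly once, so the call depth no longer depends on anything but the element's own serializer.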


Class: HtmlParser

Purpose: Stream-based parser that tokenizes HTML and constructs an HtmlDocument.

HtmlParser()

Constructor initializes a set of known self-closing tags (br, hr, img, ...).

Parse(const std::string& data)

Convenience wrapper that calls Parse(const char* data, size_t len).

Parse(const char* data, size_t len)

Parses the input buffer and returns shared_ptr<HtmlDocument>. The parser iterates input and on encountering a '<' it calls ParseElement. The return value is a document wrapper around an internal root element.

ParseElement(size_t index, shared_ptr<HtmlElement>& element)

Internal recursive routine that reads a tag, its attributes, text nodes and child elements. It supports:

  • comments <!-- ... -->
  • processing instructions <? ... ?>
  • self-closing tags (based on the initialized set)
  • special handling for <script>, <style> and <noscript> to treat their contents as raw text

SkipUntil(size_t index, const char* data) and SkipUntil(size_t index, char data)

Helpers that advance the parse index until a substring or character is encountered. Used for skipping comments or finding closing tags.
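
A standalone sketch of the substring variant follows. SkipUntilSketch is a hypothetical free function that takes the buffer explicitly; the real helpers are members with the signatures shown above, and their return convention when nothing is found is not documented here, so returning the buffer length is an assumption.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sketch: return the index of the first occurrence of `token`
// at or after `index`, or `len` when the token does not occur.
std::size_t SkipUntilSketch(const char* data, std::size_t len,
                            std::size_t index, const char* token) {
    const std::size_t n = std::strlen(token);
    if (n == 0) return index < len ? index : len;
    while (index + n <= len) {
        if (std::memcmp(data + index, token, n) == 0) return index;
        ++index;
    }
    return len;
}
```

For a comment such as "<!-- note -->", searching for "-->" from index 0 lands on the first dash of the terminator, so the caller still has to step past the token itself.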


Examples and recommended usage patterns

Basic parsing:

HtmlParser parser;
auto doc = parser.Parse(htmlString);
auto root = doc->GetRoot();
auto links = root->GetElementByTagName("a");
for (auto &ln : links) {
    std::string href = ln->GetAttribute("href");
}

Modifying inner HTML/text:

auto node = root->GetElementById("content");
if (node) {  // GetElementById returns an empty shared_ptr when nothing matches
    HtmlParser parser2;
    auto newDoc = parser2.Parse("<div>new content</div>");
    node->SetInnerHTML(newDoc->GetRoot());
}

Known issues & suggestions

  • HtmlDocument::InnerHTML bug: as noted above, the implementation recursively calls itself. Replace return InnerHTML(); with return root_->InnerHTML();.
  • Thread-safety: The parser and DOM are not thread-safe — shared_ptr is used for convenience but concurrent modifications will require synchronization.
  • Encoding: The parser treats input as raw bytes and uses std::string — it does not perform character-set conversions. Use UTF-8 input consistently.
  • Selector language: SelectElement implements a custom selector parser that is not full XPath but covers the most common needs.

Global helper functions

toLowerW(const std::wstring&)

Signature: static std::wstring toLowerW(const std::wstring& str)

Returns a lower-case copy of the provided wide string. Uses std::transform with ::tolower.

toLower(const std::string&)

Signature: static std::string toLower(const std::string& str)

Returns a lower-case copy of the provided narrow string. Useful for case-insensitive comparisons of tag names and attributes.
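
A sketch of a narrow-string lower-casing helper and a comparison built on it is given below. ToLowerSketch and EqualsIgnoreCase are hypothetical names; the cast to unsigned char is a standard precaution for std::tolower and may not be present in the header's version.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical sketch of a lower-casing helper for narrow strings.
std::string ToLowerSketch(const std::string& str) {
    std::string out = str;
    std::transform(out.begin(), out.end(), out.begin(), [](unsigned char c) {
        // std::tolower requires a value representable as unsigned char.
        return static_cast<char>(std::tolower(c));
    });
    return out;
}

// Case-insensitive equality, e.g. for tag or attribute names.
bool EqualsIgnoreCase(const std::string& a, const std::string& b) {
    return ToLowerSketch(a) == ToLowerSketch(b);
}
```

This only folds single-byte characters according to the current C locale; multi-byte UTF-8 sequences pass through unchanged.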

EscapeForXPath(const std::string&)

Signature: inline std::string EscapeForXPath(const std::string& value)

Returns a string suitable for embedding inside an XPath single-quoted literal. If the string contains no single quotes, it is returned unchanged; otherwise this function builds a concat(...) expression that preserves internal single quotes.

// Example:
EscapeForXPath("O'Reilly")  // returns: concat('O', "'", 'Reilly') style expression
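
One way such an escape can be built is sketched below. EscapeForXPathSketch is a hypothetical re-implementation for illustration; the exact spacing and quoting produced by the real EscapeForXPath are not shown in this page, so the output format here is an assumption.

```cpp
#include <string>

// Hypothetical sketch: values without single quotes are returned unchanged;
// otherwise the value is split at each single quote and rebuilt as
// concat('part', "'", 'part', ...).
std::string EscapeForXPathSketch(const std::string& value) {
    if (value.find('\'') == std::string::npos) return value;

    std::string out = "concat(";
    bool first = true;
    auto append = [&](const std::string& piece) {
        if (!first) out += ", ";
        out += piece;
        first = false;
    };

    std::string::size_type start = 0;
    while (true) {
        const auto pos = value.find('\'', start);
        const std::string chunk = value.substr(
            start, pos == std::string::npos ? std::string::npos : pos - start);
        if (!chunk.empty()) append("'" + chunk + "'");  // single-quoted text run
        if (pos == std::string::npos) break;
        append("\"'\"");                                 // the quote itself, double-quoted
        start = pos + 1;
    }
    out += ")";
    return out;
}
```

In this sketch, "O'Reilly" becomes concat('O', "'", 'Reilly'). A value consisting of a lone single quote yields a one-argument concat, which standard XPath 1.0 would reject because concat needs at least two arguments; a production version would need to handle that case.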

Appendix: quick reference (selected signatures)

// HtmlElement (selected)
shared_ptr<HtmlElement> GetElementById(const std::string& id);
std::vector<shared_ptr<HtmlElement>> GetElementsById(const std::string& id);
std::vector<shared_ptr<HtmlElement>> GetElementsByClassName(const std::string& name);
std::vector<shared_ptr<HtmlElement>> GetElementByTagName(const std::string& name);
int SetInnerText(std::string text);
int SetInnerHTML(std::shared_ptr<HtmlElement> tempRoot);
std::string InnerHTML();
std::string OuterHTML();
std::string text();

// HtmlDocument
shared_ptr<HtmlElement> GetRoot();
std::vector<shared_ptr<HtmlElement>> GetElementByTagName(const std::string& name);

// HtmlParser
shared_ptr<HtmlDocument> Parse(const std::string& data);

 
