Lecture notes by: Michael Hahsler
Diagrams for Chapter 4
This chapter introduces data structures used for search based on trees.
Lists have for many operations
Recursive definition: A tree
- is a collection of nodes which contain the information stored in the tree.
- is empty or has a single root node
$r$ . - has zero or more nonempty subtrees
$T_1, T_2, ..., T_k$ whose roots are connected by a directed edge from$r$ .
More naming conventions:
- Two nodes are called parent and child of each other if the parent as a directed edge to the child.
- Nodes without children are called leaves.
- A path is a sequence along edges. The path length is the number of edges.
- The depth of a node is the length of the path from the root node
$r$ to the node (there is only on path). - The path from a node to the farthest leaf node is called the node's height. Note that the height is the number of levels minus 1.
- A tree where a node can have at most
$M$ children is called an$M$ -ary tree.
Observations:
- A tree is a special type of graph with only a single path from the root to each leaf.
- A collection of
$N$ nodes (one is the root node) has$N - 1$ edges.
We implement a tree as a set of connected Nodes
. The data that is stored in each node is often called the key.
Each node in a general tree (std::forward_list
.
template <typename Object>
class MaryTree {
private:
class Node
{
public:
Object key; // the data that the node stores
std::forward_list<Node> children;
};
Node *root;
public:
// Rule-of-3 + functions to add, find and remove nodes
};
Note: Some implementations only use the node class and represent the tree as a reference or pointer to the root node.
Example: Unix file system. Try the tree
command on a linux machine. It performs a preorder traversal (depth-first; children are processed after the parent).
A binary tree is a tree in which no node can have more than two children.
Complexity: Operations are
The node structure is similar to a linked list node
template <typename Object>
...
class Node
{
public:
Object key;
Node *left;
Node *right;
}
nullptr
represents missing children.
Visiting all nodes in a tree is called tree traversal. There are several ways we can traverse a tree.
Depth-first traversal works for any m-ary tree, but we define it here for the special case of binary trees. We will use the following notation:
- N stands for processing the node. For example, printing the value stored in the node.
- L means process the left subtree.
- R means process the right subtree.
Once a leaf node is reached and processed, then we go back to the node we came from to process a different subtree. This is often called backtracking. i The three traversal orders are:
- preorder (NLR): process node before children. Pre means the node is processed first.
- inorder (LNR) first follow the left child and then process the node before following down the right child. In means that the node is processed in the middle. For binary search trees this results in processing the nodes in sort order.
- postorder (LRN): process the left and then the right subtree before the node is processed. Post means the node is processed last.
Recursive implementation
void traverseNLR(Node* n) {
cout << n->key; // N
traverseNLR(n->left); // L
traverseNLR(n->right); // R
}
void traverseLNR(Node* n) {
traverseLNR(n->left); // L
cout << n->key; // N
traverseLNR(n->right); // R
}
void traverseLRN(Node* n) {
traverseLRN(n->left); // L
traverseLRN(n->right); // R
cout << n->key; // N
}
For an iterative implementation, you need a stack.
Also called level-order: process the tree by level (may need an extra data structure like a queue to store unprocessed nodes). You will learn more about breadth-first traversal when you learn about graphs in algorithms.
An expression tree represents an expression with binary operators like
Construction
If we have infix notation then we can convert it first to postfix notation using a stack:
6 * (5 + (2 + 3) * 8 + 3) => 6 5 2 3 + 8 * + 3 + *
Algorithm:
Read the expression left-to-right. Do the following for:
- Operand: Add any operand (here a number) directly to the output.
- `(`: push ( on the stack.
- `)`: pop all operators from the stack and add them to the output till an opening
parenthesis is reached on the stack (also pop the `(` and ignore it).
- Operator: pop all operators with higher or equal precedence (order of operations)
than the new operator from the stack and add them to the output (if any).
Push the new operator on the stack.
Once the expression is done, pop any remaining operator and add it to the output.
We can create an expression tree from postfix notation with a stack.
Algorithm:
Read postfix expression one symbol at a time.
- Operand: create a new node for the operand and push a pointer to the node on the stack.
- Operator:
a. pop two operand nodes from the stack and
b. create a tree with the operator as its root, the first operand from the stack as the
right child and the second as the left child.
c. push a pointer to the tree on the stack
Once the expression is completely processes, you should have a pointer to the expression tree on the stack.
Applications of the parse tree: create
- infix notation: inorder tree transversal (LNR) = left expression in parentheses then operator and then the right subtree in parentheses.
- postfix notations: postorder tree traversal (LRN) = process operator after the children (no need for parentheses).
- prefix notations: preorder tree traversal (NLR) = process operator first (no need for parentheses).
Parse trees are
- Draw the stack operations to convert
$6 * (5 + (2 + 3) * 8 + 3)$ into postfix notation. - Convert the postfix notation into an expression tree.
- Convert the expression tree back into infix notation.
STL implementations of Binary Search trees: STL provides
std::set and std::map based on binary search trees.
The stored key objects need to be Comparable
with a definition of bool operator<(const &) const
or a function object
(see comparator example in Chapter 1).
Examples: How to use STL sets and maps
Example: Binary Tree Search is equivalent to Binary Search. A balanced search tree can be stored compactly in a vector using inorder traversal.
Maps are associative containers that relate a key to a value. In programming we typically have maps that map the set of keys onto the set of values in a 1-1 fashion (a bijection).
The purpose is to access the value fast given that we know the key. Note that a value can also be a collection of individual values (e.g., a list or a vector).
Requirements:
- keys need to be unique.
- keys need to have an order defined (be comparable with
operator<()
).
Implementations:
-
A binary search tree. The node has in addition to the key also a member variable to store the value.
template <typename KeyType, typename ValueType> ... class MapTreeNode { public: KeyType key; ValueType value; BinaryNode *left; BinaryNode *right; }
-
A hash table (we will talk about hashing later).
STL provides std::map. Here is a code example.
Multimaps can associate multiple values with a key. The STL provides stl::multimap.
We need secondary storage access for data that does not fit into main memory. Secondary storage is typically organized in blocks (file system, disk) and access is slow compared to main memory access (takes 100,000x for random access) so the Big-Oh method does not work (remember that it assumes that all operations take the same amount of time).
Problem: Every level in the tree requires a storage access.
Idea: Reduce tree depth by making the tree wider leading to a balanced M-ary search tree.
Common implementations are the B-tree and the B+ tree.
B-tree properties:
- Data is organized in sorted order.
- All data is stored in the leaf nodes as a sorted array with space for
$L$ values. At least$ceil(L/2)$ are occupied. - Each non-leaf node has space for
$M$ key/pointer pairs. At least$ceil(M/2)$ are occupied. The key is the minimum value of the entries following the pointer. -
$M$ and$L$ are chosen to fit into a storage block that can be accessed with a single read.
Pointers are block addresses.
The requirement that at least half the places are filled balances the tree. Insertion may lead to a split of a node into two half full nodes.
Operations are
- File systems: Quick random access to an arbitrary block in a particular file.
- Databases: Store auxiliary index structures for faster retrieval in very large databases.
All code and documents in this repository are provided under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License.