Lecture notes for the data structures class. If you have some free time, please read and give feedback, especially on the C++ parts.
Previous lecture: linked lists and skip lists.
Next lecture: balanced binary trees.
1. Binary trees
1.1. Terminology
A binary tree is a set of nodes connected in the following way. There is a designated node called the root. Every node has two pointers called left and right. Each pointer either points to another node or is equal to NULL. All nodes are reachable from the root by a sequence of left/right pointers. We have already seen a binary tree, when we discussed heap sort.
A leaf node is a node both of whose pointers are NULL; non-leaf nodes are called internal nodes.
Fix a node v of the tree. The node pointed to by the left pointer of v is called the left child of v, and the node pointed to by the right pointer is called the right child of v. A node on the path from the root to v is called an ancestor of v; a node is an ancestor of itself. A node reachable from v via a sequence of left/right pointers is called a descendant of v; a node is a descendant of itself.
The depth of v is the distance from the root to v, and the height of v is the longest distance from v to one of its descendants (the farthest such descendant must be a leaf).
The left sub-tree of a node v is the tree whose root is the left child of v. Similarly, we can define the right sub-tree of v.
A full binary tree is a tree in which each internal node has exactly two (non-NULL) children. A complete binary tree is a binary tree in which every level is completely filled, except possibly the last level, which is filled from left to right.
1.2. Expression trees, traversing binary trees
The nodes typically store satellite data (in addition to the two pointers), depending on what we want to do with the trees. Here is a simple C++ representation of a binary tree node:
/*
* -----------------------------------------------------------------------------
* A tree is simply a pointer to a BTNode
* -----------------------------------------------------------------------------
*/
template <typename Item>
struct BTNode {
    Item payload;
    BTNode* left;
    BTNode* right;
    // by default, a node is a leaf; left and right are NULL
    BTNode(const Item& item = Item(), BTNode* l = NULL, BTNode* r = NULL)
        : payload(item), left(l), right(r) {}
};
Arithmetic expression trees are binary trees that can unambiguously represent arithmetic expressions without parentheses. In this case, each node contains an operator or an operand: internal nodes are operators and leaves are operands. What is interesting is that by traversing the expression tree in some order, we can recover the infix, postfix, and prefix representations of the expression easily. There are at least 8 orders in which we can visit the nodes of a binary tree.
- Inorder traversal: starting from the root, we visit nodes of the tree in the following way. Recursively inorder-traverse the left subtree, visit the root, then recursively inorder-traverse the right subtree. "Inorder" means the root is visited in between the visits to the two subtrees. The pseudocode for this traversal can be written as:
Inorder-Traverse(root) {
    if (root != NULL) {
        Inorder-Traverse(root->left);
        Visit(root);
        Inorder-Traverse(root->right);
    }
}
A more compact way of describing the inorder traversal of a binary tree is to say that we traverse the tree in the (left, node, right) order.
- Reverse inorder traversal: (right, node, left).
- Preorder traversal: (node, left, right).
- Reverse preorder traversal: (node, right, left).
- Postorder traversal: (left, right, node).
- Reverse postorder traversal: (right, left, node).
All six of these traversal orders are depth-first in the sense that we completely explore one branch of a sub-tree before exploring the other branch; thus we always dive down as deep as possible. There are two other traversal orderings which are breadth-first: we explore nodes one depth at a time, starting with nodes at depth 0 (the root), then nodes at depth 1 (children of the root), then nodes at depth 2 (children of children of the root), and so forth.
- Level order traversal: list nodes level by level, left to right
- Reverse level order traversal: list nodes level by level, right to left
What do we mean by “visiting” a node? Well, that depends on what we want to do with the traversal. In the simplest case, we might just want to print out the content of the node when we visit it. Then, the inorder traversal procedure can be written in C++ as follows.
/*
* -----------------------------------------------------------------------------
* recursive inorder traverse & print nodes
* -----------------------------------------------------------------------------
*/
template <typename T>
void inorder_print(BTNode<T>* root) {
    if (root != NULL) {
        inorder_print(root->left);
        cout << root->payload << " ";
        inorder_print(root->right);
    }
}
Similarly, the preorder traversal function can be written as
/*
* -----------------------------------------------------------------------------
* recursive preorder traverse & print nodes
* -----------------------------------------------------------------------------
*/
template <typename T>
void preorder_print(BTNode<T>* root) {
    if (root != NULL) {
        cout << root->payload << " ";
        preorder_print(root->left);
        preorder_print(root->right);
    }
}
and the following is the postorder version:
/*
* -----------------------------------------------------------------------------
* recursive postorder traverse & print nodes
* -----------------------------------------------------------------------------
*/
template <typename T>
void postorder_print(BTNode<T>* root) {
    if (root != NULL) {
        postorder_print(root->left);
        postorder_print(root->right);
        cout << root->payload << " ";
    }
}
You can test those functions with the following snippet:
int main() {
    // first, test it with a simple expression tree for 2*(4+3)-6
    // the following expression reminds me a lot of Scheme/Lisp
    BTNode<string>* tree =
        new BTNode<string>(
            "-",
            new BTNode<string>(
                "*",
                new BTNode<string>("2"),
                new BTNode<string>(
                    "+",
                    new BTNode<string>("4"),
                    new BTNode<string>("3")
                )
            ),
            new BTNode<string>("6")
        );
    inorder_print(tree); cout << endl;   // note: does not print parentheses
    preorder_print(tree); cout << endl;
    postorder_print(tree); cout << endl;
    return 0;
}
If we inorder/postorder/preorder-traverse an expression tree, we obtain the corresponding infix, postfix, or prefix representation of the expression. The only slight difference is in printing the infix representation, where we have to print a pair of parentheses around each sub-expression; something along the following lines:
// you can check whether v is the root and not print the outermost (..)
Inorder-Traverse(Node v) {
    if (v != NULL) {
        if (v.payload is an operator) {
            Print("(");
            Inorder-Traverse(v.left);
            Print(v.payload);
            Inorder-Traverse(v.right);
            Print(")");
        } else {
            Print(v.payload);
        }
    }
}
Exercise: write the C++ procedures for the other three traversal orders, where "visit" = "print".
Exercise: suppose the payloads are all integers. Present an example tree in which all 8 traversal orders print the same sequence. Next, assume that all nodes' payloads are distinct. Show that from the combination of the inorder and postorder printouts, we can uniquely reconstruct the tree.
We discuss the breadth-first traversal in the next section.
1.3. Counting trees (you can skip this section)
Exercise: suppose payloads are all distinct integers. In the previous exercise we showed that inorder + postorder payload sequences uniquely determine the tree. Is it true that given any 2 out of 6 possible traversal sequences we can uniquely determine the tree?
Suppose all payloads are distinct integers. Without loss of generality, we can assume that all payloads are integers from 1 to n. It is an interesting exercise to determine the number of trees which give the same in/post/pre-order traversal sequence.
Consider any particular inorder traversal sequence $a_1, a_2, \ldots, a_n$, which is a permutation of $1, 2, \ldots, n$. The root could be any of the $a_i$. Say the root's payload is $a_{k+1}$. Then, the sequence $a_1, \ldots, a_k$ is the inorder traversal sequence of the left sub-tree of the root, and the sequence $a_{k+2}, \ldots, a_n$ is the inorder traversal sequence of the right sub-tree of the root. Hence, if we let $C_n$ denote the number of binary trees which yield a given inorder traversal sequence of length $n$, we have $C_0 = 1$ (the NULL tree) and
$$C_n = \sum_{k=0}^{n-1} C_k C_{n-1-k}.$$
How do we solve this recurrence? The easiest way is to use the generating function method. Define the generating function for the sequence by
$$G(x) = \sum_{n \ge 0} C_n x^n.$$
Then, the recurrence gives
$$G(x) = 1 + x\,G(x)^2.$$
Solving the quadratic equation gives us
$$G(x) = \frac{1 - \sqrt{1-4x}}{2x}.$$
From Newton's generalized binomial formula we obtain
$$\sqrt{1-4x} = 1 - 2\sum_{n \ge 0} \frac{1}{n+1}\binom{2n}{n} x^{n+1}.$$
As the coefficients of $G(x)$ are all non-negative, we must take the root with the minus sign, and hence
$$C_n = \frac{1}{n+1}\binom{2n}{n}.$$
This is the nth Catalan number, a sequence of integers occurring in a gazillion of situations in Enumerative Combinatorics.
Exercise: what is the number of binary trees with the same preorder sequence? (Assuming all the payloads are distinct.) Repeat the question with postorder and level-order.
1.4. Run time analysis of the depth first binary tree traversals
If we are only printing the content of the nodes, then Inorder-Traverse (and the other five, for that matter) runs in linear time, i.e., O(n) time. To see this, let T(n) denote the time it takes Inorder-Traverse to print all nodes in a binary tree with n nodes. Let k be the number of nodes on the left branch of the root, so that n-k-1 is the number of nodes on the right branch. Then, T(n) = T(k) + T(n-k-1) + c for some constant c, and T(0) = d for some constant d. We can iterate the recurrence a few times, as we did with the Fibonacci example:
T(0) = d
T(1) = 2T(0) + c = 2d + c
T(2) = T(0) + T(1) + c = 3d + 2c
At T(3), there are two choices depending on the structure of the tree:
T(3) = 2T(1) + c = 4d + 3c
or
T(3) = T(0) + T(2) + c = 4d + 3c
So, it seems that in general we have T(n) = (n+1)d + nc. Let’s prove it by induction:
T(n) = T(k) + T(n-k-1) + c
= (k+1)d + kc + (n-k)d + (n-k-1)c + c
= (n+1)d + nc.
Thus, overall we have T(n) = O(n). The same holds for the other depth-first traversal orders. (Again, this is assuming that when we visit a node we do a constant amount of work.)
Exercise: write a non-recursive routine to perform an inorder tree walk. You can try to use a stack first. Then, write one without a stack.
1.5. Level-order traversal
To print nodes out level by level, we use a first-in-first-out (FIFO) queue. Initially, the queue contains just the root. While the queue is not empty, we extract the node at the head of the queue, "visit" it, and enqueue the node's children. It is not hard to see that this process visits all nodes level by level.
/*
* -----------------------------------------------------------------------------
* level-order traverse & print nodes
* -----------------------------------------------------------------------------
*/
template <typename T>
void levelorder_print(BTNode<T>* root) {
    if (root != NULL) {
        deque<BTNode<T>*> q;
        q.push_front(root);
        while (!q.empty()) {
            BTNode<T>* cur = q.back();
            q.pop_back();
            cout << cur->payload << " ";
            if (cur->left != NULL) q.push_front(cur->left);
            if (cur->right != NULL) q.push_front(cur->right);
        }
        cout << endl;
    }
}
1.6. Huffman trees
In the standard ASCII encoding, each character is represented by 1 byte, or 8 bits. This encoding is convenient for programmers and system designers for a variety of reasons. However, purely in terms of data compression it is not very good. In the well-known frequency table for English letters occurring in a typical document, the letters e and t occur much more often than the letters j, x, and z. Hence, it makes sense to encode the more frequently occurring letters using fewer bits, and vice versa. Our only concern is to be able to decode the document on the other end of the communication channel. One way to represent an encoding is by a binary tree, where the left branch represents a 0 bit and the right branch represents a 1 bit. Internal nodes do not carry satellite data; leaf nodes correspond to letters of the English alphabet. The encoding of a letter is "read" from the tree by going down the 0/1 branches to the leaf labeled by that letter. This type of encoding is called a prefix code (or prefix-free code) because no encoding of a character is a prefix of the encoding of another character.
It should be noted that the encoding tree should be a full binary tree, because if an internal node has only one child then it can be removed without violating the prefix-free condition. For every letter $c$ in the alphabet, let $f(c)$ be its frequency and $d(c)$ its depth in the prefix code tree; then the total number of bits needed to encode the document is $\sum_{c \in \Sigma} f(c) \cdot d(c)$, where $\Sigma$ is the alphabet.
If we know the frequencies of letters in a document, there is a well-known greedy algorithm (due to Huffman) which builds an optimal binary encoding tree for the document: the binary tree that minimizes the sum $\sum_{c \in \Sigma} f(c) \cdot d(c)$. The resulting tree is called a Huffman tree.
2. Binary search trees (BST)
Managing a collection of (key, value) pairs is one of the most fundamental problems in computing. When we log on to a computer at school, our user name is the key and our password is the value. Google's MapReduce is a software framework for distributed computation on a huge collection of (key, value) pairs. Database indexing is the problem of storing values efficiently, allowing for lookups using the corresponding keys. DNS is an Internet service for looking up IPs given domain names: domain names are keys, and IPs are values. We could give a billion more examples.
Binary search trees are one of the most basic data structures for storing (key, value) pairs. A binary search tree (BST) is a binary tree in which each node holds a “key” and possibly additional satellite data. All keys come from a total-order domain such as integers or strings (lexicographically, e.g.). The keys satisfy the following property:
Binary Search Tree (BST) property: for any node v, we have x->key ≤ v->key ≤ y->key for any node x in the left sub-tree of v and any node y in the right sub-tree of v.
Note that a BST is not necessarily a full binary tree; some internal nodes might have only one child. From the BST property, it is not hard to see that
Lemma: if we perform Inorder-Traverse on a BST, we get all keys printed in non-decreasing order.
We can prove this by induction on the height of the tree. If the height of the tree is zero then there’s only one node which gets printed by Inorder-Traverse. In the general case, Inorder-Traverse(v->left) prints all nodes on the left sub-tree of v in non-decreasing order, and then v->key is printed, and finally all the keys on the right are printed in non-decreasing order. By the BST property, all printed keys are in non-decreasing order.
Exercise: argue that an inorder sequence of keys does not uniquely determine a BST. How about a postorder sequence of keys? What is the number of BSTs with the same inorder sequence of keys?
The following are the main operations that a BST provides, hopefully in an efficient manner:
- search(key): returns NULL if the key is not found, and a pointer to a node with the key if it is found. If the tree stores duplicate keys (with different values), the simplest solution is to keep at each node a linked list of all those values. In that case, we can return the list of all nodes with the same key easily. For now, our search routine will simply return any node with the given key.
- minimum/maximum: returns the pointer to the node with the minimum or maximum key.
- successor/predecessor: the successor of a node is the node with the next largest key. If the current node has the largest key, then successor returns NULL. The predecessor of a node v is the node whose successor is v. Again, the issue of duplicate keys might complicate matters a little. We will discuss this issue later.
- insert(key, value): insert a new node with the given (key, value) pair into the BST, maintaining the BST property.
- delete(node): delete a given node from the tree.
2.1. Search
Searching is the easiest of all routines. The recursive version is slightly easier to write but it is less efficient than the iterative search (unless the compiler is very smart!). So we write an iterative version of the search routine:
/*
* -----------------------------------------------------------------------------
* iterative search in a BST
* -----------------------------------------------------------------------------
*/
template <typename Key, typename Value>
BSTNode<Key, Value>* bst_search(BSTNode<Key, Value>* root, Key key) {
    while (root != NULL && root->key != key) {
        if (key < root->key) root = root->left;
        else root = root->right;
    }
    return root;
}
The total run time is O(h) where h is the height of the tree.
2.2. Maximum, Minimum, Successor, Predecessor
Finding the maximum and minimum nodes is very easy: the maximum node is the right-most node of a binary search tree, and the minimum node is the left-most node.
/*
* -----------------------------------------------------------------------------
* returns the maximum node of a BST with the given root node
* -----------------------------------------------------------------------------
*/
template <typename Key, typename Value>
BSTNode<Key, Value>* maximum(BSTNode<Key, Value>* root) {
    if (root == NULL) return NULL;
    while (root->right != NULL) root = root->right;
    return root;
}
/*
* -----------------------------------------------------------------------------
* returns the minimum node of a BST with the given root node
* -----------------------------------------------------------------------------
*/
template <typename Key, typename Value>
BSTNode<Key, Value>* minimum(BSTNode<Key, Value>* root) {
    if (root == NULL) return NULL;
    while (root->left != NULL) root = root->left;
    return root;
}
If a node v has a right branch, then the successor of v is the minimum node on the right branch. This is because the minimum node on the right branch is printed right after v in the inorder traversal of the tree. If v does not have a right branch, then the successor of v is the first ancestor u on the path from v up to the root for which another ancestor of v (possibly v itself) is the left child of u. This is because v is then the maximum node in the left sub-tree of u, so v is printed immediately before u in the inorder traversal. The reasoning for the predecessor is symmetric. From this analysis, we get the following:
/*
* -----------------------------------------------------------------------------
* return the successor of a given node;
* - if the node has a right branch, the successor is the left-most node on
* the right branch
* - otherwise, it is the first ancestor whose left child is an ancestor
* -----------------------------------------------------------------------------
*/
template <typename Key, typename Value>
BSTNode<Key, Value>* successor(BSTNode<Key, Value>* node) {
    if (node == NULL) return NULL;
    if (node->right != NULL) return minimum(node->right);
    BSTNode<Key, Value>* p = node->parent;
    while (p != NULL && p->right == node) {
        node = p;
        p = p->parent;
    }
    return p;
}
/*
* -----------------------------------------------------------------------------
* return the predecessor of a given node;
* - if the node has a left branch, the predecessor is the right-most node on
* the left branch
* - otherwise, it is the first ancestor whose right child is an ancestor
* -----------------------------------------------------------------------------
*/
template <typename Key, typename Value>
BSTNode<Key, Value>* predecessor(BSTNode<Key, Value>* node) {
    if (node == NULL) return NULL;
    if (node->left != NULL) return maximum(node->left);
    BSTNode<Key, Value>* p = node->parent;
    while (p != NULL && p->left == node) {
        node = p;
        p = p->parent;
    }
    return p;
}
The run time of all four routines is again O(h), where h is the height of the tree.
2.3. Insert
To insert a new node v into a tree, compare v->key with the root’s key and move left if v->key is smaller, right otherwise. Eventually, we will reach a NULL node and that’s where v will reside. In the entire process, we always ensure that v ends up in the correct sub-tree maintaining the BST property. The only minor technical point we need to remember is that we have to keep a parent pointer to the current visited node so that when we get to NULL we can put v as a child of that parent.
/*
* -----------------------------------------------------------------------------
* insert a new node to a BST
* -----------------------------------------------------------------------------
*/
template <typename Key, typename Value>
void bst_insert(BSTNode<Key, Value>*& root, BSTNode<Key, Value>* node) {
    node->left = node->right = NULL;
    BSTNode<Key, Value>* p = NULL;
    BSTNode<Key, Value>* cur = root;
    while (cur != NULL) {
        p = cur;
        if (node->key < cur->key) cur = cur->left;
        else cur = cur->right;
    }
    node->parent = p;
    if (p == NULL) {
        root = node; // empty tree to start with
    } else if (node->key < p->key) {
        p->left = node;
    } else {
        p->right = node;
    }
}
Insertion also takes O(h) time, as we only move down the tree once and do a constant amount of work at each step.
2.4. Delete
Deleting a given node v from the tree is slightly more complicated. If v has at most one child, then we can simply "splice out" v by connecting v's parent and v's child (or NULL if v has no child). The more complex case is when v has both children. The main idea is to find the node u which is v's successor. Since v has a right child, u is the left-most node on the right branch of v. Hence, we can easily splice out u, because u has no left child. Then, we put the content of u into v and we are done.
/*
* -----------------------------------------------------------------------------
* delete a node from the tree; we assume that v is a valid pointer to a
* node in the tree
* -----------------------------------------------------------------------------
*/
template <typename Key, typename Value>
void bst_delete(BSTNode<Key, Value>*& root, BSTNode<Key, Value>* v) {
    // determine the node to be spliced out
    BSTNode<Key, Value>* u;
    if (v->left == NULL || v->right == NULL) u = v;
    else u = successor(v);
    // keep u's child c, the non-NULL child if any
    BSTNode<Key, Value>* c;
    if (u->left == NULL) c = u->right;
    else c = u->left; // could still be NULL, but that's fine
    BSTNode<Key, Value>* p = u->parent;
    if (c != NULL) c->parent = p;
    if (p == NULL)  // this means u is the root, so c becomes the new root
        root = c;
    else if (p->left == u) p->left = c;
    else p->right = c;
    // finally put u's content in v if they're different, and delete u
    if (v != u) {
        v->key = u->key;
        v->value = u->value;
    }
    delete u;
}
Again, the routine runs in time O(h).
3. Random binary search trees, tree text-printing exercises
All of the major operations (search, minimum, maximum, successor, predecessor, insert, and delete) run in time proportional to the height of the tree. It is easy to find a sequence of insertions into an empty tree which makes the tree height linear in the number of nodes. In such cases, searching takes linear time and the tree structure is no better than a simple array.
We will in later lectures discuss more elaborate (binary) search tree data structures which can guarantee a worst-case height of O(log n). Before then, it is natural to wonder about the height of a random binary search tree. Generating a random binary search tree is also potentially useful in testing the above operations and future operations on BSTs too.
3.1. Generating a random binary search tree
We will generate a BST with n nodes with distinct keys from the set {0, 1, ..., n-1}. The root node's key is first chosen uniformly at random from the set. Suppose key i is chosen. Then, the left sub-tree is recursively generated at random from the set of keys {0, 1, ..., i-1}, and the right sub-tree is recursively generated at random from the set {i+1, ..., n-1} = {i+1+0, i+1+1, ..., i+1+(n-i-2)}. This strategy leads naturally to the following algorithm for random BST generation:
/*
 * -----------------------------------------------------------------------------
 * create a random binary search tree with n nodes and return a pointer to
 * the root; the strategy is recursive.
 * - the n keys are the integers base, base+1, ..., base+n-1
 * - we need the base for the recursive calls; the initial base is 0
 * - we first pick randomly the key for the root:
 *   key = base + root_rank, where root_rank is between 0 and n-1
 * - then, recursively create the left sub-tree with keys base .. base+root_rank-1
 * - and recursively create the right sub-tree with keys base+root_rank+1 .. base+n-1
 * - if n == 0 then return NULL
 * -----------------------------------------------------------------------------
 */
BSTNode<int, string>* random_bst(size_t base, size_t n,
                                 BSTNode<int, string>* p)
{
    if (n == 0) return NULL;
    size_t root_rank = rand() % n;
    ostringstream oss;
    oss << "Node" << base + root_rank;
    BSTNode<int, string>* node =
        new BSTNode<int, string>(base + root_rank, oss.str(), p);
    node->left = random_bst(base, root_rank, node);
    node->right = random_bst(base + root_rank + 1, n - root_rank - 1, node);
    return node;
}
3.2. Printing a binary tree
Generating a BST at random is nice, and it is easy to show that a preorder (for example) sequence of keys uniquely determines a BST. However, verifying the result requires a bit of eye inspection, especially for larger trees. It would be nice to be able to print out a BST visually. This operation is important not only for debugging various stages of a tree construction, but also in real-world software; for example, Matlab and R often have to print out decision trees representing classification rules.
Exercise (vertical tree): printing a binary tree vertically is easier, and this is the first printing exercise. Write a function which takes (the pointer to the root of a) binary tree and prints out the binary tree in the following format:
Node7
|__Node9
|  |__o
|  |
|  |__Node8
|
|__Node6
   |__o
   |
   |__Node3
      |__Node4
      |  |__Node5
      |  |
      |  |__o
      |
      |__Node1
         |__Node2
         |
         |__Node0
where 'o' represents NULL, and the right branch is printed before the left branch.
The vertical tree is slightly easier to program, but it is not visually satisfying.
Exercise: write a function which prints out a given tree horizontally in the following form:
         ____Node3_____
        /              \
  ___Node1___           o
 /           \
Node0       Node2

For simplicity, you can start with the assumption that all node values are strings of length at most 3. Then, extend the solution to arbitrary string lengths.
3.3. Expected height of a random BST (you can skip this section)
In this section, we derive an upper bound for the expected height of a random BST generated using the above algorithm. CSE 694 covers the type of analysis presented here. The take-home lesson is that a random BST has O(log n) expected height. This result has a practical implication: suppose we want to build a search structure to store n words from a dictionary; one candidate method is to randomly permute all the words and then insert them into a BST using the bst_insert() routine above. The BST will likely have logarithmic height, and thus future searches into this tree take logarithmic time.
It may not be immediately obvious that our random BST model matches inserting a random permutation into a BST, but this is not hard to see. Consider a random permutation of the numbers 0, 1, ..., n-1, and suppose i is the first element of the permutation. Then i will be the root's key, the numbers from 0 to i-1 go to the left sub-tree, and the rest go to the right sub-tree. Since i is uniformly distributed, this is exactly the recursive structure that random_bst generates.
The expected height of a random BST is a very well-studied problem. Let $X_n$ denote the height of a random BST with $n$ nodes; then it is known that
$$E[X_n] = \alpha \ln n - \beta \ln \ln n + O(1),$$
where $\alpha = 4.311\ldots$ and $\beta = \frac{3}{2\ln(\alpha/2)} = 1.953\ldots$. Also, the variance of $X_n$ is bounded by a constant, so the distribution is strongly concentrated around the mean. In this section, we will prove a weaker result, that $E[X_n] = O(\log n)$.
For technical reasons, define $Y_n = 2^{X_n}$. (This is similar to Bernstein's trick for proving Chernoff-style bounds.) We will bound $E[Y_n]$ first, then use the convexity of the exponential function to bound $E[X_n]$. For $0 \le i \le n-1$, let $Z_{n,i}$ be the indicator variable for the event that the key of rank $i$ was chosen as the root key. Then,
$$X_n = \sum_{i=0}^{n-1} Z_{n,i} \cdot \bigl(1 + \max(X_i, X_{n-i-1})\bigr).$$
Or, equivalently,
$$Y_n = \sum_{i=0}^{n-1} Z_{n,i} \cdot 2\max(Y_i, Y_{n-i-1}).$$
Now, since $Z_{n,i}$ and $\max(Y_i, Y_{n-i-1})$ are independent, and $E[Z_{n,i}] = 1/n$, by linearity of expectation we have
$$E[Y_n] = \frac{2}{n} \sum_{i=0}^{n-1} E[\max(Y_i, Y_{n-i-1})] \le \frac{2}{n} \sum_{i=0}^{n-1} E[Y_i + Y_{n-i-1}] = \frac{4}{n} \sum_{i=0}^{n-1} E[Y_i].$$
Let $y_n = E[Y_n]$; then, from the above inequality, we have $y_n \le \frac{4}{n} \sum_{i=0}^{n-1} y_i$. For analytical convenience, we let $X_n$ count the nodes (rather than the edges) on the longest root-to-leaf path, so that $X_0 = 0$ and $y_0 = 1$; this shifts the height by one, which does not affect an $O(\log n)$ bound. Using the identity $\sum_{i=0}^{n-1} \binom{i+3}{3} = \binom{n+3}{4} = \frac{n}{4}\binom{n+3}{3}$, a simple induction then shows that
$$y_n \le \binom{n+3}{3} = O(n^3).$$
Consequently, $E[2^{X_n}] = O(n^3)$. Since the exponential function is convex, by Jensen's inequality we get
$$2^{E[X_n]} \le E[2^{X_n}] = O(n^3),$$
and thus $E[X_n] \le 3\log_2 n + O(1) = O(\log n)$.
4. Optimal binary search trees (you should skim through the section)
As stated earlier, we will discuss binary search trees which have $O(\log n)$ height in the worst case. In such trees, searches cost $O(\log n)$ time. However, there are problems in which an $O(\log n)$ cost per search might be larger than necessary. For example, suppose that for each key we have a good estimate of the frequency with which the key is going to be searched; then it would make sense for the more frequently searched keys to have lower depths than the less frequently searched keys. This idea is similar to Huffman coding described above. However, the key difference here is that a BST key can be stored in an internal node of the tree too.
To formalize this problem, let's say we have a set of keys $k_1 < k_2 < \cdots < k_n$, and corresponding search probabilities $p_1, \ldots, p_n$. Sometimes we will also search for keys which are not in this set. A key with value less than $k_1$ is searched with probability $q_0$; a key with value between $k_i$ and $k_{i+1}$ is searched with probability $q_i$; and a key with value greater than $k_n$ is searched with probability $q_n$. Thus, overall, we have
$$\sum_{i=1}^{n} p_i + \sum_{i=0}^{n} q_i = 1,$$
but this fact is not important for us. We will have nodes in the tree that represent "dummy keys", which are keys "in between" the $k_i$. Let's call them the keys $d_0, d_1, \ldots, d_n$, where $d_i$ stands for the gap between $k_i$ and $k_{i+1}$. They will be the leaves of our BST. Let $\mathrm{depth}_T(v)$ denote the depth of node $v$ in tree $T$. Then, we want to construct a BST $T$ to minimize the total cost
$$\mathrm{cost}(T) = \sum_{i=1}^{n} \bigl(\mathrm{depth}_T(k_i) + 1\bigr) \cdot p_i + \sum_{i=0}^{n} \bigl(\mathrm{depth}_T(d_i) + 1\bigr) \cdot q_i.$$
This is the expected search cost of a random query into the tree, given that the query's key follows the distribution defined by the $p_i$ and the $q_i$.
Here's the key observation: let $T$ be an optimal BST for this problem; then any sub-tree $T'$ of $T$ must be optimal for its own sub-problem. What we mean by this is the following: if $T'$ contains the keys $k_i, \ldots, k_j$ and the dummy keys $d_{i-1}, \ldots, d_j$, then the sum
$$\sum_{l=i}^{j} \bigl(\mathrm{depth}_{T'}(k_l) + 1\bigr) \cdot p_l + \sum_{l=i-1}^{j} \bigl(\mathrm{depth}_{T'}(d_l) + 1\bigr) \cdot q_l$$
has to be minimum among all such trees too. Hence, if we let $e[i, j]$ denote the optimal cost of a tree which contains the keys $k_i, \ldots, k_j$ (and the dummy keys $d_{i-1}, \ldots, d_j$), and we let $w(i, j) = \sum_{l=i}^{j} p_l + \sum_{l=i-1}^{j} q_l$ denote the total probability mass of this sub-problem, then we can write down a recurrence for $e[i, j]$ as follows:
$$e[i, i-1] = q_{i-1}$$
(this is for the tree that only contains the dummy key $d_{i-1}$), and
$$e[i, j] = \min_{i \le r \le j} \bigl\{ e[i, r-1] + e[r+1, j] + w(i, j) \bigr\} \quad \text{if } i \le j.$$
Finally, dynamic programming completes the job. You'll learn much more about dynamic programming in CSE 331, CSE 431/531.