March 7, 2011

## Way Up High A Binary Tree

Way up high in the binary tree
Two binary bits smiled at me
I shook that tree as hard as I could
Down came the binary bits,
Mmmm–were they good! -based on a children’s song

Now it turns out that this binary tree is an AVL tree. I wonder how high this tree is! Why does it concern us ? The reason is that the height will give us the worst case running time of the binary search tree algorithm. An AVL tree maintains its height by making sure the difference in height between the left and right subtree is at most 1. If $N_h$ is the number of nodes contained in the tree of height h, then $N_h$ is equal to the number of nodes in the left subtree plus the number of nodes in the right subtree PLUS 1 (the root node). In the worst case, the two subtrees differ in height by at most 1, therefore

$\displaystyle N_h = N_{h-1} + N_{h-2} + 1$

with initial conditions $N_0 = 0$ and $N_1 = 1$.

By adding 1 to both sides of this recurrence relation we get

$\displaystyle N_h + 1 = N_{h-1} + 1 + N_{h-2} + 1$

If we let $b_h = N_h + 1$, we can rewrite the above recurrence relation as

$\displaystyle b_h = b_{h-1} + b_{h-2}$

The initial conditions are now

$b_0 = N_0 + 1 = 0 + 1 = 1$ and $b_1 = N_1 + 1 = 1 + 1 = 2$

Now this is interesting! This recurrence relation is very similar to the Fibonacci recurrence relation except that the initial conditions are different. From the previous post, we found that the general solution of this recurrence relation is

$\displaystyle b_n = \alpha_1 \phi^n + \alpha_2 (1-\phi)^n$

where $\alpha_1$ and $\alpha_2$ are constants yet to be determined. Let us solve this recurrence relation in the general case where $b_0 = s$ and $b_1=t$. Computing the first 2 terms of the recurrence relation, we get two simultaneous equations in $\alpha_1$ and $\alpha_2$.

$\displaystyle b_0 = s = \alpha_1 + \alpha_2$
$\displaystyle b_1 = t = \alpha_1 \phi + \alpha_2 (1-\phi)$

Solving for $\alpha_1$ from the first equation, we get

$\displaystyle \alpha_1 = s- \alpha_2$

Plugging this into the second equation and solving for $\alpha_2$

$\displaystyle t = (s-\alpha_2) \phi + \alpha_2 - \alpha_2\phi$
$\displaystyle t = s\phi + \alpha_2(1- 2\phi)$
$\displaystyle \alpha_2 = \frac{t-s\phi}{1-2\phi}$

From which we solve for $\alpha_1$

$\displaystyle \alpha_1 = s - \frac{t-s\phi}{1-2\phi}$
$\displaystyle = \frac{s(1-\phi) - t }{ 1-2\phi}$

Substituting this to the recurrence relation of $b_n$, we have

$\displaystyle b_n = \frac{s(1-\phi) -t }{1-2\phi} \phi^n + \frac{t - s\phi}{1-2\phi} (1-\phi)^n$

Expanding this expression you’ll end up with a rather complex expression involving cross terms of $\phi^n$ and $(1-\phi)^n$

$\displaystyle b_n = \frac{s\phi^n ( 1- \phi) - t\phi^n + t(1-\phi)^n - s\phi(1-\phi)^n}{1-2\phi}$

But that is where the complexity ends. Observe that

$\displaystyle \phi(1-\phi) = \frac{1+\sqrt{5}}{2}\cdot \frac{1-\sqrt{5}}{2}$
$\displaystyle = \frac{1-5}{4} = -1$

Using this result, we can write

$\phi^{n}(1-\phi) = \phi^{n-1}\phi (1-\phi) = -\phi^{n-1}$
$\phi (1-\phi)^n = \phi (1-\phi) (1-\phi)^{n-1} = - (1-\phi)^{n-1}$

Furthermore, we also observe that

$\displaystyle 1-2\phi = 1 - \frac{1+ \sqrt{5}}{2} = -\sqrt{5}$

Using all these results we have

$\displaystyle b_n = - s\phi^{n-1} - t\phi^n + t(1-\phi)^n + s(1-\phi)^{n-1}$
$\displaystyle = s \Big[ \frac{(\phi^{n-1} - (1-\phi)^{n-1}}{\sqrt{5}} +\Big] t \Big[ \frac{(\phi^n - (1-\phi)^n}{\sqrt{5}}\Big]$

Notice that the expressions in square brackets are just fibonacci expressions for n-1 and n, respectively. If we use them in the expression for $b_n$, we can express it in terms of the fibonacci numbers

$\displaystyle b_n = sf_{n-1} + tf_n$

Returning to our computation for the height of the AVL tree, since $b_0 = s = 1$ and $b_1 = t = 2$, we finally have

$\displaystyle b_n = f_{n-1} + 2f_n$

We can further simplify this by

$\displaystyle b_n = f_{n-1} + f_n + f_n$
$\displaystyle = \Big(f_{n-1} + f_n\Big) + f_n$
$\displaystyle = f_{n+1} + f_n$
$\displaystyle = f_{n+2}$

Since $b_n = N_h + 1$,

$\displaystyle N_h + 1 = f_{n+2}$
$\displaystyle N_h = f_{n+2} -1$

The formula above expresses the number of nodes of the AVL tree in terms of the height h. In terms of the golden ratio, we can write the last expression as

$\displaystyle N_h = \frac{\phi^{n+2} - (1-\phi)^{n+2}}{\sqrt{5}} - 1$

Approximating the nth Fibonacci Number

We can further simplify the above expression by observing that the nth fibonacci number is approximately equal to $\phi^n/\sqrt{5}$. To see this, we compute the distance of $f_n$ to $\phi^n/\sqrt{5}$:

$\displaystyle |f_n - \frac{\phi^n}{\sqrt{5}}| = |\frac{\phi^n - (1-\phi)^n}{\sqrt{5}} - \frac{\phi^n}{\sqrt{5}} | = |\frac{(1-\phi)^n}{\sqrt{5}}| < \frac{1}{\sqrt{5}} < \frac{1}{2}$

Using this approximation, we can write

$\displaystyle N_h = \frac{\phi^{h+2}}{\sqrt{5}} -1$

So if n is the number of nodes in an AVL tree, we can estimate the height using the above formula, that is

$\displaystyle n \approx \frac{\phi^{h+2}}{\sqrt{5}} -1$
$\displaystyle n + 1 \approx \frac{\phi^{h+2}}{\sqrt{5}}$
$\displaystyle \sqrt{5}(n+1) \approx \phi^{h+2}$
$\displaystyle h \approx \log_{\phi} \Big( \sqrt{5} (n+1)\Big) -2$

Using a change of base

$\displaystyle n \approx \frac{\log_2 \Big( \sqrt{5}(n+1)\Big)}{\log_2 \phi} -2$
$\displaystyle \approx \frac{\log_2\Big( \sqrt{5}(n+1)\Big) - 2\log_2\phi}{\log_2 \phi}$
$\displaystyle \approx \frac{1}{\log_2 \phi} \cdot \log_2\Big(\frac{\sqrt{5}(n+1)}{\phi^2}\Big)$

Again using the fibonacci number approximation, we have $\sqrt{5}/\phi^2 \approx f_2 = 1$. This simplifies our estimate to

$\displaystyle h \approx 1.44\log_2(n+1)$

where $1/\log_2 \phi = 1.44$.

If we have an array of a million names, then the worst case running time of the binary search is

$\displaystyle 1.44\log_2(1000000 + 1) = 28.70$.

That's incredibly efficient!

January 29, 2011

## A Recurring Theme

One of the goals of these series of articles is to give a flavor of how we programmers get a feel of the running time of our algorithms. In the last article, we showed how fast a binary search is compared to a linear search using simple arguments that anyone can do. We do this routinely in elementary algebra where we get the solution of a quadratic equation and we try to see if we can reduce it to its common factors. However for general quadratic equations, we can always apply the quadratic formula to solve for the unknown. In this article, we are going to investigate solving for the running time of a binary search using a technique called recurrence relations.

We shall repeat below the listing of the binary search algorithm:

algorithm binarySearch( Array A):
1. get the middle element A[middle]
2. compare x with A[middle]. If x equals A[middle] return middle.
3. if x less than A[middle], return binary search(A[1:middle-1]
4. else return binarySearch(A[middle + 1, n])


Floor and Ceiling functions

In reality, we don’t always have arrays that come in lengths of powers of 2. So when we halve an array, it does not necessarily divide evenly. This means that we can get a number that is not an integer. To deal with situations like this, we define the floor and ceiling functions.

The floor function, when it operates on a real number $x$, returns the largest whole number that is less than $x$. On the other hand, the ceiling function, when it operates on an integer $x$ returns the smallest whole number that is greater than $x$. We symbolize the floor function acting on a real number $x$ as $\lfloor x \rfloor$ and the ceiling function is denoted by $\lceil x\rceil$. For example, if $x = 3.14$, then $\lfloor x \rfloor = 3$ and $\lceil x \rceil = 4$. Now suppose, we have an array of length 10. If we halve this, we have 5. Halving again, we get 2.5, which is not an integer. So we either get the floor to get 2 or the ceiling to get 3.

Continuing with the Analysis

In the binary search algorithm defined above, we did not specify how to compute “mid”. For the sake of our discussion, let us take $mid = \lfloor x/2 \rfloor$, where $x$ is the length of the input array of the algorithm.

We observe from the binary search algorithm above the following:

– if the length $x$ is an odd number, the array can be split into two equal parts which is equal to $\lfloor x \rfloor$. If the length is even, then the length of the right subarray is $\lfloor x/2 \rfloor -1$ and the right-hand side has length $\lfloor x/2 \rfloor$. Since we are interested in the worst-case complexity of this algorithm, we take $\lfloor x/2 \rfloor$ as the input to the next iteration.
– Every iteration has one comparison and it is comparing the search number to the middle element.
– The total number of comparisons is therefore equal to 1 plus the number of comparisons for the subarray of size $\lfloor x/2 \rfloor$.

Mathematically, this means that if $n$ is the size of the array and $a_n$ is the number of comparisons executed by the binary search for an array of size $n$, then

$\displaystyle a_n = a_{\lfloor n/2 \rfloor} + 1$

where $a_{\lfloor n/2 \rfloor}$ is the number of comparisons needed to run the binary search for an array of $\lfloor n/2 \rfloor$ elements. The equation above is what is called a recurrence relation for the algorithm. In this case, it is the recurrence relation for the binary search algorithm. In the special case that the array has only one element, we impose the condition that

$\displaystyle a_1 = 1$

because there is only one comparison involved.

A Brute Force Solution

One way to solve this recurrence relation is by listing the first few terms of this recurrence relation.
$\displaystyle a_1 = 1*$
$\displaystyle a_2 = a_{\lfloor 2/2 \rfloor} + 1 = a_{1} + 1 = 1 + 1 = 2*$
$\displaystyle a_3 = a_{\lfloor 3/2 \rfloor} + 1 = a_{1} + 1 = 1 + 1 = 2$
$\displaystyle a_4 = a_{\lfloor 4/2 \rfloor} + 1 = a_{2} + 1 = 2 + 1 = 3*$
$\displaystyle a_5 = a_{\lfloor 5/2 \rfloor} + 1 = a_{2} + 1 = 2 + 1 = 3$
$\displaystyle a_6 = a_{\lfloor 6/2 \rfloor} + 1 = a_{3} + 1 = 2 + 1 = 3$
$\displaystyle a_7 = a_{\lfloor 7/2 \rfloor} + 1 = a_{3} + 1 = 2 + 1 = 3$
$\displaystyle a_8 = a_{\lfloor 8/2 \rfloor} + 1 = a_{4} + 1 = 3 + 1 = 4*$
$\displaystyle a_9 = a_{\lfloor 9/2 \rfloor} + 1 = a_{4} + 1 = 3 + 1 = 4$
$\displaystyle a_{10} = a_{\lfloor 10/2 \rfloor} + 1 = a_{5} + 1 = 3 + 1 = 4$
$\displaystyle a_{11} = a_{\lfloor 11/2 \rfloor} + 1 = a_{5} + 1 = 3 + 1 = 4$
$\displaystyle a_{12} = a_{\lfloor 12/2 \rfloor} + 1 = a_{6} + 1 = 3 + 1 = 4$
$\displaystyle a_{13} = a_{\lfloor 13/2 \rfloor} + 1 = a_{6} + 1 = 3 + 1 = 4$
$\displaystyle a_{14} = a_{\lfloor 14/2 \rfloor} + 1 = a_{7} + 1 = 3 + 1 = 4$
$\displaystyle a_{15} = a_{\lfloor 15/2 \rfloor} + 1 = a_{7} + 1 = 3 + 1 = 4$
$\displaystyle a_{16} = a_{\lfloor 16/2 \rfloor} + 1 = a_{8} + 1 = 3 + 1 = 5*$

and so on…

We now start to see a pattern! Look at the lines with * in them. We have a * when

$\displaystyle n = 1 = 2^0, \text{ and } a_{n} = 1$
$\displaystyle n = 2 = 2^1, \text{ and } a_{2} = 2 = 1 + 1$
$\displaystyle n = 4 = 2^2, \text{ and } a_{n} = 3 = 2 + 1$
$\displaystyle n = 8 = 2^3, \text{ and } a_{n} = 4 = 3 + 1$
$\displaystyle n = 16 = 2^4, \text{ and } a_{n} = 5= 4 + 1$

So for powers of 2, $a_n = \log_2 n + 1$. Now what happens between powers of 2?

$\displaystyle n = 3, a_{3} = 1 + 1$
$\displaystyle n = 5, a_{5} = 2 + 1$
$\displaystyle n = 6, a_{5} = 2 + 1$
$\displaystyle n = 7, a_{5} = 2 + 1$
$\displaystyle n = 9, a_{5} = 5 + 1$
$\displaystyle n = 10, a_{5} = 5 + 1$
$\displaystyle n = 11, a_{5} = 5 + 1$
$\displaystyle n = 12, a_{5} = 5 + 1$
$\displaystyle n = 13, a_{5} = 5 + 1$
$\displaystyle n = 14, a_{5} = 5 + 1$
$\displaystyle n = 15, a_{5} = 5 + 1$

So for $n$ between powers of 2, that is, $2^n \le n < 2^{n+1}$, we see that $a_{n} = \lfloor \log_2 n \rfloor + 1$. Since this expression is also satisfied for $n$ equal to powers of 2, the solution of the recurrence relation is therefore

$\displaystyle a_n = \lfloor \log_2 n \rfloor + 1$

Now, are we sure that this is true for all $n$? That's another topic we will discuss in a future post.