Archive for ‘Algorithms and Complexity’

January 29, 2011

The Big Oh!

by Bobby Corpus

We have demonstrated how basic counting is used to determine the running time of our algorithms. We have seen that the running time of traversing an array element by element is proportional to n, where n is the size of the array. We have seen that a code snippet that contains 2 loops is proportional to n^2. And we have seen that halving the input in every iteration is proportional to \log_2 n. Do we really need to know the exact number of comparisons in order to determine the running time of an algorithm? Or is an estimate sufficient? For a sufficiently large n, the answer is YES an estimate is sufficient!

In one of the articles, we found an algorithm that has a running time of \frac{n(n+1)}{2}. If we take a look at the graph of \frac{n(n+1)}{2} versus n^2, we will see that n^2 is greater than \frac{n(n+1)}{2}.

This means that \frac{n(n+1)}{2} is bounded above by n^2. To generalize this concept, we define what is known as the “Big Oh” approximation. Given two functions f(x) and g(x), we say that f is of order g, written f(x) is O(g(x)), if and only if, there exist a positive real number M and a real number x_0 such that for all x > x_0,

\displaystyle |f(x)| \le M\cdot |g(x)|, whenever x > x_0.

Using this definition, we can show that \frac{n(n+1)}{2} is O(n^2). Expanding

\displaystyle \frac{n(n+1)}{2} = \frac{n^2 + n }{2}
\displaystyle \frac{n^2 + n }{2}  \le \frac{n^2 + n^2}{2} = \frac{2n^2}{2} = n^2

By taking M = 1, we have shown that \frac{n(n+1)}{2} = O(n^2).

Another example would be computing the Big Oh of the binary search. We have seen in this post that the running time of the binary search is

\displaystyle a_n = \lfloor \log_2 n \rfloor + 1

Since \lfloor \log_2 n \rfloor \le \log_2 n and \log_2 n > 1 for n > 2, we have

\displaystyle a_n = \lfloor \log_2 n \rfloor + 1 \le \log_2 n + \log_2 n = 2\log_2 n. By taking M=2 and x_0 = 2, we have

\displaystyle |a_n| \le 2\cdot \log_2 n. Therefore, the binary search algorithm is O(\log_2 n).

January 29, 2011

A Recurring Theme

by Bobby Corpus

One of the goals of these series of articles is to give a flavor of how we programmers get a feel of the running time of our algorithms. In the last article, we showed how fast a binary search is compared to a linear search using simple arguments that anyone can do. We do this routinely in elementary algebra where we get the solution of a quadratic equation and we try to see if we can reduce it to its common factors. However for general quadratic equations, we can always apply the quadratic formula to solve for the unknown. In this article, we are going to investigate solving for the running time of a binary search using a technique called recurrence relations.

We shall repeat below the listing of the binary search algorithm:

algorithm binarySearch( Array A): 
1. get the middle element A[middle]
2. compare x with A[middle]. If x equals A[middle] return middle.
3. if x less than A[middle], return binary search(A[1:middle-1]
4. else return binarySearch(A[middle + 1, n])

Floor and Ceiling functions

In reality, we don’t always have arrays that come in lengths of powers of 2. So when we halve an array, it does not necessarily divide evenly. This means that we can get a number that is not an integer. To deal with situations like this, we define the floor and ceiling functions.

The floor function, when it operates on a real number x, returns the largest whole number that is less than x. On the other hand, the ceiling function, when it operates on an integer x returns the smallest whole number that is greater than x. We symbolize the floor function acting on a real number x as \lfloor x \rfloor and the ceiling function is denoted by \lceil x\rceil. For example, if x = 3.14, then \lfloor x \rfloor = 3 and \lceil x \rceil = 4. Now suppose, we have an array of length 10. If we halve this, we have 5. Halving again, we get 2.5, which is not an integer. So we either get the floor to get 2 or the ceiling to get 3.

Continuing with the Analysis

In the binary search algorithm defined above, we did not specify how to compute “mid”. For the sake of our discussion, let us take mid = \lfloor x/2 \rfloor, where x is the length of the input array of the algorithm.

We observe from the binary search algorithm above the following:

- if the length x is an odd number, the array can be split into two equal parts which is equal to \lfloor x \rfloor. If the length is even, then the length of the right subarray is \lfloor x/2 \rfloor -1 and the right-hand side has length \lfloor x/2 \rfloor. Since we are interested in the worst-case complexity of this algorithm, we take \lfloor x/2 \rfloor as the input to the next iteration.
- Every iteration has one comparison and it is comparing the search number to the middle element.
- The total number of comparisons is therefore equal to 1 plus the number of comparisons for the subarray of size \lfloor x/2 \rfloor.

Mathematically, this means that if n is the size of the array and a_n is the number of comparisons executed by the binary search for an array of size n, then

\displaystyle a_n = a_{\lfloor n/2 \rfloor} + 1

where a_{\lfloor n/2 \rfloor} is the number of comparisons needed to run the binary search for an array of \lfloor n/2 \rfloor elements. The equation above is what is called a recurrence relation for the algorithm. In this case, it is the recurrence relation for the binary search algorithm. In the special case that the array has only one element, we impose the condition that

\displaystyle a_1 = 1

because there is only one comparison involved.

A Brute Force Solution

One way to solve this recurrence relation is by listing the first few terms of this recurrence relation.
\displaystyle a_1 =  1*
\displaystyle a_2 = a_{\lfloor 2/2 \rfloor} + 1 = a_{1} + 1 = 1 + 1 = 2*
\displaystyle a_3 = a_{\lfloor 3/2 \rfloor} + 1 = a_{1} + 1 = 1 + 1 = 2
\displaystyle a_4 = a_{\lfloor 4/2 \rfloor} + 1 = a_{2} + 1 = 2 + 1 = 3*
\displaystyle a_5 = a_{\lfloor 5/2 \rfloor} + 1 = a_{2} + 1 = 2 + 1 = 3
\displaystyle a_6 = a_{\lfloor 6/2 \rfloor} + 1 = a_{3} + 1 = 2 + 1 = 3
\displaystyle a_7 = a_{\lfloor 7/2 \rfloor} + 1 = a_{3} + 1 = 2 + 1 = 3
\displaystyle a_8 = a_{\lfloor 8/2 \rfloor} + 1 = a_{4} + 1 = 3 + 1 = 4*
\displaystyle a_9 = a_{\lfloor 9/2 \rfloor} + 1 = a_{4} + 1 = 3 + 1 = 4
\displaystyle a_{10} = a_{\lfloor 10/2 \rfloor} + 1 = a_{5} + 1 = 3 + 1 = 4
\displaystyle a_{11} = a_{\lfloor 11/2 \rfloor} + 1 = a_{5} + 1 = 3 + 1 = 4
\displaystyle a_{12} = a_{\lfloor 12/2 \rfloor} + 1 = a_{6} + 1 = 3 + 1 = 4
\displaystyle a_{13} = a_{\lfloor 13/2 \rfloor} + 1 = a_{6} + 1 = 3 + 1 = 4
\displaystyle a_{14} = a_{\lfloor 14/2 \rfloor} + 1 = a_{7} + 1 = 3 + 1 = 4
\displaystyle a_{15} = a_{\lfloor 15/2 \rfloor} + 1 = a_{7} + 1 = 3 + 1 = 4
\displaystyle a_{16} = a_{\lfloor 16/2 \rfloor} + 1 = a_{8} + 1 = 3 + 1 = 5*

and so on…

We now start to see a pattern! Look at the lines with * in them. We have a * when

\displaystyle n = 1 = 2^0, \text{ and } a_{n} = 1
\displaystyle n = 2 = 2^1, \text{ and } a_{2} = 2 = 1 + 1
\displaystyle n = 4 = 2^2, \text{ and } a_{n} = 3 = 2 + 1
\displaystyle n = 8 = 2^3, \text{ and } a_{n} = 4 = 3 + 1
\displaystyle n = 16 = 2^4, \text{ and } a_{n} = 5= 4 + 1

So for powers of 2, a_n = \log_2 n + 1. Now what happens between powers of 2?

\displaystyle n = 3, a_{3} = 1 + 1
\displaystyle n = 5, a_{5} = 2 + 1
\displaystyle n = 6, a_{5} = 2 + 1
\displaystyle n = 7, a_{5} = 2 + 1
\displaystyle n = 9, a_{5} = 5 + 1
\displaystyle n = 10, a_{5} = 5 + 1
\displaystyle n = 11, a_{5} = 5 + 1
\displaystyle n = 12, a_{5} = 5 + 1
\displaystyle n = 13, a_{5} = 5 + 1
\displaystyle n = 14, a_{5} = 5 + 1
\displaystyle n = 15, a_{5} = 5 + 1

So for n between powers of 2, that is, 2^n \le n < 2^{n+1}, we see that a_{n} = \lfloor \log_2 n \rfloor + 1. Since this expression is also satisfied for n equal to powers of 2, the solution of the recurrence relation is therefore

\displaystyle a_n = \lfloor \log_2 n \rfloor + 1

Now, are we sure that this is true for all n? That's another topic we will discuss in a future post.

January 22, 2011

Divide et Impera

by Bobby Corpus

What’s the difference between managing 5 people and managing 25 people? None (Surprise!). Group the people into 5 and pick a leader in each group. Each leader manages his/her group and you manage the 5 leaders.

This strategy is known as divide and conquer and has been used by many successful military leaders like Julius Ceasar who has been attributed the phrase “divide et impera” or Divide and Conquer. Not surprisingly, this is also used in programming.

Imagine you are asked to search a number in a sorted array of 1 million numbers. How will you search for it? A novice programmer will probably search for it one element at a time. The code would end up like this:


public int findNumber(int numToFind, int[] array){

  for( int i=0;i<1000000; i++){
    if(a[i] == numToFind)
       return i;
  }

  return -1;
}

What’s the worst case that can happen when searching for a number in a sorted array? When the number is the last number in the array or is not even in the array! When that happens, the number of times you traverse the “for loop” is 1 million! You can just imagine how long it takes for this algorithm to execute if you are now searching on an array of 1 trillion numbers.

However, if you make use of the fact that the array is sorted, something beautiful happens. You can search the array of 1 million numbers in less than 20 comparisons! The way to do it is to start searching at the middle. If x is the number you are searching, compare x with the middle element. If x is equal to it, then you’ve found your number. If not, then either x is greater than the middle element or less than the middle element. If you’ve found that x is less than the middle element, then obviously x is not in the right side of the array but can be on the left side. I say “can be” because x might not even be in the array. We can write this in a more structured manner

algorithm binarySearch( Array A): 
1. get the middle element A[middle]
2. compare x with A[middle]. If x equals A[middle] return middle.
3. if x less than A[middle], return binary search(A[1:middle-1]
4. else return binarySearch(A[middle + 1, n])

So how many times do we compare x to the middle element? How many comparisons are there in each iteration? Only one. Now it only remains to count the number of iterations for any sorted integer array of size n. To make things simpler, let us assume that n = 2^m, for some m (that is, n is a power of 2). At every iteration, we split the array size into two equal parts. The final iteration occurs when you have sufficiently divided the array so that there is only one element. So when does this happen? This happens when you have halved the array m number of times, or mathematically:

\displaystyle m = \log_2 n

Therefore, if you are given a sorted array of 1 million integers, the number of comparisons you need is only

\displaystyle \log_2 1000000 = 19.9

or about 20 comparisons. That’s amazingly fast!

January 17, 2011

A Combinatorial Solution

by Bobby Corpus

In the last post, we computed the number of times the “Hello World” was printed when executing the code snippet below:


for(int i=0;i<10;i++){
   for(int j=0;j<i;j++){
     for(int k=0;k<j;k++){
         System.out.println("Hello World");
     }
   }
}

The computation that we did last time was too cumbersome. It turns out that there is a much better way to do it.

Imagine dividing a sheet of paper into ten regions by match sticks as shown below.


How many match sticks do you need to have ten regions? Nine. Suppose you have 3 blue chips. Distribute these chips into any region. They can be lumped together as shown in the figure above or you can put one chip in region 1 and two chips in region 2.


Or you can put a chip in each region.

Now, how does this relate to our original problem? By interpreting the regions as corresponding to the values 1 to 10 and the chips as the variables, whenever a chips is in a region, it assumes that value. So for example, in the first figure, all the chips are in region 1, so we say that the variables i,j,k all have values equal to 1. The second figure is more tricky, we have one chip in the first region, and 2 chips in the second region. So what are the values of i,j,k ? We get a hint from the code snippet itself. Since k \le j \le i, we interpret the leftmost chip as the k variable, the middle chip as the j variable and the rightmost chip as the i variable. So, in the second figure, the values of i,j,k are i=2, j=2, and k=1. In the third figure, the values of i,j,k are i=3,j=2, k=1.

The problem is now reduced to counting the number of ways of finding the positions of 3 chips out of 12 positions ( 9 matchings and 3 chips):

\displaystyle {{9+3}\choose{3}}

which is equal to 220, the number of times the “Hello World” is printed as we have seen in the previous post.

In general, if N is the range of values of i,j,k,

\displaystyle C = {{N+r -1}\choose{r}}

is the number of times the “Hello World” is printed.

From the previous post, we know that

\displaystyle C = \sum_{i=1}^N\sum_{j=1}^i j

Therefore,

\displaystyle C = \sum_{i=1}^N\sum_{j=1}^i j = \frac{1}{2}\Big(\sum_{i=1}^N i^2 + \sum_{i=1}^N i \Big)= {{N+r -1}\choose{r}}

As an aside, we can compute for the sum of squares of the first n numbers from the expression above:

\displaystyle \frac{1}{2}\Big(\sum_{i=1}^N i^2 + \sum_{i=1}^N i \Big)= {{N+r -1}\choose{r}}
\displaystyle \frac{1}{2}\Big(\sum_{i=1}^N i^2 + \sum_{i=1}^N i \Big)= \frac{(N+2)(N+1)(N)(N-1)!}{3! (N-1)!}
\displaystyle \sum_{i=1}^N i^2 = \frac{(N+2)(N+1)(N)}{3} - \frac{N(N+1)}{2}
\displaystyle = \frac{n(2n+1)(n+1)}{6}

In summary, the combinatorial solution given above is much more elegant as it gives us the answer without too much computation.

January 11, 2011

Some More Counting Techniques

by Bobby Corpus

Some people were not amused by the title of the previous post. It was in a way misleading. So I’m now going to give an appropriate title to this post, albeit a rather boring title.

In this article, let’s try to count the number of times the “Hello World” is printed by the code snippet below:


for(int i=0;i<10;i++){
   for(int j=0;j<i;j++){
     for(int k=0;k<j;k++){
         System.out.println("Hello World");
     }
   }
}

Let’s try to count for the first few values of i,j and k.

1,1,1 ----> 1

2,1,1 ----> 1 + 2
2,2,1
2,2,2

3,1,1 ----> 1 + 2 + 3
3,2,1
3,2,2
3,3,1
3,3,2
3,3,3

4,1,1 ----> 1 + 2 + 3 + 4
4,2,1
4,2,2
4,3,1
4,3,2
4,3,3
4,4,1
4,4,2
4,4,3
4,4,4

Looking at the pattern above, you can see that when i=1, the “Hello World” is printed only once. When i=2, it is executed 1 + 2 = 3. When i=3, it is executed 1 + 2 + 3 = 6 times, and when i=4, it is executed 1 + 2 + 3 + 4 = 10 times. So how many times is the statement executed when i=5? Based on the pattern we can see that it will be executed 1 + 2 + 3 + 4 + 5 times.

However, what we are after is the sum of the number of executions from i=1 to i=10. If we denote by C the total number of executions, we can write this mathematically as:

\displaystyle C = \sum_{i=1}^{10} \sum_{j=1}^i j

From the previous post we know that

\displaystyle C = \sum_{j=1}^i j = \frac{i(i+1)}{2}

Hence, we have

\displaystyle C = \sum_{i=1}^{10} \frac{i(i+1)}{2} = \frac{1}{2}\sum_{i=1}^{10} i(i+1)

Multiplying out the summand we have

\displaystyle C = \frac{1}{2}\sum_{i=1}^{10} ( i^2 + i )

Distributing the summation:

\displaystyle C = \frac{1}{2}\Large(\sum_{i=1}^{10} i^2 + \sum_{i=1}^{10} i\Large)

Let’s simplify this expression for a general n

\displaystyle C = \frac{1}{2}\Large( \sum_{i=1}^{n} i^2 + \sum_{i=1}^{n} i \Large)

We know that the second summation is equal to

\displaystyle \sum_{i=1}^{n} i = \frac{n(n+1)}{2}

The first summation is the sum of squares of the first n numbers. To evaluate this, let’s tackle the sum of cubes (*):

\displaystyle \sum_{i=0}^{n} i^3  = \sum_{i=0}^{n} (i+1)^3  - (n + 1)^3

You might be wondering how we arrived at the right-hand side. If you write the summation explicitly, say for n=5, the right-hand side looks like:

\displaystyle \sum_{i=0}^{5} (i+1)^3  - (5 + 1)^3 = (0+1)^3 + (1+1)^3 + (2+1)v + (3+1)^3+ (4+1)^3 + (5+1)^3 - (5 +1)^3
\displaystyle = 1^3 + 2^3 + 3^3 + 4^3 +5^3 = \sum_{i=0}^{5} i^3

Going back to the right-hand side, let’s expand it to get:

\displaystyle \sum_{i=0}^{n} i^3  = \sum_{i=0}^{n} (i^3 + 3i^2 + 3i +1) - (n+1)^3
\displaystyle = \sum_{i=0}^{n} i^3 + 3 \sum_{i=0}^{n} i^2 + 3 \sum_{i=0}^{n} i + \sum_{i=0}^{n} 1 - (n+1)^3

Canceling the term containing the i^3 and solving for the \sum_{i=0}^{n} i^2, we get:

\displaystyle 3\sum_{i=0}^{n} i^2= (n+1)^3 - 3\sum_{i=0}^{n} - \sum_{i=0}^{n} 1

The last term is just equal to n+ 1. Plugging this value and simplifying we get:

\displaystyle 3\sum_{i=0}^{n} i^2 = n^3 + 3n^2 +3n + 1 - 3\frac{n(n+1)}{2} - n - 1
\displaystyle = n^3 + 3n^2 + 2n - \frac{3n^2 - 3n}{2}
\displaystyle = \frac{2n^3 + 6n^2 + 4n - 3n^2 - 3n}{2}
\displaystyle = \frac{2n^3 + 3n^2 + n}{2}
\displaystyle = \frac{n(2n^2 + 3n + 1)}{2} = \frac{n(n+1)(2n+1)}{2}

This finally gives us

\displaystyle \sum_{i=0}^{n} i^2  = \frac{n(n+1)(2n+1)}{6}

Using this expression in the value of C gives us:

\displaystyle C = \frac{1}{2} \Bigg( \sum_{i=1}^{n} i^2 + \sum_{i=1}^{n} i\Bigg)
\displaystyle C = \frac{1}{2} \Bigg(  \frac{n(n+1)(2n+1)}{6} + \frac{n(n+1)}{2}\Bigg)
\displaystyle = \frac{1}{6} n(n+1)(n+2)

Substituting n=10, we get the total number of executions for the “Hello World”, which is

\displaystyle C = \frac{1}{6} 10 \times 11 \times 12 = 220.

There is a much easier way of solving this which we shall cover in the next post.

* See this page for more information on the sum of squares of the first n numbers.

January 6, 2011

How Do I love Thee? Let Me Count The Ways…

by Bobby Corpus

Counting is perhaps one of the very first skills we learn as a child. I remember, when I was small, watching Count von Count on Sesame Street teaching kids to count, in a rather frightening way! Even now, counting is still a very important skill for me in my profession as a programmer.

Counting allows me to determine if my program will run slow or fast. It allows me to determine if I need to a wait a couple of seconds, minutes, hours or even centuries for my computation to finish. Even with the proliferation of supercomputers with teraflops of computing power, an algorithm that is inherently slow will take forever to compute as the input data becomes large enough.

So How Do I Determine If My Algorithm Is Slow?

Let me illustrate the point by applying counting to the two Sorting Algorithms below. In both algorithms, the input is an array of unsorted numbers.

Algorithm 1: Find the least element in the list. This element is moved to the front. Then the first element among the remaining elements is found and put into the second position. This procedure is repeated until the entire list has been sorted.

Algorithm 2: Take the first element a1 and form 2 sublists, the first containing those elements less than a1, in the order they arise, and the second containing all elements greater than a1, in the order they arise. Then a1 is put at the end of the sublist. Repeat recursively for each sublist until all sublists contain one item.

When you sort a list, the thing which you always do is comparing one element to the next. This is a fundamental operation of sorting. And this operation is what we are going to count. The algorithm which executes the most comparisons will loose.

Analysis of Algorithm 1

Imagine we have an array A of n integers in unsorted order. According to algorithm 1, to sort this array, we first need to find the least element of array A. How many numbers do we need to examine in array A in order to determine the least element? The thing is, we will never know which element is the smallest unless we examine “every” number in the array. But there are n numbers in array A. Therefore, in the first iteration, the number of comparisons is n.

After the first iteration, we have identified the least element in the array A. In the second iteration, we apply the same technique of finding the least element in the remaining list of numbers. So how many comparisons do we need to identify the next smallest number? Since the remaining list contains n-1 numbers, therefore, the number of comparisons is n-1.

There is already a pattern emerging here. It seems that in the third iteration, we will need n-2 comparisons. This is indeed the case. So if we put our discovery in a mathematical form, the total number of comparisons C to sort the entire array is therefore:

\displaystyle C = n + (n - 1) + (n - 2) + ... + 2 + 1 = \sum_{i=0}^{n-1} (n -i )

We can rearrange the sum of terms such that the summation looks like this:

\displaystyle (n + 1) + (n - 1 + 2) + (n - 2 + 3) ... = (n+1) + (n+1) + (n+1) + ....

If n is even, then we have exactly \frac{n}{2} terms of (n+1), that is,:

\displaystyle \sum_{i=0}^{n-1} (n - i) = \frac{ n ( n + 1)}{2}

What happens when n is odd? For example, if n = 5, then

\displaystyle 5 + 4 +3 +2 +1 = (5+1) + (4+2) + 3 = 6 + 6 + 3 = 2(6) + 3

If you take a look at the rightmost side, you can see a pattern. It seems that for an arbitrary integer n, we have

\displaystyle \sum_{i=0}^{n-1} (n - i)  = \frac{n-1}{2} (n+1) + \frac{n+1}{2}
\displaystyle \sum_{i=0}^{n-1} (n - i)  = \frac{n^2 -1 + n + 1}{2} = \frac{n^2 + n}{2} = \frac{n(n+1)}{2}

This means that for any n, whether odd or even, the number of comparisons executed by Algorithm 1 is

\displaystyle \sum_{i=0}^{n-1} (n - i) = \frac{ n ( n + 1)}{2}

Observe that you can also write

\displaystyle \sum_{i=0}^{n-1} (n - i) = \sum_{i=1}^n  i= \frac{ n ( n + 1)}{2}

Analysis of Algorithm 2

On the average, we can assume that whenever we pick the first element a1, half of the list will be less than a1 and the other half is greater than a1. To make things simpler, let’s further assume that the number of elements n is a power of 2, that is, n = 2^m. This is will ensure us that whenever we halve, we always get a whole number.

So, on the zeroth iteration, the list will be split into two equal parts of size 2^{m}/2 = 2^{m - 1}. Below is what the input will be like after the first iteration.

Therefore, in the zeroth iteration, the total number of comparisons is 2^{m-1}.

In the first iteration, the input to the algorithm are the two lists we got above. Since the size of each list is 2^{m-1}, when we halve this list, we get 2 lists of size 2^{m-1}/2 = 2^{m-2}. This is also the number of comparisons we need for each list. Since there are 2 lists at this iteration, the number of comparisons at this iteration is therefore 2\times 2^{m-2}.

On the second iteration, there are 4 lists that are input to the algorithm. Each list has a size of 2^{m-2}. Halving each list, we get 2 lists of size 2^{m-3}. This is also the number of comparisons needed to half each input list. Since there are 4 lists in total at this iteration, the total number of comparisons is therefore 4\times 2^{m-3} = 2^2\times 2^{m-3}.

At this point, let’s pause for a while and see what we have:

Iteration Number # of Comparisons
0 2^0 \times 2^{m-1}
1 2^1 \times 2^{m-2}
2 2^2 \times 2^{m-3}

At this point, you will realize that a pattern is emerging. The pattern is this, for each iteration i, the number of comparisons is 2^{i}\times 2^{m-i -1}.

You can continue this pattern until the number of elements of each list is equal to 1. This will occur on the iteration i = m -1 . If you total the number of comparisons across all iterations, you will get the total number of comparisons C executed by the algorithm, which is

\displaystyle C = \sum_{i=0}^{m-1} 2^i\times 2^{m-i -1}

Applying the rule of exponents:

\displaystyle = \sum_{i=0}^{m-1} 2^{m-1}

Since the term to be summed is independent of m, this simply reduces to adding 2^{m-1} m times:

\displaystyle = m2^{m-1}

Since we assumed n = 2^m, then taking the base 2 logarithms of both sides we get m = \lg_2 m. Using this result above, we get

\displaystyle C = \frac{n \lg_2 n }{2}

The Winner

By counting, we were able to compute the number of comparisons for algorithms 1 and 2. Below is a table of the results of our counting:

Algorithm # of Comparisons for input of size n
1 \displaystyle \frac{ n ( n + 1)}{2}
2 \displaystyle \frac{n \lg_2 n }{2}

Here is a comparison for various values of n

      n algorithm1 algorithm2
1     1          1          0
2   100       5050        333
3  1000     500500       4983
4 10000   50005000      66439

For an input of size ten thousand, algorithm 1 will take 50 million comparisons as compared to algorithm 2 which takes only 66 thousand comparisons. And the winner is Algorithm 2!

December 28, 2010

Good Friends Are Hard To …Compute !

by Bobby Corpus

Almost 2 years ago, I ran an algorithm to produce the maximal cliques of my facebook friends. My experiment was made possible because of the old facebook REST api. You can find the description of my experiment from this site.

Recently, when I had 351 friends, I tried to do the same experiment, now with a much larger size. To my disappointment, it took forever to compute the maximal cliques.

Why would a network of size 81 take only 315 seconds to compute while a network of size 351 takes all of your patience away? The answer is in the time complexity of the maximal clique algorithm.

The maximal clique algorithm* is as follows:

\text{ALLCLIQUES} (l)
\text{global } X, C_l \text{ }(l=0,1,...,n-1)
\text{comment: every } X \text { generated by the algorithm is a clique}
\text{if } l=0 \text { then output}([])
\text { else output } ([x_0,...,x_{l-1}])
\text {if } l = 0
\text { then } N_l \gets V
\text { else } N_l \gets A_{x_{l-1}} \cap N_{l-1}
\text {if } N_l=\emptyset
\text { then } \{x_0,...,x_{l-1}\} \text { is a maximal clique}
\text {if } l = 0
\text { then } C_l \gets V \text { else } C_l \gets A_{x_{l-1}} \cap B_{x_{l-1}} \cap C_{l-1}
\text {for each } x \in C_l
\text {do } \begin{cases} x_l \gets x \\ \text{ALLCLIQUES} (l + 1) \end{cases}

How this works

Before you call this algorithm, you first number your vertices from 0 to n-1. This is to allow the algorithm to sort the vertices according the number that was assigned to them.

This algorithm is called with the value of l=0. At this point, the algorithm will output the string “[]“, which stands for an empty clique. The set N_l is initialized to the entire vertex set V, the choice set C_l is also initialized to V. The sets A_v and B_v are precomputed before the algorithm starts and are defined as follows: (Here, E is defined as the set of vertices)

A_v = \{ u \in V: \{ u,v\} \in E\}
B_v = \{ u \in V: u > v \}

What this means in layman’s terms is that the set A_v is the set of all vertices connected to v and the set B_v is the set of all vertices whose number is greater than v.

Using the definition above, the choice set C_l is defined as:

\displaystyle C_l = A_{x_{l-1}} \cap B_{x_{l-1}} \cap C_{l-1}

The set N_l is defined as

\displaystyle N_l = N_{l-1}\cap A_{x_{l-1}}

When N_l = \emptyset, we say that the generated set of vertices, X=[x_0,...,x_{l-1}] is a maximal clique.

For example, suppose we have the following graph:

Running the algorithm over this graph would generate:

The “*” indicates that the generated vertices form a clique.

So how do we answer the question: “Why would a network of size 81 take only 315 seconds to compute while a network of size 351 a very long time”? We compute the running time of the algorithm.

Average Case Analysis

Let n be the set of nodes. The number of ways you can connect one node to another node is {{n}\choose{2}}, these correspond to the number of possible vertices which you can make a graph G. Given this set, the power set corresponds to the number of possible graphs of you can create using n nodes. The cardinality of this power set is \displaystyle 2^{{n}\choose{2}}. Let us call this power set G(n).

Let G \in G(n) be an arbitrary graph and define c(G) be the number of cliques in G. By the recursive nature of the algorithm, it produces a tree, called the state space tree, for which the number of nodes is equal to c(G). Since the algorithm is invoke recursively n times, the running time of this algorithms is O(nc(G)). It remains for us to compute c(G) in the average case.

Denote by \bar c(n) the average clique size, then

\displaystyle \bar c(n) = \frac{1}{2^{{n}\choose{2}}} \sum _{G\in G(n)} c(G)

Let’s dissect the above formula. Keeping in mind what you learned in elementary math, the average value is computed over the total number of graphs is G(n), which is equal to 2^{{n}\choose {2}}. The numerator of the formula is just the sum of the number of cliques of all graphs G \in G(n). That’s how we get the formula above. Although, this does not help us compute the average complexity on first blush. We have to express this explicitly. For example, how do we compute c(G), which is the number of cliques of graph G? A more algorithmic way of counting the number of cliques of G is to get all possible set of vertices W of G and for each set W determining if it is a clique or not. Then we just count the number of cliques to get the value of c(G).

Let W be a subset of the set of vertices V of the graph G. Define a function, \chi(W) which returns 1 if W is a clique and 0 if W is not a clique:

\chi (G,W) = \begin{cases} 1 & \text{if W is a clique},\\ 0 & \text{if W is not a clique}.\end{cases}.

With the definition above, we can then express c(G) mathematically as:

\displaystyle c(G) = \sum_{W \subseteq V} \chi(G,W)

Plugging this into the formula for \bar c (n), we get

\displaystyle \bar c(n) = \frac{1}{2^{{n}\choose{2}}} \sum _{G\in G(n)} \sum_{W \subseteq V} \chi(G,W)

By the summation property, we are able to switch the order of the summation:

\displaystyle \bar c(n) = \frac{1}{2^{{n}\choose{2}}} \sum_{W \subseteq V} \sum _{G\in G(n)} \chi(G,W)

Let’s now focus on the inner summation. It is the summation of the value \chi(G,W) over all graph G. Remember that \chi(G,W) is 1 if W is a clique of G for a specific W and 0 if not. Imagine W to be a clique. Then all its edges are contained in G. How many are the edges of W then? It is the number of vertices in W, which is the cardinality of W, given by |W| taken 2 at a time, or 2^{{|W|} \choose {2}}. How many possible cliques can you form with W as a subset? To compute this, first note that the remaining edges of G, if we take away the edges formed by W is \displaystyle {{n}\choose {2}} - {{|W|}\choose {2}}. If we take the power set of this remaining edges union each subset of it with W, we will get all possible graphs with W as a clique. Therefore,

\displaystyle \sum _{G\in G(n)} \chi(G,W) = 2^{ {{n}\choose {2}} - {{|W|}\choose {2}}}

Plugging this into the expression for \bar c(n) will give us:

\displaystyle \bar c(n) = \frac{1}{2^{{n}\choose{2}}} \sum_{W \subseteq V} 2^{ {{n}\choose {2}} - {{|W|}\choose {2}}}

Let us now attack this expression. Let’s imagine W has k vertices, i.e., |W| = k. Then there are {{n}\choose {k}} subsets of V of size k. The summation can now be written as:

\displaystyle \bar c(n) = \frac{1}{2^{{n}\choose{2}}} \sum_{k=0}^n {{n}\choose {k}} 2^{ {{n}\choose {2}} - {{k}\choose {2}}}

By canceling the factor {2^{{n}\choose{2}}}, we get:

\displaystyle \bar c(n) = \sum_{k=0}^n {{n}\choose {k}} 2^{ - {{k}\choose {2}}}

If we let t_k = {{n}\choose {k}} 2^{ - {{k}\choose {2}}}, we can simplify the expression of \bar c(n) to

\displaystyle \bar c(n) = \sum_{k=0}^n t_k.

Below is the graph of t_k versus k.

As you can see from the graph, there is a value l such that t_l \ge t_k for all k. This means that

\displaystyle \bar c(n) = \sum_{k=0}^n t_k \le \sum_{k=0}^n t_l = \underbrace{ t_l + \ldots + t_l }_{n + 1 \text{ times}}= (n + 1) t_l.

Therefore,

\displaystyle \bar c(n) \le (n+1) t_l.

It’s just a matter of getting the value of t_l. If we take the ratio of successive terms in the series, we get

\displaystyle \frac{t_k}{t_{k-1}} = \frac{n - k + 1}{k2^{k-1}}

The details of the computation can be found here.

Notice that past the point l, the value of the above ratio is less than 1, or

\displaystyle n < k-1 + k 2^{k-1}.

For the sake of simplicity, if we let k=log_2n, we have

\displaystyle n < log_2 n - 1 + log_2 n 2^{log_2 n -1 }
\displaystyle = log_2 n - 1 + \frac{nlog_2n}{2}

This inequality is true if the value of log_2 n \ge 2 because n < 2 - 1 + n = n + 1 . The implies that l \le log_2 n. Using this upper bound, we can compute the upper bound of t_l by the following process. First, observe that

\displaystyle t_l = {{n}\choose{l}}2^{-{{l}\choose {2}}} < {{n}\choose {l}}.

But

\displaystyle {{n}\choose {l}} = \frac{n!}{l! (n-l)!}= \frac{\overbrace{ n\dot (n-1)... (n-l +1)}^{l \text{ times}}}{l!} < \underbrace{n\cdot n ... n }_{l \text { times}}= n^l

Since we have shown above that l \le log_2 n,

\displaystyle t_l < n^l \le n^{log_2 n}

from which \displaystyle \bar c(n) \le (n+1)n^{log_2 n}. The average running time of the all-clique algorithm is therefore O(n^{log_2n+2}). For a graph with 81 vertices, the average number of operations is 8.250472e+15. Since it takes 315 seconds to compute this graph on my machine, my computer was able to compute 2.619197e+13 operations, on the average, in one second. For a graph with 351 vertices, the number of operations on the average is 4.092784e+26 which my computer can do in 1.562610e+13 seconds, or 495,500 years!

* For more information, read the book of Donald Kreher and Douglas Stinson entitled “Combinatorial Algorithms: Generation, Enumeration and Search”. You can find the algorithm and the analysis on page
112.

Follow

Get every new post delivered to your Inbox.