Algorithm Efficiency

Testing algorithms

In the two previous sets of notes we saw a variety of algorithms for sorting arrays and ArrayLists of data. You were given programs where example collections were either supplied or randomly generated and then displayed, so you could see for yourself that the algorithms appeared to work correctly. You were also given programs with the same sorting code but no display: a collection was randomly generated and sorted, and you were just given the timing figures for the sorting. You can use these to test the algorithms on large collections of data, much larger than could be displayed textually on a screen.

Here is a complete list of all the "timing figures only" sort programs:

When you run any one of these programs for a large collection of randomly generated data, you will find the time does not vary much from run to run so long as the amount of data being sorted is the same. But the time to sort a particular amount of data varies considerably depending on which algorithm is used. Changing the range of the integers being sorted, say from integers in the range 0-999 to integers in the range 0-999,999, makes little difference (sorting is slightly quicker for a smaller range). If you sort collections of strings rather than collections of integers, sorting takes longer, but the algorithms which sort collections of integers quickest also sort collections of strings quickest. If you run the programs on different computers, you will find the same applies: changing computer does not change which algorithm works best. What this suggests is that the algorithm used is the important factor, not the data used, or even the computer used.

On the computer I'm using at present, if I run the insertion sort programs on collections of 10,000 integers, the sorting seems to take place instantly. But if I run them on collections of 20,000 integers, the delay is long enough to notice. If I run them on collections of 100,000 integers, I have to wait over a minute for the collection to be sorted.

With the quick sort algorithms, I find I only have to wait a few seconds for a collection of 1,000,000 integers to be sorted. Consistently, quick sort and merge sort work faster than insertion sort and selection sort, and the difference is very noticeable for large amounts of data. The sort methods provided by the Java system take about the same amount of time as the quick sort and merge sort algorithms. So we can guess there isn't some other sorting algorithm which is much faster, otherwise those who put together the Java code library would have used it for the library's own sort methods.

So is this the way to choose between algorithms - program them, run them, and see how they perform? We should always test our code by running it, to see if it seems to work correctly and within reasonable time limits. But with algorithms, just as we can use mathematical techniques to prove they work correctly, so we can use mathematical techniques to show how their efficiency compares.

In this course, the emphasis is on understanding and coding algorithms. A more theoretical course would place the emphasis on their analysis, proving their correctness and efficiency. In this set of notes we will just show, very briefly and quite informally, the sort of reasoning that could be taken much further in a course on the analysis of algorithms.

Binary Search

Before considering the efficiency of sorting algorithms, let us consider the simpler problem of searching for an item in a collection. We have previously seen methods for searching for an item in an array, that is, finding whether a given item is in an array or not. Code to search an array (adapted for an array with base type Integer, but otherwise similar to before) is:
 public static boolean search(Integer[] a,Integer n)
 {
  int i;
  // the loop body does nothing: i just moves on until n is found or the end is passed
  for(i=0; i<a.length&&!a[i].equals(n); i++) {}
  return i<a.length;  // true only if the loop stopped before the end
 }
You can find this code and support code to demonstrate it in the files Find1.java and Find2.java, the first printing the array contents, the second giving timing figures. All it does is look through each of the items in the array in turn, until either it has found the one it is searching for, or it has gone through the entire array.
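
If you want to produce timing figures of your own, a harness along the following lines would do. This is a minimal sketch of my own, not the actual code of Find2.java, and the collection size, integer range and number of searches are arbitrary choices:

 import java.util.Random;

 public class SearchTiming
 {
  public static void main(String[] args)
  {
   Random rand = new Random();
   Integer[] a = new Integer[100000];
   for(int i=0; i<a.length; i++)
      a[i] = rand.nextInt(1000000);
   long start = System.currentTimeMillis();
   for(int k=0; k<1000; k++)  // time 1000 searches, as a single one is too quick to measure
      search(a,rand.nextInt(1000000));
   long end = System.currentTimeMillis();
   System.out.println("Time for 1000 searches: "+(end-start)+" milliseconds");
  }

  public static boolean search(Integer[] a,Integer n)
  {
   int i;
   for(i=0; i<a.length&&!a[i].equals(n); i++) {}
   return i<a.length;
  }
 }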

But suppose we are searching for an item in an array where the items are stored in order. We might note that if we are looking through the items one at a time, starting with the first, and therefore smallest, one, and we find an item greater than the one we are looking for before finding the one we are looking for, then we can stop searching, knowing the item cannot be in the array. Here is code which does this:

 public static boolean search(Integer[] a,Integer n)
 {
  int i;
  // stop at the first item which is greater than or equal to n
  for(i=0; i<a.length&&a[i].compareTo(n)<0; i++) {}
  return (i<a.length&&a[i].equals(n));  // found only if the loop stopped at n itself
 }
You can find this code and supporting code to demonstrate it in the files Find3.java and Find4.java.

On average, you now only have to go through half the array to show an item is not present, whereas previously you had to go through the whole array. So there is some improvement in efficiency.

But a far greater improvement in efficiency when searching for an item in a sorted array comes from another algorithm, called binary search. Suppose we look at the middle item in an array first. If it is the item we are searching for, we have found it. If the item we are searching for is less than the middle item, we can continue searching, but this time we search only those items before the middle item. If the item we are searching for is greater than the middle item, we can continue searching, but this time searching only those items after the middle item. We know the item is not in the array if the portion of the array being searched narrows down to nothing. Here is code to do that:

 public static boolean search(Integer[] a,Integer n)
 {
  return search(a,n,0,a.length);
 }

 private static boolean search(Integer[] a,Integer n,int from,int to)
 {
  if(from==to)        // the portion being searched is empty
     return false;
  else
     {
      int mid = (from+to)/2;
      int res = n.compareTo(a[mid]);
      if(res==0)      // n is the middle item
         return true;
      else if(res<0)  // n can only be in the portion before the middle item
         return search(a,n,from,mid);
      else            // n can only be in the portion after the middle item
         return search(a,n,mid+1,to);
     }
 }
You can see that the work is done in the four-argument method which takes the additional arguments of the indexes giving the portion of the array being searched. An alternative way of expressing this using iteration rather than recursion is:
 public static boolean search(Integer[] a,Integer n)
 {
  int from=0, to=a.length;
  while(from!=to)     // the portion from position from to position to-1 is not empty
     {
      int mid = (from+to)/2;
      int res = n.compareTo(a[mid]);
      if(res==0)
         return true;
      else if(res<0)
         to=mid;      // narrow the portion to the items before the middle
      else
         from=mid+1;  // narrow the portion to the items after the middle
     }
  return false;
 }
You can find the recursive code and supporting code in the files Find5.java and Find6.java, and the iterative code and supporting code in the files Find7.java and Find8.java.

You will find this binary search algorithm much quicker than the linear search. Even when searching an array of a million items, it appears to find the answer instantly, quickly enough on my computer for the timing to be less than one millisecond.

The reason for this speed is that the algorithm doesn't just cut the amount of the array being searched in half once, it does so repeatedly. It starts off with the whole array, cuts it in half to search half the array, cuts that in half to search a quarter of the array, and so on. It doesn't have to do any work to cut the array in half: it just changes the values of the integers giving the bounds. The main work done each time is to look at the middle item and compare it with the item being searched for. So the number of times the item being searched for is compared against an item in the array is, at most, the number of times the size of the array can be cut in half. Even with 1,000,000 (a million) items, cutting that in half gives 500,000, cutting that in half gives 250,000, cutting that in half gives 125,000, and so on, reducing to 1 after 20 halvings. So that means at most 20 comparisons to find whether an item is in an ordered array of a million items. Compare that with the algorithm which looks at each item in turn until either the item is found or the item looked at is greater than the item being looked for: on average it would look at 500,000 items to determine whether an item is in the array or not.
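
If you want to check the figure of 20 halvings, a throwaway fragment like this (my own sketch, not part of any of the Find files) will count them:

 int n = 1000000, count = 0;
 while(n > 1)
    {
     n = (n+1)/2;  // take the larger piece when n is odd, matching the reasoning above
     count++;
    }
 System.out.println(count);  // prints 20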

Note the efficiency has nothing to do with the fact that binary search happens to be expressed using recursion. Linear search can be expressed using recursion as well:

 public static boolean search(Integer[] a,Integer n)
 {
  return search(a,n,0);
 }

 private static boolean search(Integer[] a,Integer n,int pos)
 {
  if(pos>=a.length)           // no portion of the array left to search
     return false;
  else if(a[pos].equals(n))
     return true;
  else
     return search(a,n,pos+1);  // search the portion starting one place on
 }
Here, the work is done in the three-argument method, where search(a,n,pos) can be interpreted as "return true if the integer n is in the portion of the array a starting at position pos and going to the end, return false otherwise". If pos is equal to or greater than the length of the array, there is no portion of the array left to search, so the answer must be false. If the item indexed by pos in the array is the integer n, then we can say right away that it is in the portion of the array starting at position pos. Otherwise, it is in the portion of the array starting at position pos only if it is in the portion of the array starting at position pos+1. The files Find9.java and Find10.java can be used to run this code. You will find that if you run the code for a large array (for example, 30,000 elements) it will halt with a java.lang.StackOverflowError. The reason for this is that in practice Java places a limit on the depth of recursive calling. Since here, unlike binary search, each recursive call reduces the size of the portion of the array being searched by only one, the depth of recursion grows to equal the size of the array.
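
You can provoke the error deliberately with a fragment like the following, assuming the three-argument search method above is in scope. The figure of 30,000 is simply one that overflows the default stack on a typical machine; the exact limit will vary between systems:

 Integer[] big = new Integer[30000];
 for(int i=0; i<big.length; i++)
    big[i] = i;          // fill with 0 to 29999, so -1 is not present
 try
    {
     search(big,-1);     // the search recurses to the full depth of the array
    }
 catch(StackOverflowError e)
    {
     System.out.println("Recursion too deep");
    }

The default limit can be raised with the JVM's -Xss option, which sets the stack size for a thread, but that only postpones the problem rather than removing it.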

Efficiency of Sorting Algorithms

Consideration of algorithms to search for an item in an ordered array shows that the choice of algorithm can make a huge difference in efficiency. Sorting algorithms are more complex, but we can use a similar sort of analysis to show why some always seem to run faster than others. Let us consider selection sort first. If we are sorting an array of length N, then to find the lowest item we have to go through the whole array, making N-1 comparisons between the lowest item found so far and the next item in the array. The lowest item is then put in the first place, and we have to find the second lowest item from the rest of the array, which involves N-2 comparisons. We put that in the second place, make N-3 comparisons to find the third-placed item, and so on to the point where we are making just one comparison to determine which is the second highest item and which is the highest. So in all that's (N-1)+(N-2)+(N-3)+...+3+2+1 comparisons.

If you're already familiar with the mathematical idea of series, you will know this adds up to N*(N-1)/2, which is approximately N²/2 comparisons. If you're not, consider the following: adding the (N-1) and the 1 gives N, adding the (N-2) and the 2 gives N, and so on, so there are about N/2 such pairs, each adding to N, giving a total of about N²/2.
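
You can verify this count by running selection sort with a counter added. The sketch below is my own; the sorting itself follows the pattern from the earlier notes, and only the variable comparisons is new. Whatever the contents of the array, the value returned is always N*(N-1)/2 for an array of length N:

 public static long selectionSort(Integer[] a)
 {
  long comparisons = 0;
  for(int i=0; i<a.length-1; i++)
     {
      int min = i;                  // position of the lowest item found so far
      for(int j=i+1; j<a.length; j++)
         {
          comparisons++;
          if(a[j].compareTo(a[min])<0)
             min = j;
         }
      Integer temp = a[i];          // swap the lowest remaining item into place
      a[i] = a[min];
      a[min] = temp;
     }
  return comparisons;               // always (N-1)+(N-2)+...+1, whatever the data
 }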

Selection sort is particularly easy to analyse, as the number of comparisons is always the same no matter what the data. Insertion sort is a bit more complex to analyse, as the number of comparisons depends on the data. First there is one comparison to see if the second item should come before the first item. Then the third item must be compared with the second item, and only if it is smaller also compared with the first item. Then the fourth item must be compared with the third item, and if it is smaller with the second item, and if it is smaller than that with the first item. So if insertion sort is applied to an array which is already in order, each item will only have to be compared with the one before it, and there will be N-1 comparisons if there is a total of N items. If insertion sort is applied to an array which is in reverse order, each item will be compared with all the items before it, so there will be 1+2+3+...+(N-3)+(N-2)+(N-1) comparisons, which makes a total of about N²/2 comparisons as we showed above. If the data in the array is in random order, each item will be compared on average with half the items before it, making a total of about N²/4 comparisons. So with insertion sort, we have a best case, a worst case and an average case. It is important to note that the best case and worst case only apply when the data is already in some sort of order. If the data is in a random order, the efficiency is very unlikely to be anything but the average case.
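
Again this can be checked by adding a counter, as in the sketch below. Exactly what gets counted depends slightly on how the inner loop is coded, but run on sorted, reversed and shuffled arrays this version shows the three cases clearly:

 public static long insertionSort(Integer[] a)
 {
  long comparisons = 0;
  for(int i=1; i<a.length; i++)
     {
      Integer item = a[i];
      int j = i;
      while(j>0)
         {
          comparisons++;
          if(a[j-1].compareTo(item)<=0)
             break;            // item is now in place: the best case stops here
          a[j] = a[j-1];       // shift the larger item one place up
          j--;
         }
      a[j] = item;
     }
  return comparisons;  // N-1 if sorted, about N²/2 if reversed, about N²/4 on average
 }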

Now let us consider merge sort. This works by dividing the array into two equally sized pieces, sorting them, and merging the two sorted halves into one. If the data was originally in random order, merging the two halves will require nearly N comparisons, though the best case is if the data was originally ordered, in which case merging requires only N/2 comparisons. We also need to consider the number of comparisons required to sort each half. Each half is divided into two pieces of length N/4, and the two pieces are sorted and merged. The merging of two pieces of size N/4 to make an ordered piece of size N/2 requires up to N/2 comparisons, and there are two such mergers, so that's a total of up to N comparisons at this level too. Following on from this, you can see there are four mergers of N/8 pieces to make N/4 pieces, again a total of up to N comparisons, and so on. This goes down to the point where the pieces are of length 1, so no sorting or merging is required. That point is reached when the original array has been divided M times, where M is the number of times N can be halved. So in total there are up to M times N comparisons to sort an array of length N using merge sort.

As we noted before, 1,000,000 (one million) can be cut in half twenty times before it reaches 1. So to sort an array of a million items using merge sort, up to 20,000,000 (twenty million) comparisons will have to be made. Compare this with insertion sort, where we noted that with randomly ordered data an array of length N requires about N²/4 comparisons. For an array of length 1,000,000 that's 250,000,000,000 comparisons (two hundred and fifty thousand million, or two hundred and fifty billion). So you can see that merge sort is much more efficient than insertion sort.
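
The arithmetic is easy to check; a few lines like these print the two figures and their ratio:

 long n = 1000000;
 long mergeSort = 20*n;         // M times N, with M = 20 halvings of a million
 long insertionSort = n*n/4;    // about N²/4 comparisons for random data
 System.out.println(mergeSort);                // 20000000
 System.out.println(insertionSort);            // 250000000000
 System.out.println(insertionSort/mergeSort);  // 12500 times as many comparisons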

To work out the efficiency of quick sort we reason in a similar way to working out the efficiency of merge sort. With quick sort, the comparisons take place before the recursive calls, not after, but otherwise the reasoning is the same. There are N comparisons required to divide the original array of N elements into two pieces of size N/2. A similar process applies to each of these, so there are two lots of N/2 comparisons to divide the two N/2-sized pieces into four N/4-sized pieces, and so on. So long as each division is into approximately equal-sized pieces, as with merge sort there are a total of about M times N comparisons to sort an array of length N, where M is the number of times N can be halved to reach 1.

Quick sort, however, has a bad worst case. Suppose it is applied to an array which is already ordered, and, as we have done, the first item is taken as the pivot. Then the array will be divided into two very unequal parts: one empty (all those items less than the pivot), and one of length N-1 (all those items greater than the pivot). This division requires N-1 comparisons; then the part of length N-1 is similarly divided, requiring N-2 comparisons, and so on down to one comparison. So in this worst case, quick sort requires about N²/2 comparisons.
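
The sketch below shows quick sort instrumented with a comparison counter. The partitioning code is one of several possible ways of doing it, here taking the first item as the pivot as in the earlier notes. Run on an already sorted array it makes about N²/2 comparisons, while on a shuffled array it makes about M times N:

 public static long comparisons = 0;  // crude global counter, for illustration only

 public static void quickSort(Integer[] a,int from,int to)
 {
  if(to-from>1)
     {
      Integer pivot = a[from];
      int p = from;
      for(int i=from+1; i<to; i++)
         {
          comparisons++;
          if(a[i].compareTo(pivot)<0)  // move items less than the pivot below position p
             {
              a[p] = a[i];
              p++;
              a[i] = a[p];
             }
         }
      a[p] = pivot;                    // the pivot ends up between the two parts
      quickSort(a,from,p);
      quickSort(a,p+1,to);
     }
 }

The initial call is quickSort(a,0,a.length). Note that on sorted data the recursion is also N levels deep, so for a large sorted array this version runs into the same java.lang.StackOverflowError as the recursive linear search above.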

The sort algorithm which the Java library uses in its built-in methods is in fact merge sort, and this is in order to avoid execution becoming unexpectedly slow if the sort method is applied to an array which is already sorted.

How many times can we halve a number?

In the above section we said that M was the number of times we could halve N before we reached 1. If N is 8, then halving it once gives 4, halving it twice gives 2 and halving it three times gives 1. If N is 16, halving it four times gives 1. You can see that if N is 2ᴹ then M is the number of times N can be halved. Remember that 2ᴹ is 2*2*...*2*2 where M 2s are multiplied together.

If N is in between 2ᴹ⁻¹ and 2ᴹ, then repeated halving will eventually reach a point where an odd number has to be halved, and since we are dealing with integers only, that will be into two pieces with one piece one greater than the other. If we halve the larger piece each time this happens, then so long as N is greater than 2ᴹ⁻¹ and less than or equal to 2ᴹ, it can be halved M times.

In mathematical terms, M is "the logarithm of N to the base 2", which is written log₂N. In general, if aᵇ = c then the value b is described as "the logarithm to the base a of c", which is written logₐc.
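
Java itself can be used to calculate logarithms. Math.log gives the natural logarithm (to the base e, about 2.718), so the logarithm to the base 2 is obtained by dividing by the logarithm of 2:

 double log2 = Math.log(1000000)/Math.log(2);  // change of base: divide by the log of 2
 System.out.println(log2);                     // prints about 19.93, consistent with 20 halvings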

From this we say that binary search is "of order log N", meaning that for an array of N elements it takes around log N steps to execute. We say that linear search is "of order N", meaning it takes around N steps to execute. Selection sort and insertion sort are "of order N²" because they take around N² steps to execute. Merge sort and quick sort are "of order N log N", meaning N times the logarithm of N, because that is a measure of the number of steps they take to execute. Note that when we give these figures, we ignore constant multiplying factors. We don't need to give the base of the logarithm, since the logarithm to any base of a number is the logarithm to another base of the same number multiplied by a constant factor.

Often "order N" will be written O(N), "order log N" will be written O(log N) and so on. For this reason, the order of an algorithm is sometimes called its big-O notation. We could be much more precise about exactly how this sort of efficiency figure is calculated, but for the purpose of this course, this is as much as you need to know.

Note that if an algorithm is described as O(1) it means the time taken to execute does not depend on the amount or size of the data. In Java, accessing any element of an array given its index (that is, getting or setting a[i]) is O(1). Algorithms which make use of getting or setting components of an array rely on knowing that this can be done in a single step. We could not apply the analysis we have used above to similar algorithms on Lisp lists, for example, since to find the ith element of a Lisp list we have to go through i steps.
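
To illustrate the contrast, here is a minimal linked list cell of my own devising (a hypothetical class, not one from the course code). Getting its ith element means following i links, where getting a[i] from an array takes a single step:

 class Cell
 {
  int head;   // the item stored in this cell
  Cell tail;  // the rest of the list, null at the end
  Cell(int head,Cell tail) { this.head=head; this.tail=tail; }
 }

 // O(N): we must follow i links one at a time to reach position i
 static int get(Cell list,int i)
 {
  for(int k=0; k<i; k++)
     list = list.tail;
  return list.head;
 }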

In general, an O(1) algorithm is the most efficient, followed by O(log N), followed by O(N), followed by O(N log N), followed by O(N²). Algorithms of O(log N) are also known as "logarithmic". Algorithms of O(N) are also known as "linear". Algorithms of O(N²) are also known as "quadratic". You may also come across algorithms of O(N³), or "cubic", which are particularly inefficient. An even more inefficient class of algorithms is those of O(2ᴺ), known as "exponential". Recall that 2²⁰ is over a million, so an O(2ᴺ) algorithm would be taking around a million steps for just 20 data items.
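
A throwaway loop like the following prints some of these growth rates side by side, which makes the differences vivid (2ᴺ is left out because it overflows even a long once N passes 62):

 // columns: N, log N, N log N, N squared
 for(int n=2; n<=1024; n*=2)
    {
     double logN = Math.log(n)/Math.log(2);
     System.out.println(n+"\t"+Math.round(logN)+"\t"+Math.round(n*logN)+"\t"+(long)n*n);
    }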


Matthew Huntbach

Last modified: 24 February 2006