Wednesday 11 September 2013

Heap : A Data Structure for Efficient Programs


With exponential growth in computing capability of present day hardware as per Moore's law one can overlook the importance of efficient algorithms very easily. But efficient algorithm will always be critical as the complexity of programs running on today's efficient hardware has also increased with same pace if not more. 
Today people are dealing with huge social network graphs, or analysing twitter posts, or searching images, or solving any of the hundreds of problems in vogue. They would be wasting time without the fastest possible hardware. But if they use incorrect algorithms they would be sitting forever. Hence there is no substitute for an efficient algorithm.

 
And key to efficient algorithm is use of efficient data structures. Heap is one such data structure which is know for its efficiency. In this post I would attempt to explain basic concepts and operation applicable to a heap and more importantly discuss some important application of a heap for solving real examples. A VC++ workspace with source code is also provided containing a C++ implementation of a heap and sample use of a heap for Heap Sort, Priority Queue, a process scheduler and and algorithm to efficiently find top k element from a huge array.

Introduction to a Heap

Formal Definition
A heap is a binary tree T that stores a key-element pairs at its internal nodes
It satisfies two properties 
  • MinHeap: key(parent) <= key(child), [Or MaxHeap: key(parent) >= key(child)]
  • All levels are full, except the last one, which is left-filled
As mentioned above there can be two types of Heap- A Max Heap or Min Heap. The diagram shown at the start of this post is a Max Heap and its Heap Property mandates that key(parent) >= key(child) for all nodes which brings us to a very important property.
'The Root node is the the node with largest value in a Max Heap' Similarly 'The Root node is the node with lowest value in a MinHeap'

Several algorithm tend to exploit the above property of heap to solve their problem in an efficient way. A heap is usually implemented as a one dimensional array and the tree representation we often see for illustration is a logical mapping of an Array to a Binary Tree Data structure. 
For example if we take an Array A = [a,b,c,d,e,...] Then element at A[0] is the Root of a Heap and element at index '1' and '2' are child of the root. Similarly element at '3' and '4' are child of element at '1'. Going by this logic and assuming the array start with index '0', we define the following.

Node Index of Left Child   =  (2*parent_index + 1)
Node index of right Right   =  (2*parent_index + 2) 
Give a index of a child, Node index of Parent = floor of(child_index / 2)

We use the above properties to navigate Up and Down the Heap.

Heap Operations

A heap supports following operations:
1. Build Heap : Give an array perform necessary operation to satisfy the heap property.
2. Heapify: This is an internal function of a heap that given a node restores the Heap property recursively by navigating in downward direction.
3. Add Element: Adds an element to end of Heap Array do necessary alteration to restore the heap property.  
4. Peek Front : Returns the root(the largest element of a Max Heap) of Heap without removing it.
5. Pop-Front: Removes the the root from heap and perform necessary alteration to restore the heap property.

Pop Front and Add element methods are depicted in the following diagram.

C++ Implementation of a Heap

The following file (heap.h) contains a templatized implementation of a Heap supporting all the basic operation. The class support two constructor that can create, a Heap starting from a given vector or it can start with an empty heap and elements can be added one by one to it. The provided class can be used to create a MinHeap or MaxHeap  input boolean variable to the constructor. 
The element to be used in heap as Heap nodes should support overloaded '<' and '>' operations. 


Application of a Heap

Heapyfiy and Shift Up operation run with the time complexity of Log2(N). Several algorithm take advantage of this fact and implement their logic that runs with time complexity of  Log2(N) times cone constant. We will discuss four application of heap ie Heap Sort, Priority Queue, Algorithm for finding top K elements and a Process Scheduler.

Heap Sort: 

Heap data structure can be used to implementing heap sort which is an in place sorting technique and it runs with an average and worst case time complexity of NLog2(N). Though it is slightly slower wrt a quick sort but it performs better in its worst case scenario.  The Pesudo code for a Heap Sort is
             Heapsort(A) 
             {
              BuildHeap(A)
              for i <- length(A)-1 downto 1 {
                  exchange A[1] <-> A[i]
                  heapsize <- heapsize -1
                  Heapify(A, 0)
             }

            BuildHeap(A) 
            {
              heapsize <- length(A)-1
              for i <- floor( length/2 ) downto 0
                Heapify(A, i)
            }

The source code of Heap Sort is present in void heap_t<T>::sort_heap() function of (heap.h)

Priority Queue

Priority is nothing but a wrapper on top of heap and it supports some basic operation that are already implemented n a Heap. The operation of a priority queue are push(elem), pop(), top(), size() and isempty(). The priority queue exhibits one important property that the element with highest priority or value is always present at the front of the queue. Hence this is nothing but a Max heap. Elements can be added to the queue and extracted from front. The element with highest priority always comes first in the priority que.
The C++ source code implementation of a priority queue is present in the following source file(priorite_queue.h).

Finding Top 'k' elements from an Array of length ' n'

Today we see several instances of top 'k' searches such as top 5 popular posts, page ranking, top few comments etc etc. Top 'k' element in a given large array can be implemented in several ways such as sort the array first using quick sort and extract top 'k' elements from the array. The best possible algorithm using search technique will yield result in NLog2(N). But a min heap can be used to accomplish this task more efficiently. The time complexity for heap based algorithm is NLog2(K). And since 'k' is usually very smaller than 'N', the overall complexity yield major performance differences.
The algorithm for using a heap for max k element is:

    //Create an empty Heap
    heap_t<int> intMinHeap(false);
    for(unsigned int i=0;i<bigArray.size();i++)
    {
        if(intMinHeap.size() < k || bigArray[i]>intMinHeap.top())
        {
            if (intMinHeap.size() == k)
            {
                intMinHeap.pop_front();
            }
            intMinHeap.add_element(bigArray[i]);
        }
        //intMinHeap.print_heap();
    }
Complete source code is available in following file(max_n_element.h).

A Process Scheduler

A process schedules depicts the usage of a priority queue which is in turn implemented as a Heap. The problem statement is very simple. We want to build a scheduler that can be used to schedule activities in future at some desired time(represented as a clock). The scheduler will be ticked as time progresses. Now once scheduler reaches a time at which some process is scheduled, then scheduler extracts that process from the queue and execute it. At any time a process can be scheduled and an unknown number of process will be present at any time in the scheduled queue.
Scheduler can very well maintain a sorted que of process, but that wont be efficient. Hence a heaps suits the job the best wrt complexity and guaranteeing the functionality. The scheduler support APIs like insert(process), remove(process), tick(), reset(), get_clk() etc.
The source code of a process scheduler is present in files (process_scheduler.h) & (process_scheduler.cpp).

Conclusion

In this post we visited important properties of a Heap and depicted some most common usage of a heap with their corresponding source code implemented in C++. Heap is a critical data structure for writing efficient programs but most often it is neglected and ill understood by day to day programmers. I have tried to give example of some usage and hope it will help the readers in understanding the data structure better and adopting it their code.
The complete VC++ project is available in the following link(heapWKS.zip).

Take care..

No comments:

Post a Comment