
Graph data structures - Searching

Previously we looked at an introduction to graph data structures and designed a very basic Graph implementation. I also mentioned that the main thing we would likely want to do with a graph is to search/explore it.


There are two primary tools that we will use to explore graphs - these are basic computer science concepts, and should be familiar to anyone who has studied computer science at uni and come across graphs before.


Depth First Search (DFS) 

The concept behind DFS is that, given a starting point or root node, we search as deep as we can along one route before backtracking. That is, we select a neighbour of the root node and visit it, then select a neighbour of that node and visit it, continuing until we reach a node with no more edges we can follow, and then backtrack up the graph, considering alternate neighbours at each step.




DFS is pretty simple to implement and naturally uses a Stack data structure to keep track of the backtracking (the easiest way to do this is to solve it recursively, using the implicit call stack).

Below is a simple Java implementation of DFS using recursion to handle the backtracking. It is a minimal sketch, assuming a Graph class along the lines of the one from the introduction post, with a (hypothetical) getNeighbours() method:
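// A minimal sketch of recursive DFS - it assumes a Graph class like the one
// sketched in the introduction post, with a hypothetical getNeighbours() method
import java.util.HashSet;
import java.util.Set;

public class DepthFirstSearch {

    public Set<String> search(Graph graph, String rootNode) {
        Set<String> visited = new HashSet<>();
        visit(graph, rootNode, visited);
        return visited;
    }

    // The recursion provides the implicit stack that handles the backtracking
    private void visit(Graph graph, String node, Set<String> visited) {
        visited.add(node);
        for (String neighbour : graph.getNeighbours(node)) {
            if (!visited.contains(neighbour)) {
                visit(graph, neighbour, visited);
            }
        }
    }
}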







Breadth First Search (BFS)

This search takes the opposite approach of looking as wide as possible before moving down a level. For example, we visit all the immediate neighbours of the root, then all of their immediate neighbours (the root's second-degree neighbours), and so on, working through the graph level by level.



(BFS illustration from Wikipedia)



BFS is also fairly simple, and pretty close to DFS, but it uses a Queue (FIFO) structure rather than a stack, so that nodes are visited from the root down, level by level. Below is example code of BFS in Java - again a sketch against the same assumed Graph API:
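// A minimal BFS sketch against the same assumed Graph API - the Queue
// replaces the implicit stack used in DFS
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class BreadthFirstSearch {

    public Set<String> search(Graph graph, String rootNode) {
        Set<String> visited = new HashSet<>();
        Queue<String> toVisit = new ArrayDeque<>();  // FIFO queue drives the level-by-level visiting
        visited.add(rootNode);
        toVisit.add(rootNode);
        while (!toVisit.isEmpty()) {
            String node = toVisit.poll();
            for (String neighbour : graph.getNeighbours(node)) {
                if (visited.add(neighbour)) {  // add() returns false if already visited
                    toVisit.add(neighbour);
                }
            }
        }
        return visited;
    }
}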



Graph data structures - An introduction

With most problems it is largely pretty clear which data structure is going to be appropriate to use. If you just care about storing a list and iterating over it, then you probably want an Array based structure; if you particularly care about the elements being unique you could look at a Set; if you want to capture a key-value pair dictionary data structure then you can use a Map.

Graph data structures are no different - they are very applicable to a particular subset of problems, and once you are familiar with graphs and the common algorithms it becomes quite easy to quickly identify a problem as a graph problem.


The basics

A graph has two main elements:
  • Node - a given data point in the graph
  • Edge - a connection that joins any two Nodes. A graph can be "directed" or "un-directed" - this simply determines whether the Edge goes both ways or is purely one way. 

A graph is a data structure that stores a set of connected elements. The easiest way to understand the structure is with a real world example - the most famous is probably the social-network graph. If you think about your profile on any popular social network (Facebook/LinkedIn/etc), your profile is a Node in the graph, and each of your friendships/connections is an Edge to another Node in the graph.

A while ago, a Facebook intern created a visualisation of the Facebook social graph around the world - you can't really make out the individual Nodes/Edges, but you get the idea.


Facebook's graph is un-directed: when you become friends with another Node in the graph the relationship goes both ways - you are their friend and they are yours.

Twitter, however, is a directed graph - once you follow someone an Edge is created between your Node and theirs, but they don't automatically follow you as well, so the Edge has direction.


If you are a LinkedIn user, you may have noticed whilst browsing another user's profile a widget saying something like

"X of your connections can introduce you to someone who knows Y"


What LinkedIn is telling you is that the shortest path between your Node and Y's Node in the graph is 3 Edges (3 "hops" - following an Edge between nodes is often called a "hop"). To be able to do this, LinkedIn is searching the social graph to discover the shortest path (and the number of unique paths of that shortest distance) between you and this other user.


Graph representations

There are two primary graph representations:
  • Adjacency Matrix - This is a matrix/2-D array that captures the relationship between every pair of nodes. Every node is mapped against the X and Y axes, and the value in the intersecting cell determines whether there is an Edge between the two nodes. E.g. if we wanted to know if there was an edge between node "4" and node "13", we would look at matrix[4][13] - the value there would tell us. Normally a value of 1 represents an edge, but other values can be used (for example, in a weighted graph the values could represent the weights, or in a directed graph -ve/+ve values could represent direction). This representation is good for "dense" graphs.
  • Adjacency List - This representation keeps, for each Node, a List of the Nodes it is adjacent to (i.e. shares an Edge with). This is a simple representation and is more memory efficient for "sparse" graphs.

For the code samples here, I will focus on the Adjacency List representation.


Graph representation - Java

Below is a very rudimentary implementation of a Graph class in Java. It uses an Adjacency List representation and will be used in later examples I go through. It is a minimal sketch - the String node type and method names are assumptions made for these examples:
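// A rudimentary adjacency list Graph sketch - the String node type and the
// method names are assumptions made for these examples
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Graph {

    // Each node maps to the list of nodes it shares an edge with
    private final Map<String, List<String>> adjacencyList = new HashMap<>();

    public void addNode(String node) {
        adjacencyList.putIfAbsent(node, new ArrayList<String>());
    }

    // Un-directed: the edge is recorded in both directions
    public void addEdge(String from, String to) {
        addNode(from);
        addNode(to);
        adjacencyList.get(from).add(to);
        adjacencyList.get(to).add(from);
    }

    public List<String> getNeighbours(String node) {
        return adjacencyList.getOrDefault(node, new ArrayList<String>());
    }
}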


Below is a sample unit test setup that shows how a simple graph can be initialised (a sketch assuming JUnit 4):
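// A sample JUnit 4 setup sketch initialising the Graph class above
import static org.junit.Assert.assertTrue;

import org.junit.Before;
import org.junit.Test;

public class GraphTest {

    private Graph graph;

    @Before
    public void setup() {
        graph = new Graph();
        graph.addEdge("A", "B");
        graph.addEdge("A", "C");
        graph.addEdge("B", "D");
    }

    @Test
    public void shouldConnectNeighboursBothWays() {
        assertTrue(graph.getNeighbours("A").contains("B"));
        assertTrue(graph.getNeighbours("B").contains("A"));  // un-directed, so both ways
    }
}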



In the next post I will go through basic techniques for searching and exploring graph spaces, as well as a post looking at how to solve the LinkedIn shortest path recommendation problem.





Quick Sort - A Java implementation

And now the same for a Quick Sort I implemented. Normally Quick Sort also runs in O(n log n) time, but it's worth noting that the implementation below just uses the first element as the pivot value, which is not an optimal pivot (it performs very badly on partially sorted lists, for example), so I will leave it to you to think about better ways to choose the pivot value. The code below is a sketch of that approach:
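// A Quick Sort sketch that deliberately keeps the naive first-element pivot -
// on a partially sorted list this degrades towards O(n^2)
import java.util.ArrayList;
import java.util.List;

public class QuickSort {

    public List<Integer> sort(List<Integer> input) {
        if (input.size() <= 1) {
            return input;
        }
        Integer pivot = input.get(0);  // the naive pivot choice discussed above
        List<Integer> smaller = new ArrayList<>();
        List<Integer> larger = new ArrayList<>();
        for (Integer value : input.subList(1, input.size())) {
            if (value < pivot) {
                smaller.add(value);
            } else {
                larger.add(value);
            }
        }
        List<Integer> result = new ArrayList<>(sort(smaller));
        result.add(pivot);
        result.addAll(sort(larger));
        return result;
    }
}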



MergeSort - A Java implementation

I wrote a Java implementation of Merge Sort a little while ago, just for fun really. I was just about to close the file, noticing it was still open in my IDE, and thought I might as well post it here quickly. It might be interesting to someone alongside the Merge Sort analysis I previously wrote up. The sketch below captures the approach:
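// A top-down Merge Sort sketch - recursively halve the list, then merge the
// sorted halves back together
import java.util.ArrayList;
import java.util.List;

public class MergeSort {

    public List<Integer> sort(List<Integer> input) {
        if (input.size() <= 1) {
            return input;  // base case - a single element is already sorted
        }
        int mid = input.size() / 2;
        List<Integer> left = sort(new ArrayList<>(input.subList(0, mid)));
        List<Integer> right = sort(new ArrayList<>(input.subList(mid, input.size())));
        return merge(left, right);
    }

    private List<Integer> merge(List<Integer> left, List<Integer> right) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            if (left.get(i) <= right.get(j)) {
                result.add(left.get(i++));
            } else {
                result.add(right.get(j++));
            }
        }
        // One list will be exhausted first - append whatever remains of the other
        while (i < left.size()) { result.add(left.get(i++)); }
        while (j < right.size()) { result.add(right.get(j++)); }
        return result;
    }
}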



Tech cheat sheets - Maps

Also sometimes called an associative array, symbol table or dictionary - Maps are collections of key-value pairs and are probably one of the most common data structures you will come across day-to-day.

HashMap

The most commonly used Map implementation in Java is probably the HashMap. The HashMap makes use of the equals() and hashCode() methods on Java's Object API.

The basic premise is that the HashMap has a collection of "buckets", each of which can hold several objects. When an object is added to a HashMap, the hashCode() method on the key is used to select the bucket, and the object is then added to that bucket. For retrieval, it's the same process - hashCode() is used to determine the bucket, then the entries in the bucket are inspected and the equals() method is used to find the match. In Java's HashMap, the bucket is essentially implemented as a linked list (not a LinkedList - each Entry<K,V> has a pointer to the next entry).

The obvious implication of this is that performance depends largely on the design of good equals() and hashCode() methods. For example, you could design a hashCode() method that always returned the constant 1 - this is legal, as the Java contract says that if two Objects are equal() they must have the same hashCode(), but two Objects with the same hashCode() need not be equal() - however it would mean all entries end up in a single bucket.
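To make that concrete, below is a sketch of a hypothetical key class with a legal but terrible hashCode() - every instance lands in the same bucket, so lookups degrade to a linear scan of one long linked list:

// A hypothetical key class with a legal but awful hashCode()
public class BadKey {

    private final String name;

    public BadKey(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof BadKey && ((BadKey) other).name.equals(name);
    }

    @Override
    public int hashCode() {
        return 1;  // legal per the contract, but forces every entry into a single bucket
    }
}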


The hashCode() method

Designing a good hash code implementation is very important - for performance (see this StackOverflow discussion on the performance impact of large HashMaps with poor hashCode implementations), but also for correctness: if your hash code is erroneous then your HashMaps might just not work, and you may insert objects into your Map and never be able to retrieve them (if hashCode() doesn't return consistent values, for example, an object could be placed in one bucket, then generate a different hashCode() on retrieval, so the lookup checks a different bucket).

If you know the complete key set, and it fits into the Integer range (hashCode() returns an int), then you could design a perfect hashing algorithm that allows every unique key to have its own bucket, guaranteeing O(1) time for insert/retrieve. In practice this is quite unlikely, so ideally you want to design for as even a spread across buckets as possible.


Performance

Due to the dependency on the implementation of the objects used as keys, and on the data set, best case and worst case performance vary widely.

Search/insert/remove - all these operations suffer the same issue: in the best/average case they run in constant O(1) time, but in the worst case (all elements in one bucket) performance drops to linear O(n).

In practice, HashMaps are usually more efficient than search trees and other look ups, which is why they are very commonly used.

Tech cheat sheets - Stacks & queues

A Stack data structure is a Last-In-First-Out (LIFO) list. Java has a legacy Stack<T> class, but the recommended structure is the Deque (an interface with both array based and linked implementations).

A Queue data structure is simply the opposite, a First-In-First-Out (FIFO) structure. The current Java recommendation is also to use the Deque (normally pronounced "deck", if you were interested, and standing for Double-Ended Queue).

If you have read the discussion of ArrayList vs LinkedList, you will find many of the same considerations apply - but given the common use-patterns of stacks/queues, the different implementations make sense.


Deque

Java's Deque interface extends the Queue interface, and its implementations can be used as either a Queue or a Stack, offering methods appropriate for either use.
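As a quick sketch, the same ArrayDeque can serve both roles:

import java.util.ArrayDeque;
import java.util.Deque;

public class DequeExamples {

    public static void main(String[] args) {
        Deque<String> stack = new ArrayDeque<>();
        stack.push("a");                   // LIFO - push/pop work at the head
        stack.push("b");
        System.out.println(stack.pop());   // prints "b"

        Deque<String> queue = new ArrayDeque<>();
        queue.offer("a");                  // FIFO - offer at the tail, poll from the head
        queue.offer("b");
        System.out.println(queue.poll());  // prints "a"
    }
}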


ArrayDeque vs LinkedList

Similar to ArrayList, the Array based implementation is the most popular, and, by-and-large the most recommended implementation to use.

Based on what we already know about ArrayList and LinkedList, and what we know about Stack vs Queue behaviour, there would seem to be a natural use for each. For example, a LinkedList looks like a good option for a Stack/LIFO: we can easily push by adding new objects to the front of the list, and pop by removing from the head - both operations O(1) - compared to the cost of adding to the front of an ArrayList, which requires a lot of copying.

However, the ArrayDeque implementation is backed by a circular array - so adds/removes at either end need no copying and run in (amortized) constant O(1) time - while the linked implementation carries a slight performance overhead from allocating a "node" object for each element in the list.






Tech cheat sheets - Lists & arrays

In Java, arrays and lists are ordered collections of non-unique elements. They are probably one of the most common data structures you will have come across in your day-to-day programming.

Array vs List

In most cases, in Java you are more likely to use Lists over arrays - Lists provide more functionality as part of the API than the array does, so given the option, most people will use a list.

The two most common use cases for an array in Java are:
  1. An array of primitive types - Java generics only supports object references, although Java autoboxing reduces the need for this, as you can still insert int values into a List<Integer>, for example.
  2. Micro optimization in performance critical systems

In most other cases, people generally use Lists, as the List interface offers more convenient functionality, and also offers further control over the type of List:


ArrayList

ArrayList is a simple array based implementation of the List interface.

Performance

add( T item ) - Adding a single element to an ArrayList using this method just adds the element at the end of the list, which is very cheap - amortized O(1) (occasionally the backing array must grow and be copied, but this averages out to constant time).

add( T item, int index) - Adding a single element to an ArrayList at a specified position is less performant, as it needs to copy all elements to the right of the specified position, so this is more expensive and runs in linear time - O(n)

remove( T item ) - Similar to adding at a specified position, this is less performant  as it involves an array copy (plus, if we remove a specific Object rather than an item at a given position, it still potentially needs to access all items in the list). Again, linear time - O(n)

set/get( int index ) - This is very cheap in an ArrayList, as it is just backed by an array, so elements can be looked up in constant O(1) time.


LinkedList

LinkedList in Java is a doubly linked list implementation of the List interface (i.e. every element in the list stores a pointer to the previous & next elements in the list, and these pointers are used to access/traverse the list).

Performance

add( T item ) - Adding a single element to a LinkedList using this add method will also just add the element at the end of the list same as the ArrayList, which is very cheap - O(1)

add( T item, int index) - Again, adding at a given position (** using this method!) is more expensive. Insertion in a linked list is cheap, as the pointers just need to be adjusted, however you have to traverse the list to find the position, which puts you back to running in linear time - O(n).

remove( T item ) - Again, this method has the same performance/issues as the above add at position method, having to traverse the list to find the element to remove. So again, runs in linear time - O(n)

set/get( int index ) - As we need to traverse the list to find the position, this also potentially traverses every element in the list, so it runs in linear time - O(n).


** As you will note from the above summary, ArrayList is better for the get & set methods and equal for the add/remove methods. However, LinkedList does have the benefit of being able to use an Iterator to add/delete elements in constant O(1) time - e.g. if you are iterating a list and are already at the position you wish to insert/delete at, then it is very cheap - see the JavaDocs.
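A minimal sketch of that iterator-based editing (the values here are arbitrary):

import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

public class IteratorInsert {

    public static void main(String[] args) {
        List<Integer> list = new LinkedList<>(List.of(1, 2, 4));
        ListIterator<Integer> it = list.listIterator();
        while (it.hasNext()) {
            if (it.next() == 2) {
                it.add(3);  // O(1) insert at the iterator's current position
            }
        }
        System.out.println(list);  // [1, 2, 3, 4]
    }
}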



Conclusion


By and large, for most simple List cases (not considering wanting to use Sets, Queues, Stacks etc) the most common choice is ArrayList as it offers, generally, the best performance.

If you know you need to have a very large list, and know that you will always be inserting new elements towards the head of the list, then LinkedList may be a better alternative - if you only ever insert at the start of a list, LinkedList will perform in constant O(1) time, whereas ArrayList will perform in O(n) time.

There are further implementations of the List interface in Java, such as Stack and Vector. I will look at some of those in other posts.

Algorithm analysis & complexity: MergeSort - part 1

I have just signed up for the Coursera Algorithm Design & Analysis course from Stanford. I have been meaning to take it up for a while but kept missing when it was running. As it happens, this time I have all but missed it, with the final assessments due already, but I have signed up and am going through the material and lectures on my own.

I am only on week one so far, but it is all good - the lecturer seems good and articulates the content well (so far the presentation and sophistication of the course are better than the Scala & FP course and the Logic courses I have previously done on Coursera).

After I completed the Functional Programming & Scala course, I wrote up some notes on some of the techniques and problems in Groovy, and thought it might be fun to write up some Groovy notes from the algos course as well.

Week one covers an introduction to algorithm analysis, divide and conquer (with merge sort) and some other stuff.


Mergesort - Analysis

Mergesort has four key steps:
  1. Split the list into two halves
  2. Recursively sort the first half
  3. Recursively sort the second half
  4. Merge the two sorted halves
This is a well known example of the divide & conquer paradigm - the recursive sorting continues to recurse and divide the list into halves until sorting becomes trivial (i.e. the base case of one item per list).


Let's look at the merge step:

(pseudocode taken from the course)

result = output [length = n]
leftList = 1st sorted array [n/2]
rightList = 2nd sorted array [n/2]
i = 1
j = 1
for k = 1 to n
  if leftList(i) < rightList(j)
    result(k) = leftList(i)
    i++
  else [rightList(j) < leftList(i)]
    result(k) = rightList(j)
    j++
end



The above should be fairly straightforward to understand:
  • result stores our final merged list, and will be length n (the number of elements in the starting array)
  • leftList/rightList - these are the two inputs to the merge, and are simply the two halves of the list that need to be merged. This assumes that both lists are already sorted prior to merging
  • i/j - these are pointers to what we consider the "head" of each list. Rather than removing items from the lists, we simply keep a pointer per list to track how many of its items have been merged
  • We then iterate through all "n" items, comparing the "heads" of our two lists and adding the smaller value to our result list


Complexity analysis of merge step:

In the course, the lecturer states that the worst case running time = 4n + 2, which he then further simplifies to 6n (this simplification works because n must always be at least 1, so 4n + 2 will always be no worse than 6n).

Let's look at how we get to that figure (we will then later look at what the asymptotic run time is). This figure is simply the number of operations that need to be executed when this algorithm is run so let's count those:

result = output [length = n]
leftList = 1st sorted array [n/2]
rightList = 2nd sorted array [n/2]
i = 1
j = 1

In the above, the only operations actually executed are the assignments of i & j (the other lines just describe the function's input/output). So that costs us 2 operations (constant, regardless of how big "n" is).

for k = 1 to n
  if leftList(i) < rightList(j)
    result(k) = leftList(i)
    i++
  else [rightList(j) < leftList(i)]
    result(k) = rightList(j)
    j++
end



In the above, we have one operation checking/incrementing "k" in the for statement (executed every iteration, so this is done "n" times); we then have the IF conditional check, also done in every iteration (another "n" operations); then, whichever branch of the condition executes, there are always a further two operations (the assignment of the head value to result and the increment of the head pointer) - another two operations executed "n" times.

So the above block executes 4 operations for each of the "n" iterations. The first block executes 2 operations =

4n + 2

Seems straightforward enough. Really though, this is just a fairly simplified piece of pseudocode, and a bunch of additional code is needed to handle edge cases (e.g. what happens when you have added all of the left list to the result and only have the right list remaining? Obviously it can just be appended to the result, but the above code doesn't handle this). Then there are further questions about different language implementations that might need more operations (not to mention differences when it compiles down to machine code) - we will get more into this later when we talk about asymptotic performance. For the time being we will just say the run time (worst case) for merging two lists is 4n + 2.


Now, let's look at recursively sorting..

This part of the algorithm is also relatively simple: we just keep recursing through the list, splitting it into two halves until each half is just one element (one element is already sorted!), then we can merge those lists easily.

So, if we are recursing through and splitting the list into two halves, how many times will we have to perform the merge?

Let's step through this, with an example input list of 8 elements (n=8)
  1. One list of 8 elements
  2. Two lists of 4 elements
  3. Four lists of 2 elements
  4. Eight lists of 1 element
This is just a basic binary tree - see the illustration below, taken from Wikipedia:


As you can see in the above tree, it splits down into individual elements, then it re-combines them using the merge technique. To work out the complexity of the algorithm we need to know two things:
  1. How deep is our tree going to be?
  2. What is the runtime of a given depth in the tree (e.g. for a given depth "j" in the tree, what is the cost of re-merging all the lists at that level)?

The first question is relatively easy: as it is a binary tree, we know the depth is simply going to be log2(n) + 1. How do we know that? The log (base 2) of a number is simply how many times it needs to be divided by 2 before the number is <= 1 - which is exactly what we are doing when we recursively divide our lists in half until we reach the single element lists.

E.g. the log2(8) = 3 (8/2=4; 4/2=2; 2/2=1)



So now let's calculate the running time for any given level of the tree. Let's consider a list of length "n"; we want to know a few things:

How many merge operations do we need to perform for depth "j" of a list "n"?
In a similar way to how we calculated the depth, we can work out the number of merges (i.e. the number of leaves at that level of the tree). If we have halved the lists at each step 0..j, then the number of leaves at level j will be 2^j.

E.g. at the first level down, after just halving the original list, we have j=1, so we expect 2^1 = 2 leaves.
At the second level down we again halve each of those two lists, so j=2 and we expect 2^2 = 4 leaves.

Note that j cannot exceed the depth of the tree, which we already know.


What is the complexity of each of these merge operations?
We know the complexity of a merge from our earlier analysis of the merge step: it is 6m (where m is the size of the halved lists being merged at this level - not the original value "n" of the starting list), and we know there are 2^j of these merges, which means that at any level j of the tree, the running time is:

2^j * 6m

Now the value of "m" is itself dependent on the value of "j" - as we get further down the tree we have a greater number of merges to perform, but the lists being merged are also getting smaller (so the actual value of "m" is decreasing). So what is the value of m when we are at level "j" of the tree? We can work that out in a similar fashion - we know the original "n" (the size of the starting list) has been halved "j" times, so it is simply:

n / 2^j

So now we can combine those and resolve the algebra. The 2^j terms cancel each other out (intuitively so, as we are increasing the number of merges at exactly the same rate as we are reducing the size of the lists being merged):

2^j * 6 * (n / 2^j) = 6n




What is the overall complexity of the tree?
So we now already know the complexity of any given level of the tree (independent of the depth) and we know the depth of the tree (number of levels):

6n * (log2(n) + 1) = 6n*log2(n) + 6n
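As a quick sanity check with our n=8 example: 6*8 * (log2(8) + 1) = 48 * 4 = 192 operations in the worst case.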

So there we have it - we have calculated the worst case running time for the merge sort.


There are a few further things that we will need to consider in later posts: In one post I will have a quick look at asymptotic analysis (e.g. "Big-Oh" notation) and what this means; in another post I will look at a Groovy implementation of it and the analysis of that.