Showing posts with label programming. Show all posts

Graph data structures - An introduction

With most problems it is largely pretty clear which data structure is going to be appropriate to use - If you just care about storing a list and iterating over it, then you probably want an Array based structure - if you particularly care about the elements being unique you could look as a Set. If you want to capture a key-value pair dictionary data structure then you can use a Map.

Graph data structures are no different, and are very applicable to a subset of problems, and once familiar with graphs and the common algorithms it becomes quite easy to quickly identify the problem type as a graph problem.


The basics

A graph has two main elements:
  • Node - a given data point in the graph
  • Edge - a connection that joins any two Nodes. A graph can be "directed" or "un-directed" - this simply determines whether the Edge goes both ways or is purely one way. 

A graph is a data strucutre that stores a set of connected elements - the easiest way to understand the structure is with a real world example, the most famous is probably like the social-network graph. If you think about your profile on any popular social network (Facebook/LinkedIn/etc), your profile is a Node in the graph, and each of your friendship/connections is an Edge to another Node in the graph

A while ago, a Facebook intern created a visualisation of the Facebook social graph around the world - you can't really make out the individual Nodes/Edges, but you get the idea.


In Facebook's graph, it is un-directed, that is, when you become friends with another Node in the graph the relationship goes both ways - you are their friend and they are yours.

Twitter, however, is a directed graph - once you follow someone an Edge is created between your Node and theirs, but they don't automatically follow you as well, so the Edge has direction.


If you are a LinkedIn user, you may have noticed whilst browsing another user's profile a widget saying something like

"X of your connections can introduce you to someone who knows Y"


What LinkedIn is telling you, is that the shortest path between your Node and Y's Node in the graph is 3 Edges (3 "hops" - following an Edge between nodes is often called a "hop") - To be able to do this, LinkedIn is searching the social graph space to discover the shortest path (and number of unique paths that are of the shortest distance) between you and this other user.


Graph representations

There are two primary graph representations:
  • Adjacency Matrix - This is a matrix/2-D array that captures the relationship between every node. Every node is mapped against the X and Y axis, and the value in the intersecting cell determine if their is an Edge between the nodes. E.g. if we wanted to know if there was an edge between Node "4" and node "13" we would look at matrix[4][13] - the value there would tell us. Normally, value of 1 represents an edge, but other values can be used (for example, if it is a weighted graph the values could represent the weights, or if it is a directed graph it could use -ve/+ve values to represent direction).  This representation is good for "dense" graphs.
  • Adjacency List - This representation is simply a List of all Edges and a List of all Nodes. This is a simple representation and is more memory efficient for "sparse" graphs

For the code samples here, I will focus on the Adjacency List representation


Graph representation - Java

Below is a very rudimentary implementation of a Graph class in Java. It uses a Adjacency List representation and will be used in later examples I go through.


Below is a sample unit test setup that shows how a simple graph can be initialised:



In the next post I will go through basic techniques for searching and exploring graph spaces, as well as a post looking at how to solve the LinkedIn shortest path recommendation problem.





Traversing data structures - A Groovy Visitor Implementation

I have been using Groovy for a while now, having come from a solid Java/J2EE/Spring/ORM background where patterns and solid OO is a mainstay.  Although it took me a bit to get in to the swing of the Groovy stuff, I have now really taken to it (in no small part, encouraged having taken the Coursera FP class) - The simplicity and ease to use the built in functional stuff like .each{}, .findAll{}, .collect{} is really neat, and once you get started with them there really is no stopping!

As Groovy is dynamically typed (not weakly typed though!) you do end up using a lot less POJO classes and a lot more Maps, Lists of Maps, etc (also Groovy makes it really easy to work with these guys) and I find myself fairly often needing to traverse complex, dynamically typed data structures to do some kind of data processing.  Normally, in my Java OO background, when frequently processing tree type structures (a nested Map of Lists/Maps/Simple elements can really be thought of taking this form) I would fall back to the Visitor patter (if you aren't familiar with the GoF Visitor pattern, see here, here, etc for details), but if you have a dynamically typed complex data structures, and you don't really want to have each potential node in your structure to have to implement a set interface with a visit() method on it, then I use the following approach.

(disclaimer: yes, it feels as though it is a hacky, dodgy approach of the pattern - but it works well with Groovy and has been working well for me. If you have ideas on how to improve etc I would love to hear thoughts in the comments!  As such, I like to refer to it as "The Unwelcome Visitor Pattern").

First, I create an iterator class - this basically has the code to iterate through a nested, complex structure - I see this as boiler plate code that is a pain to re-write and will be used by all code that wants to traverse a complex data structure (Map with n-level deep nested Lists/Maps)



As you can see, its just a basic re-cursive piece of code to traverse List/Maps - you will note that it expects a visitor class to be passed in to it as an argument on first calling that has a visitMap() and visitList() methods.  In normal Java, this would need to be a class that implements a particular Interface/Abstract class that has implementations of the required methods. However, as Groovy is a little more dynamic we can do some pretty nice on the fly stuff (yes, I know, if you are performing some really common stuff, you may still want to have the traditional Java interface/explicit class approach as well, but that's not why we are here!).  The code below is an example of doing some on-the-fly processing of a dynamically typed complex data structure (in this case, we are just converting all Date objects to Strings, but this is just an example for funsies)



As you can see, in the above we are using Groovy's ability to create Interface implementations as Maps and just defining a simple closure for each of the visitMap() visitList() methods.


It may not be the most graceful solution, but it works simply and allows easy definition of closures that can process Maps/Lists easily (could also be used in the same way fo rtraversing JSON structures etc)