Link Analysis in
Networks
- or Finding The Terrorists

Friday, 15 November 2013
About James
Mathematician turned
Computer Scientist

lives in London, UK

talks fast
works for Cisco
bad at blogging

Objectives
What is link analysis
history lesson
graph theory basics
network theory concepts
link analysis basics
link analysis in the wild
getting started with link analysis
What is Link Analysis? (1)
Which nodes are key or central to the network?
Which links can be severed or strengthened to
most effectively impede or enhance the operation
of the network?
Can the existence of undetected links or nodes be
inferred from the known data?
What types of structured groups of entities occur
in the data set?
What is Link Analysis? (2)
What are the relevant sub-networks within a
much larger network?
Are there similarities in the structure of subparts
of the network that can indicate an underlying
relationship (e.g., modus operandi)?
What data model and level of aggregation best
reveal certain types of links and sub-networks?
Organised Crime
vs
Terrorism

History

Guns, Drugs & Gangs
? - pirates, gangs, bandits and highway-robbers
4th-5th century AD - Goths and Vandals
1800s - Yakuza, Triad, Mafia, Mafiya
1920s+ - La Cosa Nostra, cartels, ethnocentric
gangs and syndicates, IRA
1970s - ETA
1990s - Al-Qaeda
2000s - Anonymous
Japan's three biggest banks
face yakuza links inquiry
Loans to mobsters scandal at
Mizuho prompts wider
investigation into Mitsubishi UFJ
and Sumitomo Mitsui groups

http://www.theguardian.com/world/2013/oct/30/japan-three-biggest-banks-yakuza-links-inquiry
0th Generation

1st Generation
The generally accepted first formalisation was in 1975,
with the Anacapa chart of Harper and Harris
2nd Generation
GUI software that essentially replicated the
manual and hand-drawn 1st generation tools,
notably:

• i2
• Netmap
• Crimeflow
Due to automated computation information could
be updated in real-time
Still often requires a domain expert
2nd Generation

3rd Generation
do not require domain experts for usage
aggregate sources - most data is digitised now
rich meta-data models
improved computational power and algorithms
billions of nodes and relationships

Deduction
vs.
Inference

Graph Theory
the basics

Defn 1: Undirected Graph
an undirected graph, G, is an ordered pair G(V, E)
where
V is a set of objects called vertices
E is a set of 2-element subsets of V called
edges
If E does not contain e(v1, v2) such that v1 = v2
then G is a simple graph
Example
V = { london, paris, amsterdam, madrid }
E = { {london, paris}, {paris, amsterdam},
{paris, madrid} }
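A minimal sketch of how the example above could be held in code, assuming Python and a `frozenset` for each edge so that {london, paris} and {paris, london} are the same edge. The helper name `degree` is illustrative and does not come from the talk.

```python
# Vertices and edges of the example graph; each edge is an unordered pair.
V = {"london", "paris", "amsterdam", "madrid"}
E = {
    frozenset({"london", "paris"}),
    frozenset({"paris", "amsterdam"}),
    frozenset({"paris", "madrid"}),
}

def degree(vertex, edges):
    """Number of edges that touch the given vertex."""
    return sum(1 for edge in edges if vertex in edge)

# Every edge must be a 2-element subset of V.
assert all(len(edge) == 2 and edge <= V for edge in E)

degrees = {v: degree(v, E) for v in V}
print(degrees["paris"])  # paris touches all three edges
```

Storing edges as sets rather than tuples makes the "unordered" part of the definition explicit.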

Defn 3: Labels
A label is some value, e.g. an integer, colour or
enumeration
An edge-labelled graph is one where some or all
of the edges have labels
A vertex-labelled graph is one where some or all
of the vertices have labels
A labelled graph may be edge-labelled, vertex-labelled, or both
Defn 2: Directed Graph
a directed graph, G, is an ordered pair G(V, E)
where
V is a set of objects called vertices
E is a set of ordered pairs of vertices of V
called edges
For a vertex v the in-degree is the number of
edges in E that end at v. The out-degree of v is
the number of edges that start at v
Example

Credit to scificat @ deviantART and Sheldon from
The Big Bang Theory
Example
V = { rock, scissors, paper, lizard, spock }
E = {
(rock, scissors), (rock, lizard),
(scissors, paper), (scissors, lizard),
(paper, rock), (paper, spock),
(lizard, paper), (lizard, spock),
(spock, rock), (spock, scissors)
}
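As a rough illustration of in-degree and out-degree, the directed example can be tallied in Python. This sketch assumes each edge is read as (winner, loser), so an edge starts at the winning gesture and ends at the losing one.

```python
from collections import Counter

# Directed edges (start, end) from the example, read as (winner, loser).
E = [
    ("rock", "scissors"), ("rock", "lizard"),
    ("scissors", "paper"), ("scissors", "lizard"),
    ("paper", "rock"), ("paper", "spock"),
    ("lizard", "paper"), ("lizard", "spock"),
    ("spock", "rock"), ("spock", "scissors"),
]

out_degree = Counter(start for start, _ in E)  # edges that start at v
in_degree = Counter(end for _, end in E)       # edges that end at v

for v in ["rock", "scissors", "paper", "lizard", "spock"]:
    print(v, in_degree[v], out_degree[v])
```

Every gesture beats two others and loses to two others, so each vertex has in-degree 2 and out-degree 2.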
Defn 3: Multigraph
a multigraph, G, is an ordered pair G(V, E) where
V is a set of objects called vertices
E is a multiset of 2-element subsets of V called
edges
if the elements of E are ordered pairs then G is a
directed multigraph

Defn 4: Subgraph
given a graph G(Vg, Eg), a graph H(Vh, Eh) is a subgraph
of G iff Vh ⊆ Vg and Eh ⊆ Eg
if Vh = Vg then H is a spanning subgraph of G

Defn 4: Walks
given a graph G(V, E) a walk W is a sequence of
edges from E s.t. for any adjacent elements
wi = (vr, vs), wi+1 = (vt, vw) we have vs = vt

If a walk begins & ends on the same vertex it is
closed, otherwise it is open
Defn 4: Cycle
A closed walk is called a cycle. A cycle must have
length greater than 0.

Defn 4: Cyclic & Acyclic
A graph G is said to be acyclic iff it has no
subgraph which is a cycle graph
Defn 4: Complete Graph
A graph G(V, E) with |V| = n is a complete graph Kn
if for every vertex vi there exists an edge (vi, vk)
in E for every k = 1..n with k ≠ i

Defn 4: Cliques
Given a graph G(Vg, Eg) and a subgraph H(Vh, Eh),
|Vh| = k, if H is a complete graph then H is a clique
of order k, or a k-clique
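A small sketch of the clique definition, assuming Python and undirected edges stored as `frozenset` pairs. The function name `is_clique` and the sample edge set are illustrative only.

```python
from itertools import combinations

def is_clique(vertices, edges):
    """True iff every pair of the given vertices is joined by an edge."""
    return all(frozenset(pair) in edges for pair in combinations(vertices, 2))

# Hypothetical undirected graph: a triangle a-b-c plus a pendant vertex d.
edges = {
    frozenset({"a", "b"}),
    frozenset({"b", "c"}),
    frozenset({"a", "c"}),
    frozenset({"c", "d"}),
}

print(is_clique({"a", "b", "c"}, edges))  # the triangle is a 3-clique
print(is_clique({"b", "c", "d"}, edges))  # b and d are not joined
```

Checking all pairs is quadratic in the size of the candidate set, which is fine for verifying a clique but not for searching for the largest one.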
Examples

Defn 5: Strongly Connected
A directed graph G is strongly connected iff for every
ordered pair of vertices (vi, vj) in G there exists a path
which starts at vi and ends at vj
Given a graph G and a subgraph H, if H is
maximally strongly connected we call H a strongly
connected component of G
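One way to test the definition, sketched in Python under the assumption that the graph is given as a list of directed (start, end) pairs: pick any vertex, check that it reaches every vertex, then check the same on the reversed edges. The function names are illustrative.

```python
def reachable(start, adjacency):
    """Set of vertices reachable from start by following directed edges."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def is_strongly_connected(vertices, edges):
    """True iff every vertex can reach every other vertex."""
    forward, backward = {}, {}
    for a, b in edges:
        forward.setdefault(a, []).append(b)
        backward.setdefault(b, []).append(a)
    root = next(iter(vertices))
    # root reaches everyone, and everyone reaches root (search on reversed edges)
    return reachable(root, forward) == set(vertices) == reachable(root, backward)

cycle = [("a", "b"), ("b", "c"), ("c", "a")]
chain = [("a", "b"), ("b", "c")]
print(is_strongly_connected({"a", "b", "c"}, cycle))
print(is_strongly_connected({"a", "b", "c"}, chain))
```

Finding all strongly connected components needs a fuller algorithm (for example Tarjan's or Kosaraju's); this sketch only answers the yes/no question for the whole graph.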
Network Theory
basic Concepts

Communities
A network is said to have community structure if
the nodes can be grouped into (potentially
overlapping) subgraphs such that each is densely
connected.
Methods for finding communities:
minimum-cut method
hierarchical clustering
Girvan-Newman algorithm
modularity maximisation
clique analysis
Small Worlds
A small-world network is a graph G(V, E) where
the average minimum path length between any
two vertices is L where
L ∝ log |V|
Small-worlds are typically comprised of cliques
and near-cliques
Random Graphs
Erdős and Rényi studied properties of random
graphs in 1959
A random graph G is a graph G(V, E) where each possible
edge (vi, vj) exists independently with probability p
=> the average degree k is approx. p * |V|
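A quick numerical check of the average-degree claim, sketched in Python. The function name `random_graph`, the seed and the values of n and p are arbitrary choices for illustration.

```python
import random

def random_graph(n, p, seed=0):
    """G(n, p): include each of the n*(n-1)/2 possible edges with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

n, p = 500, 0.02
edges = random_graph(n, p)
average_degree = 2 * len(edges) / n  # each edge contributes to two vertices
print(average_degree, p * (n - 1))   # observed vs. expected average degree
```

The exact expectation is p * (|V| - 1), which is indistinguishable from p * |V| for large graphs.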

if k < 1
small isolated clusters
small diameters
short average path lengths
if k = 1
one dominant cluster appears
diameter peaks
high average path lengths
if k > 1
approaches single strongly connected component
diameter decreases
average path lengths decrease
If the relationships between people in the real
world can be modelled by a random graph then,
because the average person knows more than 1
other (k >> 1), the majority of people are
connected by short paths

Alpha Model
Watts (1998) proposed the α-model of networks
The α-model corrects the following in the random
model:
Relationships generally aren’t random
Relationships are often “tit for tat”
Relationships usually form clusters

Beta Model
The α-model is a significantly better model of
real-world networks but it too has limitations
The primary limitation is that the chance of distant or
random connections is unrealistically low
Watts and Strogatz (1999) proposed the β-model
to correct this
For a range of values of β these networks exhibit
“small world” properties
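The slide does not spell out the construction, so the following is only a sketch of the commonly described procedure: start from a ring where each vertex is joined to its k nearest neighbours, then rewire each edge to a random endpoint with probability β. The function name `beta_model` and all parameter values are assumptions for illustration.

```python
import random

def beta_model(n, k, beta, seed=0):
    """Ring of n vertices, each joined to its k nearest neighbours (k even),
    then each edge is rewired to a random endpoint with probability beta."""
    rng = random.Random(seed)
    edges = set()
    for i in range(n):
        for step in range(1, k // 2 + 1):
            edges.add(frozenset({i, (i + step) % n}))
    for i in range(n):
        for step in range(1, k // 2 + 1):
            old = frozenset({i, (i + step) % n})
            if old in edges and rng.random() < beta:
                # Avoid self-loops and duplicate edges when choosing a new endpoint.
                candidates = [j for j in range(n)
                              if j != i and frozenset({i, j}) not in edges]
                if candidates:
                    edges.remove(old)
                    edges.add(frozenset({i, rng.choice(candidates)}))
    return edges

lattice = beta_model(20, 4, 0.0)  # beta = 0: regular ring lattice
rewired = beta_model(20, 4, 0.5)  # beta = 0.5: about half the edges rewired
print(len(lattice), len(rewired))
```

Rewiring keeps the number of edges fixed, so β changes only how ordered or random the wiring is.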
Scale-Free Networks
Discovered in 1965 but attracted little interest until 1999,
when it was realised how accurately they model many
real-world networks
Consider a random graph with the following
degree distribution depending on two values α and
β. Suppose there are y vertices of degree x
where x and y satisfy
log y = α - (β log x)
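Exponentiating both sides shows why this is called a power law. The step below assumes natural logarithms; with another base b the constant becomes b^α instead of e^α.

```latex
\log y = \alpha - \beta \log x
\;\Longrightarrow\;
y = e^{\alpha}\, x^{-\beta}
```

On log-log axes the relation is a straight line with slope -β and intercept α, so the number of vertices falls off polynomially, not exponentially, as the degree grows.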
Power Law Distribution

Random vs. Scale-Free
[Figure: a random graph compared with a scale-free graph]
Scale-Free Properties
Scale-free graphs are small-worlds
Vertices with much higher degree than
the average are common
Such vertices are called hubs
Primary hubs are supported by secondary,
tertiary, etc. hubs
Thus scale-free networks are fault-tolerant
Vertices tend to form communities with hubs
providing inter-community connection
Link Analysis
an introduction

Link analysis and network theory provide
techniques for analysing structure in a system of
interacting agents, represented as a network
The best-known examples are web search
engines:
HITS (Hyperlink-Induced Topic Search) - Ask.com
PageRank - Google
TrustRank - Yahoo!
Matrices
row vector (1 x n)
column vector (n x 1)
square matrix (2 x 2)
rectangular matrix (2 x 3)
Matrix Operations
addition: $(A + B)_{ij} = A_{ij} + B_{ij}$
scalar multiplication: $(cA)_{ij} = c\,A_{ij}$
transpose: $(A^{T})_{ij} = A_{ji}$
Matrix Multiplication
Given an n*m matrix P, and an m*o matrix Q then
PQ is the n*o matrix where each element PQij is
given by
$(PQ)_{ij} = \sum_{k=1}^{m} P_{ik} Q_{kj}$
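A small worked example of the definition, assuming Python with matrices as lists of rows. The function name `matmul` and the sample values are illustrative.

```python
def matmul(P, Q):
    """Multiply an n x m matrix P by an m x o matrix Q (lists of rows)."""
    n, m, o = len(P), len(Q), len(Q[0])
    assert all(len(row) == m for row in P), "inner dimensions must agree"
    return [[sum(P[i][k] * Q[k][j] for k in range(m)) for j in range(o)]
            for i in range(n)]

P = [[1, 2, 3],
     [4, 5, 6]]   # 2 x 3
Q = [[1, 0],
     [0, 1],
     [2, 2]]      # 3 x 2

PQ = matmul(P, Q)  # 2 x 2
QP = matmul(Q, P)  # 3 x 3
print(PQ)
print(QP)
```

Here PQ is 2 x 2 while QP is 3 x 3, which already shows that the order of multiplication matters.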

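Noncommutative: in general $PQ \neq QP$
Associative: $(PQ)R = P(QR)$
Distributive over matrix addition: $P(Q + R) = PQ + PR$
Scalar multiplication is associative over matrix
multiplication: $c(PQ) = (cP)Q = P(cQ)$
Transpose: $(PQ)^{T} = Q^{T} P^{T}$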
Eigenvectors
given a square n*n matrix A and a non-zero n-vector v, v is a (right) eigenvector of A iff
$Av = \lambda v$ for some scalar λ
we call λ an eigenvalue.
Eigenvalues & eigenvectors can be real or complex
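A tiny check of the definition in Python. The matrix, vector and eigenvalue are picked by hand for illustration.

```python
A = [[2, 1],
     [1, 2]]
v = [1, 1]
lam = 3

# Multiply A by v and compare with lam * v.
Av = [sum(A[i][k] * v[k] for k in range(2)) for i in range(2)]
print(Av == [lam * x for x in v])
```

The same matrix has a second eigenvector (1, -1) with eigenvalue 1, so an n*n matrix can have up to n distinct eigenvalues.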

Example

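Incidence Matrix
          Rock  Scissors  Paper  Lizard  Spock
Rock        0      0        1      0       1
Scissors    1      0        0      0       1
Paper       0      1        0      1       0
Lizard      1      1        0      0       0
Spock       0      0        1      1       0
An entry is 1 when the column's gesture beats the row's gesture,
i.e. when there is an edge from the column vertex to the row vertex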

Other Representations
Adjacency matrix
Incidence list
Adjacency list
Edge lists
Topological distance matrix

Centrality
Centrality is a measure of how luminous a given
vertex in a graph is
In an undirected graph centrality measures
consider all edges
In a directed graph centrality measures consider
only out-edges

For a graph G(V, E), and P the set of all paths in
G we define:
Degree-centrality

Closeness-centrality

Betweenness-centrality
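The formulas for these measures were images and are not in the transcript. As a sketch only, the code below uses common textbook forms, which may differ from the normalisation on the original slide: degree centrality as deg(v) / (n - 1), and closeness centrality as (n - 1) divided by the sum of shortest-path distances from v. Betweenness is omitted for brevity. The graph is assumed to be undirected and connected.

```python
from collections import deque

def shortest_distances(source, adjacency):
    """Breadth-first search: hop count from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def degree_centrality(v, adjacency):
    return len(adjacency[v]) / (len(adjacency) - 1)

def closeness_centrality(v, adjacency):
    dist = shortest_distances(v, adjacency)
    return (len(adjacency) - 1) / sum(dist.values())

# A small star-shaped graph of cities with paris at the centre.
adjacency = {
    "london": {"paris"},
    "paris": {"london", "amsterdam", "madrid"},
    "amsterdam": {"paris"},
    "madrid": {"paris"},
}
for city in adjacency:
    print(city, degree_centrality(city, adjacency), closeness_centrality(city, adjacency))
```

In this graph paris scores 1.0 on both measures because it is adjacent to every other vertex.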

Prestige
Prestige is a measure of how visible a given
vertex in a graph is. In undirected graphs we
consider all edges, but in directed graphs prestige
considers in-edges only
Similar to centrality metrics, we have degree prestige, proximity prestige and rank prestige
PageRank
“PageRank works by counting the number
and quality of links to a page to determine
a rough estimate of how important the
website is. The underlying assumption is
that more important websites are likely to
receive more links from other websites”
[Facts about Google and Competition]
Simple PageRank
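Given the web is a graph G(V, E) where |V| = n,
i.e. n pages, the PageRank of page vi is
$PR(v_i) = \sum_{v_j \to v_i} \frac{PR(v_j)}{\deg^{+}(v_j)}$
where the sum runs over the pages vj that link to vi
and deg+(vj) is the out-degree of vj,
and the initial rank of page vi is
$PR_0(v_i) = \frac{1}{n}$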
Given the adjacency matrix M of a graph
G(V, E), with Mij = 1 if page vj links to page vi,
we construct the hyperlink matrix H such that
$H_{ij} = \frac{M_{ij}}{\sum_{k=1}^{n} M_{kj}}$
Note that H is normalised, i.e. each column
sums to 1, and all entries are non-negative.
H is said to be a (column) stochastic matrix
The PageRank vector R of a graph G(V,E)
with hyperlink matrix H is given by
$R = HR$
That is, PageRank is the principal eigenvector
of H (with eigenvalue 1). We can iteratively calculate R using
the power method
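The power method can be sketched in a few lines of Python. This assumes a column-stochastic H, where entry H[i][j] is 1/outdegree(j) if page j links to page i, and a hypothetical three-page web in which every page has at least one out-link. The function names are illustrative.

```python
def hyperlink_matrix(n, links):
    """H[i][j] = 1/outdegree(j) if page j links to page i, else 0."""
    out_degree = [0] * n
    for source, _ in links:
        out_degree[source] += 1
    H = [[0.0] * n for _ in range(n)]
    for source, target in links:
        H[target][source] = 1.0 / out_degree[source]
    return H

def power_method(H, iterations=100):
    """Repeatedly apply R <- H R, starting from the uniform vector 1/n."""
    n = len(H)
    R = [1.0 / n] * n
    for _ in range(iterations):
        R = [sum(H[i][j] * R[j] for j in range(n)) for i in range(n)]
    return R

# Hypothetical three-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
links = [(0, 1), (0, 2), (1, 2), (2, 0)]
R = power_method(hyperlink_matrix(3, links))
print(R)  # approaches (0.4, 0.2, 0.4)
```

Solving R = HR by hand for this graph gives (0.4, 0.2, 0.4), and the iteration converges to it because the graph is strongly connected and not periodic.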

Example

Simple PageRank Issues
Dangling pages
Orphan pages
Cycles
Rank sinks
Sensitivity to initial PageRank vector
Real PageRank
use sparse matrix representations
up to 3 billion rows and columns
if probability of teleportation is > 0.15
PageRank converges in less than 100
iterations
may use alternative to “random surfer”
model
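The slide names teleportation but gives no formula. The sketch below assumes the commonly cited form R = (1 - t) H R + t/n, where t is the teleportation probability, and it spreads the rank of dangling pages evenly over all pages, which is one convention among several. Function and parameter names are illustrative, and dense Python lists stand in for the sparse representations a real system would use.

```python
def pagerank(n, links, teleport=0.15, iterations=100):
    """Power iteration with teleportation; links are (source, target) pairs."""
    out_links = [[] for _ in range(n)]
    for source, target in links:
        out_links[source].append(target)
    R = [1.0 / n] * n
    for _ in range(iterations):
        # Rank held by dangling pages (no out-links) is shared by every page.
        dangling = sum(R[j] for j in range(n) if not out_links[j])
        new_R = [teleport / n + (1 - teleport) * dangling / n] * n
        for j in range(n):
            for target in out_links[j]:
                new_R[target] += (1 - teleport) * R[j] / len(out_links[j])
        R = new_R
    return R

# Hypothetical four-page web; page 3 is a dangling page.
links = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3)]
R = pagerank(4, links)
print(R)
```

Teleportation guarantees every page keeps some rank, which addresses dangling pages, rank sinks and cycles from the simple model at the cost of one extra parameter.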
Link Analysis in
the Wild

How The NSA Works
The methods used by the NSA Prism project
include:
Blah blah blah blah blah
Blahblahblah and by the name of
Blah blah blah blah blah blah balh
and trained black-ops dolphins

Knowledge Mash-Ups
Multiple data sources
full text search
social graphs
telephone, email & browsing history
Representations may not be appropriate for
analysis
Data may need to be transformed and managed
using non-relational data structures
Important to remember that non-mathematician
analysts prefer to work with visual representations
TerroristRank
TerroristRank works by counting the number and
quality of links to a person to determine a rough
estimate of how important the person is. The
underlying assumption is that more important
terrorists are likely to receive more links from
other terrorists

How to Find a Terrorist
Given a graph of actors and their interactions
determine the communities
extract the subgraph of communities containing
the actors of interest
calculate the “terrorist rank” of the subgraph
actors with the highest ranks are “suspects”
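The steps above can be sketched end to end in Python. Two deliberate simplifications are swapped in and are not the talk's method: communities are approximated by connected components (a real analysis would use something like Girvan-Newman), and the "rank" is a plain count of incoming interactions, not a PageRank-style score. All names and the sample data are hypothetical.

```python
from collections import Counter

def communities(actors, interactions):
    """Stand-in for community detection: connected components, ignoring direction."""
    neighbours = {a: set() for a in actors}
    for a, b in interactions:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, groups = set(), []
    for actor in actors:
        if actor in seen:
            continue
        group, stack = {actor}, [actor]
        while stack:
            for nxt in neighbours[stack.pop()]:
                if nxt not in group:
                    group.add(nxt)
                    stack.append(nxt)
        seen |= group
        groups.append(group)
    return groups

def suspects(actors, interactions, of_interest, top=2):
    # Steps 1-2: keep only the communities that contain an actor of interest.
    keep = set()
    for g in communities(actors, interactions):
        if g & of_interest:
            keep |= g
    # Step 3: crude rank = number of incoming interactions inside the subgraph.
    rank = Counter({a: 0 for a in keep})
    for a, b in interactions:
        if a in keep and b in keep:
            rank[b] += 1
    # Step 4: highest-ranked actors first.
    return [a for a, _ in rank.most_common(top)]

actors = ["a", "b", "c", "d", "x", "y"]
interactions = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a"), ("x", "y")]
print(suspects(actors, interactions, {"a"}))
```

Restricting the ranking to the relevant communities first keeps the expensive step small, which is the pruning idea the talk returns to in its summary.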

Limitations
typically graph algorithms are non-linear in time
and/or space complexity
adding new nodes & edges can have a dramatic
impact
real world networks are often dynamic
metrics like rank must be constantly recalculated

Issues
Information overload
data mining may be incomplete or find false positive
relationships
intentional or subconscious human filtering of data
sources
malicious data (e.g. SEO of malware by link sites)
changes in allegiance (“when good goes bad”)
Deduction (Prosecution)
vs.
Inference (Prevention)
Getting Started
with Graph
Algorithms

MATLAB
a mathematical workbench
aimed at scientists & mathematicians not
developers
has graph algorithms “plugin” gaimc
(generally) slower than native code
has APIs for bi-directional integration with Java
good for learning the mathematics behind
algorithms
CERN Colt
http://acs.lbl.gov/software/colt/
high performance data structures and operations
collections (including primitive templated lists & maps)
matrices
linear algebra
mathematics & statistics
random sampling & number generation
documentation is a bit hit and miss
API not always natural to Java programmers
Jung 2.0
http://jung.sourceforge.net/
pure Java graph algorithm library
OO API
uses the CERN Colt library for matrix representations and
operations
uses in-memory storage
performance limitations are generally related to this
easy to extend
requires good understanding of algorithms and internals to
maximise performance
Neo4j
transactional property graph database (ACID-compliant)
property graphs are:
labelled directed multigraphs
both vertices and edges can have any number of key/value
properties associated with them.
numerous in-built algorithms
powerful flexible query language
allows implementation of algorithms
Community version limits total number of nodes
excellent Spring integration
Mark Needham's blog: http://www.markhneedham.com/blog/
Alternatively
Hadoop & map-reduce
Gremlin - a Groovy-based graph DSL
scala-graph - young & operator-overloading hell
R programming language

Summary

Objectives Review
What is link analysis
history lesson
graph theory basics
network theory basics
link analysis basics
link analysis in the wild
getting started with link analysis
Link analysis is just one tool in the box for extracting
information from graphs
much of the skill lies in:
filtering the raw graph to prevent information overload
pruning the graph to allow expensive algorithms to
compute an effective answer
work iteratively; start by extracting simple data from a
graph before trying, say, community analysis
understand enough of the underlying mathematics to
choose the right tool for the job
Thank You

Friday, 15 November 2013

More Related Content

PDF
Python networkx library quick start guide
PDF
Spectral clustering with motifs and higher-order structures
PDF
Networkx tutorial
PDF
SocialCom 2013
DOCX
Web Services-Enhanced Agile Modeling and Integrating Business Processes
DOCX
Link analysis .. Data Mining
PPTX
Data Mining: Text and web mining
PPTX
Data mining
Python networkx library quick start guide
Spectral clustering with motifs and higher-order structures
Networkx tutorial
SocialCom 2013
Web Services-Enhanced Agile Modeling and Integrating Business Processes
Link analysis .. Data Mining
Data Mining: Text and web mining
Data mining

Viewers also liked (8)

PPTX
Network properties
PPTX
Link Analysis and Link Building in a Penguin and Disavow World
PDF
LinkSUM: Using Link Analysis to Summarize Entity Data
PDF
Tutorial 7 (link analysis)
PPT
Sample - Extending IBM i2 Analysis with G2 Research
PDF
PDF
How to Combine SEO, Blogging, and Social Media For Results HubSpot
PPTX
Three Steps to Link Analysis Insight
Network properties
Link Analysis and Link Building in a Penguin and Disavow World
LinkSUM: Using Link Analysis to Summarize Entity Data
Tutorial 7 (link analysis)
Sample - Extending IBM i2 Analysis with G2 Research
How to Combine SEO, Blogging, and Social Media For Results HubSpot
Three Steps to Link Analysis Insight
Ad

Similar to Link Analysis in Networks - or - Finding the Terrorists (20)

PDF
Network analysis for computational biology
PDF
N0747577
PDF
Trace Complexity of Network Inference
PDF
Link prediction
PPT
Prim's Algorithm on minimum spanning tree
PDF
Graph Analyses with Python and NetworkX
PDF
Minicourse on Network Science
PDF
Topics of Complex Social Networks: Domination, Influence and Assortativity
PDF
Topics of Complex Social Networks: Domination, Influence and Assortativity
PPTX
Graph in data structures
PDF
Lausanne 2019 #4
PPTX
data structures and algorithms Unit 2
PDF
An Application of Gd-Metric Spaces and Metric Dimension of Graphs
PPTX
240401_JW_labseminar[LINE: Large-scale Information Network Embeddin].pptx
PPT
barrera.ppt
PPT
barrera.ppt
PPTX
Dagstuhl seminar talk on querying big graphs
PDF
Interactive Knowledge Discovery over Web of Data.
PDF
Short version of Dominating Sets, Multiple Egocentric Networks and Modularity...
Network analysis for computational biology
N0747577
Trace Complexity of Network Inference
Link prediction
Prim's Algorithm on minimum spanning tree
Graph Analyses with Python and NetworkX
Minicourse on Network Science
Topics of Complex Social Networks: Domination, Influence and Assortativity
Topics of Complex Social Networks: Domination, Influence and Assortativity
Graph in data structures
Lausanne 2019 #4
data structures and algorithms Unit 2
An Application of Gd-Metric Spaces and Metric Dimension of Graphs
240401_JW_labseminar[LINE: Large-scale Information Network Embeddin].pptx
barrera.ppt
barrera.ppt
Dagstuhl seminar talk on querying big graphs
Interactive Knowledge Discovery over Web of Data.
Short version of Dominating Sets, Multiple Egocentric Networks and Modularity...
Ad

Recently uploaded (20)

PPTX
cloud_computing_Infrastucture_as_cloud_p
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
A comparative analysis of optical character recognition models for extracting...
PPTX
SOPHOS-XG Firewall Administrator PPT.pptx
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Heart disease approach using modified random forest and particle swarm optimi...
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Univ-Connecticut-ChatGPT-Presentaion.pdf
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
gpt5_lecture_notes_comprehensive_20250812015547.pdf
PPTX
Spectroscopy.pptx food analysis technology
PDF
Encapsulation theory and applications.pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PPTX
TLE Review Electricity (Electricity).pptx
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
August Patch Tuesday
PDF
NewMind AI Weekly Chronicles - August'25-Week II
PDF
Getting Started with Data Integration: FME Form 101
cloud_computing_Infrastucture_as_cloud_p
Encapsulation_ Review paper, used for researhc scholars
A comparative analysis of optical character recognition models for extracting...
SOPHOS-XG Firewall Administrator PPT.pptx
Advanced methodologies resolving dimensionality complications for autism neur...
Heart disease approach using modified random forest and particle swarm optimi...
Digital-Transformation-Roadmap-for-Companies.pptx
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Univ-Connecticut-ChatGPT-Presentaion.pdf
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
gpt5_lecture_notes_comprehensive_20250812015547.pdf
Spectroscopy.pptx food analysis technology
Encapsulation theory and applications.pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
TLE Review Electricity (Electricity).pptx
Mobile App Security Testing_ A Comprehensive Guide.pdf
MIND Revenue Release Quarter 2 2025 Press Release
August Patch Tuesday
NewMind AI Weekly Chronicles - August'25-Week II
Getting Started with Data Integration: FME Form 101

Link Analysis in Networks - or - Finding the Terrorists

  • 1. Link Analysis in Networks - or Finding The Terrorists Friday, 15 November 2013
  • 2. About James Mathematician turned Computer Scientist lives in London, UK talks fast Works for cisco bad at blogging Friday, 15 November 2013
  • 3. Objectives What is link analysis history lesson graph theory basics network theory concepts link analysis basics link analysis in the wild getting started with link analysis Friday, 15 November 2013
  • 4. What is link Analysis (1) Which nodes are key or central to the network? Which links can be severed or strengthened to most effectively impede or enhance the operation of the network? Can the existence of undetected links or nodes be inferred from the known data? What types of structured groups of entities occur in the data set? Friday, 15 November 2013
  • 5. What is link Analysis (2) What are the relevant sub-networks within a much larger network? Are there similarities in the structure of subparts of the network that can indicate an underlying relationship (e.g., modus operandi)? What data model and level of aggregation best reveal certain types of links and sub-networks? Friday, 15 November 2013
  • 8. G uns, Drugs & Gangs ? - pirates, gangs, bandits and highway-robbers 4BC - Goths and Vandals 1800s - Yakuza, Triad, Mafia, Mafiya 1920s+ - La Cosa Nostra, cartels, ethnocentric gangs and syndicates, IRA 1970s - ETA 1990s - Al-Qeada 2000s - Anonymous Friday, 15 November 2013
  • 9. Japan's three biggest banks face yakuza links inquiry Loans to mobsters scandal at Mizuho prompts wider investigation into Mitsubishi UFJ and Sumitomo Mitsui groups http://www.theguardian.com/world/2013/oct/30/japan-three-biggestbanks-yakuza-links-inquiry Friday, 15 November 2013
  • 10. 0th Generation Friday, 15 November 2013
  • 11. 1st Generation Generally accepted first formalisation was in 1975 with the Anacpapa Chart of Harper and Harris Friday, 15 November 2013
  • 12. 2nd Generation GUI software that essentially replicated the manual and hand-drawn 1st generation tools, notably: • i2 • Netmap • Crimeflow Due to automated computation information could be updated in real-time Still often requires a domain expert Friday, 15 November 2013
  • 13. 2nd Generation Friday, 15 November 2013
  • 14. 3rd Generation do not require domain experts for usage aggregate sources - most data is digitised now rich meta-data models improved computational power and algorithms billions of nodes and relationships Friday, 15 November 2013
  • 17. Defn 1: Undirected Graph an undirected graph, G, is an ordered pair G(V, E) where V is a set of objects called vertices E is the set of 2-element subsets of V called edges If E does not contain e(v1, v2) such that v1 = v2 then G is a simple graph Friday, 15 November 2013
  • 18. Example V = { london, paris, amsterdam, madrid } E = { {london, paris}, {paris, amsterdam}, {paris, madrid} } Friday, 15 November 2013
  • 19. Defn 3: Labels A label is some value, e.g integer, colour, enumeration An edge-labelled graph is one where some or all of the edges have labels A vertex-labelled graph is one where some or all of the vertices have labels A labelled graph maybe edge-labelled, vertexlabelled, or both Friday, 15 November 2013
  • 20. Defn 2: Directed Graph a directed graph, G, is an ordered pair G(V, E) where V is a set of objects called vertices E is the set of ordered 2-element subsets of V called edges For a vertex v the in-degree is the number of edges in E that end at v. The out-degree of v is the number of edges that start ar v Friday, 15 November 2013
  • 21. Example Credit to scificat @ deviantart and Sheldon from the big bang theory Friday, 15 November 2013
  • 22. Example V = { rock, scissors, paper, lizard, spock } E={ {rock, scissors}, {rock, lizard}, {scissors, paper}, {scissors, lizard}, {paper, rock}, {paper, spock}, {lizard, paper}, {lizard, spock}, {spock, rock}, {spock, scissors} } Friday, 15 November 2013
  • 23. Defn 3: Multigraph a multigraph, G, is an ordered pair G(V, E) where V is a set of objects called vertices E is the multiset of 2-element subsets of V called edges if the elements of E are ordered pairs then G is a directed multigraph Friday, 15 November 2013
  • 25. Defn 4: Subgraph given a graph G(Vg, Eg) a graph H is a subgraph H(Vh, Eh) iff Vh < Vg and Eh < Eg if Vh = Vg then H is a spanning subgraph of G Friday, 15 November 2013
  • 26. Defn 4: Walks given a graph G(V, E) a walk W is a sequence of edges from E s.t. for any adjacent elements wi = (vr, vs), wi+1 = (vt, vw) then vs = vt If a walk begins & ends on the same vertex it is a closed, otherwise it is open Friday, 15 November 2013
  • 27. Defn 4: Cycle A closed walk is called a cycle. A cycle must have length greater than 0. Defn 4: Cyclic & Acyclic a graph g is said to be acyclic iff there is no subgraph which is a cycle graph Friday, 15 November 2013
  • 28. Defn 4: Complete Graph A graph G(V, E) with |V| = n is a complete graph Kn if for every vertex vi there exists an edge (vi, vk) in E for k = 1..n, and i ≠ k Defn 4: Cliques Given a graph G(Vg, Eg) and a subgraph H(Vh, Eh), |Vh| = k, if H is a complete graph then H is a clique of order k, or a k-clique Friday, 15 November 2013
  • 30. Defn 5: Strongly Connected A graph G is strongly connected iff for every pair of vertices {vi, vj} in G there exists a path which starts at vi and ends at vj Given a graph G and a subgraph H, if H is maximally strongly connected we call H a strongly connected component of G Friday, 15 November 2013
  • 32. Communities A network is said to have community structure if the nodes can be grouped into (potentially overlapping) subgraphs such that each is densely connected. Methods for finding communities: minimum-cut method hierarchial clustering Girvan-newman algorithm modularity maximisation clique analysis Friday, 15 November 2013
  • 33. Small Worlds A small-world network is a graph G(V, E) where the average minimum path length between any two vertices is L where L α log |V| Small-worlds are typically comprised of cliques and near-cliques Friday, 15 November 2013
  • 34. Random Graphs Erdős and Renyi studied properties of random graphs in 1959 A random graph G is a graph G(V, E) where the probability an edge (vi, vj) exists is given by p => the average degree k is approx. p * |V| Friday, 15 November 2013
  • 37. if k < 1 small isolated clusters small diameters short average path lengths if k = 1 one dominant cluster appears diameter peaks high average path lengths if k > 1 approaches single strongly connected component diameter decreases average path lengths decrease Friday, 15 November 2013
  • 38. If the relationships between people in the real world can be modelled by a random graph then because the average person knows more than 1 other (k >> 1) then the majority of people are connected by short paths Friday, 15 November 2013
  • 39. If the relationships between people in the real world can be modelled by a random graph then because the average person knows more than 1 other (k >> 1) then the majority of people are connected by short paths Friday, 15 November 2013
  • 40. Alpha Model Watt (1998) proposed the α-model of networks The α-model corrects the following in the random model: Relationships generally aren’t random Relationships are often “tit for tat” Relationships usually form clusters Friday, 15 November 2013
  • 42. Beta Model The α-model is a significantly better model of real world network but it too has limitations Primary limitation is that the chance of distant or random connections is unrealistically low Watts and Strogatz (1999) propsed the β-model to correct this For a range of value of β these networks exhibit “small world” properties Friday, 15 November 2013
  • 43. Scale-Free Networks Discovered in 1965 but little interest until 1999 when realised how accurately they modelled many real-world networks Consider a random graph with the following degree distribution depending on two values α and β. Suppose there are y vertices of degree x where x and y satisfy log y = α - (β log x) Friday, 15 November 2013
  • 44. Power Law Distribution Friday, 15 November 2013
  • 45. Random vs. Scale-Free Random graph Friday, 15 November 2013 Scale-free graph
  • 46. Scale-Free Properties Scale-free graphs are small-worlds The number of vertices with higher degree than the average is very common Such vertices are called hubs Primary hubs are supported by secondaries, tertiary, etc Thus scale-free networks are fault-tolerant Vertices tend to form communities with hubs providing inter-community connection Friday, 15 November 2013
  • 48. Link analysis and network theory provide techniques for analysing structure in a system of interacting agents, represented as a network Most well known examples are web search engines: HITS (Hypertext Induced Topic Search) - ask.com PageRank - Google TrustRank - Yahoo! Friday, 15 November 2013
  • 49. Matrices row vector (1 x n) square matrix (2 x 2) Friday, 15 November 2013 column vector (n x 1) rectangular matrix (2 x 3)
  • 51. Matrix Multiplication Given an n*m matrix P, and an m*o matrix Q then PQ is the n*o matrix where each element PQij is given by for example: Friday, 15 November 2013
  • 52. Noncommutative: Associative: Distributive over matrix addition: Scalar multiplication is associative over matrix multiplication: Transpose: Friday, 15 November 2013
  • 53. Eigenvectors given a square n*n matrix A and a non-zero nvector v, v is a (right) eigenvector of A iff we call λ an eigenvalue. Eigenvalues & eigenvectors can be real or complex Friday, 15 November 2013
  • 55. Incidence Matrix Rock Scissors Paper Lizard Spock Rock 0 0 1 0 1 Scissors 1 0 0 0 1 Paper 0 1 0 1 0 Lizard 1 1 0 0 0 Spock 0 0 1 1 0 Friday, 15 November 2013
  • 56. Other Representations Adjacency matrix Incidence list Adjacency list Edge lists Topological distance matrix Friday, 15 November 2013
  • 57. Centrality Centrality is a measure of how luminous a given vertex in a graph is In an undirected graph centrality measures consider all edges In a directed graph centrality measures consider only out-edges Friday, 15 November 2013
  • 58. For a graph G(V, E), with P the set of all paths in G, we define:
      - Degree centrality
      - Closeness centrality
      - Betweenness centrality
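The formulas from this slide are not present in the transcript, so the sketch below uses common textbook forms rather than the speaker's exact definitions: degree as the number of neighbours, closeness as (n - 1) divided by the sum of shortest-path distances, and unnormalised betweenness computed with Brandes' accumulation. It assumes a connected, undirected, unweighted graph stored as an adjacency dict; the five-vertex path graph is made up for illustration.

```python
from collections import deque


def degree_centrality(graph):
    return {v: len(nbrs) for v, nbrs in graph.items()}


def bfs_distances(graph, source):
    """Shortest-path lengths (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def closeness_centrality(graph):
    result = {}
    for v in graph:
        total = sum(bfs_distances(graph, v).values())
        result[v] = (len(graph) - 1) / total if total else 0.0
    return result


def betweenness_centrality(graph):
    """Number of shortest paths between other vertex pairs passing through each vertex."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        sigma = {v: 0 for v in graph}    # count of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in graph}   # predecessors on shortest paths
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):        # farthest vertices first
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # In an undirected graph every pair is visited from both endpoints.
    return {v: x / 2 for v, x in score.items()}


# Hypothetical path graph A - B - C - D - E.
path = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
degree = degree_centrality(path)
closeness = closeness_centrality(path)
betweenness = betweenness_centrality(path)
print(betweenness)  # {'A': 0.0, 'B': 3.0, 'C': 4.0, 'D': 3.0, 'E': 0.0}
```

On the path graph the middle vertex C scores highest on closeness and betweenness even though B, C and D all have the same degree, which is why the three measures are worth distinguishing.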
  • 59. Prestige
      - Prestige is a measure of how visible a given vertex in a graph is
      - In undirected graphs we consider all edges, but in directed graphs prestige considers in-edges only
      - As with the centrality metrics, we have degree prestige, proximity prestige and rank prestige
  • 60. PageRank “PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites” [Facts about Google and Competition]
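  • 61. Simple PageRank. Given that the web is a graph G(V, E) where |V| = n, i.e. n pages, the PageRank of page $v_i$ is $PR(v_i) = \sum_{v_j \in B(v_i)} \frac{PR(v_j)}{L(v_j)}$, where $B(v_i)$ is the set of pages that link to $v_i$ and $L(v_j)$ is the number of out-links of $v_j$, and the initial rank of page $v_i$ is $PR_0(v_i) = \frac{1}{n}$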
  • 62. Given the adjacency matrix M of a graph G(V, E) we construct the hyperlink matrix H such that $H_{ij} = 1 / L(v_j)$ if $v_j$ links to $v_i$ and $H_{ij} = 0$ otherwise, where $L(v_j)$ is the number of out-links of $v_j$. Note that H is normalised, i.e. each column sums to 1, and all entries are non-negative. H is said to be a stochastic matrix
  • 63. The PageRank vector R of a graph G(V, E) with hyperlink matrix H is given by $R = HR$. That is, PageRank is the principal eigenvector of H (with eigenvalue 1). We can calculate R iteratively using the power method
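The power method mentioned here can be sketched in a few lines: start from the uniform vector and multiply by H repeatedly until the vector stops changing. The sketch assumes a column-stochastic H in which column j spreads the rank of page j evenly over its out-links; the three-page graph and the function names are hypothetical, and the graph has no dangling pages, so every column sums to 1.

```python
def hyperlink_matrix(links, pages):
    """Build H with H[i][j] = 1 / outdegree(j) when page j links to page i."""
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    h = [[0.0] * n for _ in range(n)]
    for src, targets in links.items():
        for dst in targets:
            h[index[dst]][index[src]] = 1.0 / len(targets)
    return h


def power_method(h, iterations=200, tol=1e-12):
    """Iterate r <- H r from the uniform vector until successive vectors agree."""
    n = len(h)
    r = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(h[i][j] * r[j] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r


# Hypothetical web of three pages: A -> B, A -> C, B -> C, C -> A.
pages = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
H = hyperlink_matrix(links, pages)
rank = power_method(H)
print([round(x, 3) for x in rank])  # [0.4, 0.2, 0.4]
```

The result satisfies R = HR: page B receives half of A's rank, while A and C feed each other and end up equal.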
  • 68. Simple PageRank Issues
      - Dangling pages
      - Orphan pages
      - Cycles
      - Rank sinks
      - Sensitivity to initial PageRank vector
  • 69. Real PageRank
      - uses sparse matrix representations of up to 3 billion rows and columns
      - if the probability of teleportation is > 0.15, PageRank converges in fewer than 100 iterations
      - may use an alternative to the “random surfer” model
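A minimal sketch of the teleporting "random surfer" variant follows. It is the commonly described formulation, not necessarily what any production search engine runs: with probability `teleport` the surfer jumps to a uniformly random page, and the rank held by dangling pages is spread uniformly too. It works on an adjacency dict instead of a dense matrix, in the spirit of the sparse representation mentioned above. The function name, parameters and the three-page graph are assumptions for illustration.

```python
def pagerank(links, teleport=0.15, iterations=100, tol=1e-10):
    """PageRank with teleportation over a dict mapping page -> list of out-links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank sitting on pages without out-links is redistributed to everyone.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        nxt = {p: (teleport + (1 - teleport) * dangling) / n for p in pages}
        for src, targets in links.items():
            for dst in targets:
                nxt[dst] += (1 - teleport) * rank[src] / len(targets)
        change = max(abs(nxt[p] - rank[p]) for p in pages)
        rank = nxt
        if change < tol:
            break
    return rank


# Hypothetical chain A -> B -> C, where C is a dangling page.
chain = {"A": ["B"], "B": ["C"]}
ranks = pagerank(chain)
print(sorted(ranks, key=ranks.get, reverse=True))  # ['C', 'B', 'A']
```

Without teleportation this chain would drain all rank into C and then lose it, which is the dangling-page issue listed on slide 68.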
  • 70. Link Analysis in the Wild
  • 71. How The NSA Works. The methods used by the NSA Prism project include: Blah blah blah blah blah Blahblahblah and by the name of Blah blah blah blah blah blah blah and trained black-ops dolphins
  • 72. Knowledge Mash-Ups
      - Multiple data sources: full-text search; social graphs; telephone, email & browsing history
      - Representations may not be appropriate for analysis
      - Data may need to be transformed and managed using non-relational data structures
      - Important to remember that non-mathematician analysts prefer to work visually
  • 73. TerroristRank TerroristRank works by counting the number and quality of links to a person to determine a rough estimate of how important the person is. The underlying assumption is that more important terrorists are likely to receive more links from other terrorists
  • 74. How to Find a Terrorist. Given a graph of actors and their interactions:
      - determine the communities
      - extract the subgraph of communities containing the actors of interest
      - calculate the “terrorist rank” of the subgraph
      - actors with the highest ranks are “suspects”
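The steps above can be sketched end to end on a toy graph. Two loud assumptions: weakly connected components stand in for real community detection, which is normally far more sophisticated, and "terrorist rank" is taken to be PageRank with teleportation applied to people. All actor names, function names and interactions are invented for illustration.

```python
from collections import deque


def weak_component(edges, start):
    """Vertices reachable from start when edge direction is ignored."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in neighbours.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def rank_actors(edges, teleport=0.15, iterations=100):
    """PageRank-style rank over a list of directed (source, target) interactions."""
    actors = sorted({x for edge in edges for x in edge})
    n = len(actors)
    out = {a: [v for u, v in edges if u == a] for a in actors}
    rank = {a: 1.0 / n for a in actors}
    for _ in range(iterations):
        dangling = sum(rank[a] for a in actors if not out[a])
        nxt = {a: (teleport + (1 - teleport) * dangling) / n for a in actors}
        for a in actors:
            for v in out[a]:
                nxt[v] += (1 - teleport) * rank[a] / len(out[a])
        rank = nxt
    return rank


def suspects(edges, actor_of_interest, top=3):
    # Step 1: crude "community" = weak component of the actor of interest.
    community = weak_component(edges, actor_of_interest)
    # Step 2: keep only interactions inside that community.
    subgraph = [(u, v) for u, v in edges if u in community and v in community]
    # Steps 3 and 4: rank the subgraph and report the highest-ranked actors.
    rank = rank_actors(subgraph)
    return sorted(rank, key=rank.get, reverse=True)[:top]


# Hypothetical interactions: two separate groups, {a, b, c, d} and {e, f, g}.
interactions = [("a", "b"), ("c", "b"), ("d", "b"), ("b", "a"),
                ("e", "f"), ("f", "g"), ("g", "e")]
print(suspects(interactions, "a", top=2))  # ['b', 'a']
```

Restricting the ranking to the relevant subgraph is also a pruning step: the expensive iteration runs only over the community of interest, not the whole graph.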
  • 75. Limitations
      - graph algorithms are typically non-linear in time and/or space complexity
      - adding new nodes & edges can have a dramatic impact
      - real-world networks are often dynamic
      - metrics like rank must be constantly recalculated
  • 76. Issues
      - information overload
      - data mining may be incomplete or find false-positive relationships
      - intentional or subconscious human filtering of data sources
      - malicious data (e.g. SEO of malware by link sites)
      - changes in allegiance (“when good goes bad”)
  • 79. Matlab
      - a mathematical workbench aimed at scientists & mathematicians, not developers
      - has a graph algorithms “plugin”, gaimc
      - (generally) slower than native code
      - has APIs for bi-directional integration with Java
      - good for learning the mathematics behind algorithms
  • 80. Cern Colt http://acs.lbl.gov/software/colt/
      - high-performance data structures and operations: collections (including primitive templated lists & maps), matrices, linear algebra, mathematics & statistics, random sampling & number generation
      - documentation is a bit hit and miss
      - API is not always natural to Java programmers
  • 81. Jung 2.0 http://jung.sourceforge.net/
      - pure Java graph algorithm library
      - OO API
      - uses the Cern Colt library for matrix representations and operations
      - uses in-memory storage; performance limitations are generally related to this
      - easy to extend
      - requires a good understanding of algorithms and internals to maximise performance
  • 82. Neo4j
      - transactional property graph database (ACID compliant)
      - property graphs are labelled directed multigraphs in which both vertices and edges can have any number of key/value properties associated with them
      - numerous built-in algorithms
      - powerful, flexible query language allows implementation of algorithms
      - Community version limits total number of nodes
      - excellent Spring integration
      - Mark Needham http://www.markhneedham.com/blog/
  • 83. Alternatively
      - Hadoop & map-reduce
      - Gremlin - a Groovy-based graph DSL
      - scala-graph - young & operator-overloading hell
      - R programming language
  • 85. Objectives Review
      - What is link analysis
      - history lesson
      - graph theory basics
      - network theory basics
      - link analysis basics
      - link analysis in the wild
      - getting started with link analysis
  • 86. Link analysis is just one tool in the box for extracting information from graphs
      - much of the skill lies in filtering the raw graph to prevent information overload, and in pruning the graph to allow expensive algorithms to compute an effective answer
      - work iteratively: start by extracting simple data from a graph before trying, say, community analysis
      - understand enough of the underlying mathematics to choose the right tool for the job
  • 87. Thank You