An excursion into Graph Analytics with Apache Spark GraphX

Krishna Sankar
https://www.linkedin.com/in/ksankar
June 1, 2016

1. Paco Nathan, Scala Days 2015, https://www.youtube.com/watch?v=P_V71n-gtDs
2. Big Data Analytics with Spark (Apress), http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656
3. Apache Spark Graph Processing (Packt), https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing
4. Spark GraphX in Action (Manning)
5. http://hortonworks.com/blog/introduction-to-data-science-with-apache-spark/
6. http://stanford.edu/~rezab/nips2014workshop/slides/ankur.pdf
7. Mining Massive Datasets book v2, http://infolab.stanford.edu/~ullman/mmds/ch10.pdf
- http://web.stanford.edu/class/cs246/handouts.html
8. http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
9. http://kukuruku.co/hub/algorithms/social-network-analysis-spark-graphx
10. Zeppelin Setup
- http://sparktutorials.net/setup-your-zeppelin-notebook-for-data-science-in-apache-spark
11. Data
- http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time
- http://openflights.org/data.html
12. https://adventuresincareerdevelopment.files.wordpress.com/2015/04/standing-on-the-shoulders-of-giants.png
Thanks to the Giants whose work helped to prepare this tutorial

Agenda (Topic, Time, Data, Description)
1. Graph Processing Is Introduced (10 Min)
- Description: Graph processing vs. graph DB, introduction to Spark GraphX
2. Basics Are Discussed & A Graph Is Built (10 Min)
- Data: Simple graph
- Description: Edge, Vertex, Graph
3. GraphX API Landscape Is Discussed (5 Min)
- Description: Explain APIs
4. Graph Structures Are Examined (5 Min)
- Description: Indegree, outdegree et al.
5. Community, Affiliation & Strengths Are Explored (5 Min)
- Description: Connected components, triangles et al.
6. Algorithms Are Developed (10 Min)
- Description: PartitionStrategy (what makes graph-parallel work), PageRank, Connected Components; APIs to implement algorithms, viz. aggregateMessages and Pregel; type signature of aggregateMessages; algorithms with aggregateMessages
7. Case Study (AlphaGo Tweets) Is Conducted (10 Min)
- Data: AlphaGo retweet data
- Description: E2E pipeline w/ real data; map attributes to properties, vertices & edges; create graph and run algorithms
8. Questions Are Asked & Answered (10 Min)
Agenda titles are styled after Asimov’s Robot series!

Graph Applications
§ Many applications, from social networks to understanding collaboration, diseases, routing
logistics and others
§ Came across 3 interesting applications, as I was preparing the materials !!
1) Project effectiveness by social network analysis on the projects' collaboration structures :
“EC is interested in the impact generated by those projects … analyzing this latent impact
of TEL projects by applying social network analysis on the projects' collaboration structures
… TEL projects funded under FP6, FP7 and eContentplus, and identifies central projects
and strong, sustained organizational collaboration ties.”
2) Weak ties in pathogen transmission : “structural motif … Giraffe social networks are
characterized by social cliques in which individuals associate more with members of their
own social clique than with those outside their clique … Individuals involved in weak,
between-clique social interactions are hypothesized to serve as bridges by which an
infection may enter a clique and, hence, may experience higher infection risk … individuals
who engaged in more between-clique associations, that is, those with multiple weak ties,
were more likely to be infected with gastrointestinal helminth parasites ...”
3) Panama papers : The leak presented them with a wealth of information, millions of
documents, but no guide to structure … (i.e. community detection)

Graph based systems
§ Graphs are everywhere – web pages, social networks, people’s web browsing behavior, logistics etc.
- Working w/ graphs has become more important, and harder due to the scale and complexity of algorithms
§ Two kinds – processing and querying
§ Processing – GraphX, Pregel BSP, GraphLab
§ Querying – AllegroGraph, Titan, Neo4j, RDF stores
- SPARQL query language
§ Graph DBs have queries, can process large datasets
§ Graph processing can run complex algorithms
§ For Graph Analytics systems both are required
- Process on GraphX, store in Neo4j is a good alternative
- E.g. Panama papers analytics

Graph Processing Frameworks
§ History – Graph based systems were specialized in nature
- Graph partitioning is hard
- Primitives of record based systems were different than what graph based systems needed
§ Rapid innovation in distributed data processing frameworks – MapReduce etc.
- Disk based, with limited partitioning abilities
- Still an impedance mismatch
§ Graph-parallel over data-parallel RDD ! – Best of both worlds ?
- We will see, later, GraphX’s PartitionStrategy

Enter Spark
§ More powerful partitioning mechanism
§ In-memory system makes iterative processing easier
§ GraphX – Graph processing built on top of Spark
§ Graph processing at scale (distributed system)
§ Fast evolving project

GraphX …
§ Graph processing at scale - “Graph Parallel System”
§ Has a rich
- Computational Model
- Edges, Vertices, Property Graph
- Bipartite & tripartite graphs (Users-Tags-Web pages)
- Algorithms (PageRank, …)
§ Current focus is on computation rather than query
§ APIs include:
- Attribute transformers
- Structure transformers
- Connection mining & algorithms
§ GraphFrames – interesting development
§ Exercises (lots of interesting graph data)
- Airline data, co-occurrence & co-citation from papers
- The AlphaGo Community – ReTweet network
- Wikipedia PageRank analysis using GraphX
GraphX paper : https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-gonzalez.pdf
Pregel paper : http://kowshik.github.io/JPregel/pregel_paper.pdf
Scala, Python support : https://issues.apache.org/jira/browse/SPARK-3789
Java API for GraphX : https://issues.apache.org/jira/browse/SPARK-3665
LDA : https://issues.apache.org/jira/browse/SPARK-1405

GraphX-Basics
§ Core types
- Vertex : Vertex(VertexId, VD); VD can be any object
- Edges : Edge(ED); ED can be any object
- Graph : Graph(VD, ED)
§ API groups
- Structural queries : inDegrees, vertices, …
- Attribute transformers : mapVertices, …
- Structure transformers, join : reverse, subgraph
- Connection mining : connectedComponents, triangles
- Algorithms : aggregateMessages, PageRank
§ Interesting objects
o EdgeTriplet
o EdgeContext
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
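
As a rough illustration of the API groups listed above, the sketch below calls one or two operations from each group on a tiny graph. The edge list and the use of a plain Int age as the vertex attribute are assumptions made for this example, not the notebook's actual code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuses a running context (spark-shell, Zeppelin) or starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graphx-basics").setMaster("local[*]"))

// Vertex attribute = age. ASSUMPTION: the edges are illustrative only.
val vertices = sc.parallelize(Seq(
  (1L, 18), (2L, 17), (3L, 15), (4L, 12), (5L, 15), (6L, 10), (7L, 854)))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(1L, 3L, 1), Edge(2L, 3L, 1), Edge(2L, 4L, 1),
  Edge(4L, 5L, 1), Edge(4L, 6L, 1), Edge(4L, 7L, 1), Edge(5L, 6L, 1),
  Edge(6L, 7L, 1)))
val graph: Graph[Int, Int] = Graph(vertices, edges)

// Structural queries (vertices without in-edges are absent from inDegrees)
val inDeg = graph.inDegrees.collect().toMap
val outDeg = graph.outDegrees.collect().toMap

// Attribute transformer: new vertex attribute, same structure
val nextYear: Graph[Int, Int] = graph.mapVertices((_, age) => age + 1)

// Structure transformers: flip edge direction, keep only some vertices
val reversed = graph.reverse
val plausible = graph.subgraph(vpred = (_, age) => age < 100)

// EdgeTriplet: an edge together with both endpoint attributes
val agePairs = graph.triplets.map(t => (t.srcAttr, t.dstAttr)).collect()
```

The subgraph call drops the vertex whose age fails the predicate along with every edge that touches it, which is why structure transformers are often the first cleaning step.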

GraphX-Basics
o Computational Model
o Directed multigraph
o Directed – so in-degree & out-degree
o Multigraph – so multiple parallel edges between nodes, including loops !
o Algorithms beware – the graph can contain cycles and loops
o Property Graph
o VertexId is a 64-bit Long integer
o A property is needed; it cannot be ignored
o Vertex and Edge are parameterized over object types
o Vertex[(VertexId, VD)], Edge[ED]
o Attach user-defined objects to edges and vertices (ED/VD)
Graphs and Social Networks - Prof. Jeffrey D. Ullman, Stanford University
http://web.stanford.edu/class/cs246/handouts.html
G-N Algorithm for betweenness
Good exercises : https://www.sics.se/~amir/files/download/dic/answers6.pdf

GraphX-Basics
§ For our hands-on we will run thru 2 Zeppelin notebooks
(https://goo.gl/qCwZiq & https://goo.gl/EKHCFq):
a. First we will work with the following Giraffe graph
o … carefully chosen with interesting betweenness and properties
o … for understanding the APIs
b. Second, we will develop a retweet pipeline
o With the AlphaGo twitter topic
o 2 GB/330K tweets, 200K edges,…
[Figure: the Giraffe graph with vertices A, B, C, D, E, F, G; edge labels 12, 5, 5, 4.5, 4.5, 4, 1.5, 1.5, 1]
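// Vertex property type; the field names are assumed from the constructor calls below
case class Person(name: String, age: Int)
// (VertexId, property) pairs; vertex ids must be Long
val vertexList = List(
  (1L, Person("Alice", 18)),
  (2L, Person("Bernie", 17)),
  (3L, Person("Cruz", 15)),
  (4L, Person("Donald", 12)),
  (5L, Person("Ed", 15)),
  (6L, Person("Fran", 10)),
  (7L, Person("Genghis", 854)))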
Fun facts about betweenness centrality : It shows how many shortest paths an
edge is part of, i.e. its relevance. High betweenness is the sign of a
bottleneck or single point of failure – these edges need HA, probably need
alternate paths for rerouting, are susceptible to parasite infections and
are good candidates for a cut !

Build A Graph
§ There are 4 ways
- GraphLoader.edgeListFile(…)
- From RDDs (shown above)
- fromEdgeTuples <- Id tuples
- fromEdges <- edgeList
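
The construction routes listed above can be sketched as follows. The vertex list comes from the earlier slide; the edge list is an assumption (it follows the A to G example graph in the Mining Massive Datasets chapter, with a placeholder Int attribute of 1), and the Person field names are hypothetical. GraphLoader.edgeListFile(sc, path) is only mentioned in a comment because it needs a file on disk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("build-graph").setMaster("local[*]"))

case class Person(name: String, age: Int)

val vertexList = List(
  (1L, Person("Alice", 18)), (2L, Person("Bernie", 17)),
  (3L, Person("Cruz", 15)), (4L, Person("Donald", 12)),
  (5L, Person("Ed", 15)), (6L, Person("Fran", 10)),
  (7L, Person("Genghis", 854)))

// ASSUMPTION: edges A-B, A-C, B-C, B-D, D-E, D-F, D-G, E-F, F-G,
// stored once each in low-id -> high-id direction.
val edgeList = List(
  Edge(1L, 2L, 1), Edge(1L, 3L, 1), Edge(2L, 3L, 1), Edge(2L, 4L, 1),
  Edge(4L, 5L, 1), Edge(4L, 6L, 1), Edge(4L, 7L, 1), Edge(5L, 6L, 1),
  Edge(6L, 7L, 1))

val vertexRDD: RDD[(VertexId, Person)] = sc.parallelize(vertexList)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeList)

// From RDDs: full property graph
val graph: Graph[Person, Int] = Graph(vertexRDD, edgeRDD)

// fromEdges: vertices are inferred and all get the default attribute
val fromEdges: Graph[Int, Int] = Graph.fromEdges(edgeRDD, defaultValue = 0)

// fromEdgeTuples: only (srcId, dstId) pairs; edge attribute is set to 1
val idTuples: RDD[(VertexId, VertexId)] = edgeRDD.map(e => (e.srcId, e.dstId))
val fromTuples: Graph[Int, Int] = Graph.fromEdgeTuples(idTuples, defaultValue = 0)

// GraphLoader.edgeListFile(sc, "<path>") reads "srcId dstId" lines from a file.
```

Graph(...) is the only route here that carries user-defined vertex properties; the other two are convenient when only the structure matters.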

Graph API Landscape
§ Separate algorithm from graph implementation
§ Implementation of analytic functions

Hint:
Many details are documented in the
companion object, not in the class, e.g.
PartitionStrategy, Graph, …

Community-Affiliation-Strengths
§ Applied in many ways
§ For example in Fraud & Security Applications
§ Triangle detection – for spam servers
§ The age of a community is related to the density of triangles
- A new community will have few triangles, then triangles start to form
§ Strong affiliation, i.e. heavy hitters : vertices with degree of at least sqrt(m), where m is the number of edges
- Heavy hitter triangle : a triangle whose three vertices are all heavy hitters
§ Connected communities – structure
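
A minimal sketch of these community measures, assuming the same hypothetical edge list used for the Giraffe graph plus one extra, made-up edge (8, 9) so that a second component exists. The partitionBy call is included because older GraphX releases expect a partitioned graph with canonical (srcId < dstId) edges before triangleCount.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("community").setMaster("local[*]"))

// ASSUMPTION: illustrative edges; (8, 9) is added to form a second component.
val edgeTuples = sc.parallelize(Seq(
  (1L, 2L), (1L, 3L), (2L, 3L), (2L, 4L), (4L, 5L),
  (4L, 6L), (4L, 7L), (5L, 6L), (6L, 7L), (8L, 9L)))
val graph = Graph.fromEdgeTuples(edgeTuples, defaultValue = 1)
  .partitionBy(PartitionStrategy.RandomVertexCut)

// Connected components: each vertex is labelled with the lowest id in its component
val components = graph.connectedComponents().vertices.collect().toMap

// Triangles passing through each vertex
val triangles = graph.triangleCount().vertices.collect().toMap

// Heavy hitters: degree >= sqrt(m), m = number of edges
val m = graph.numEdges
val heavyHitters = graph.degrees
  .filter { case (_, degree) => degree >= math.sqrt(m.toDouble) }
  .keys.collect().toSet
```

With ten edges the heavy-hitter threshold is a little above 3, so only a vertex of degree 4 or more qualifies in this toy graph.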

Algorithms
§ Graph-Parallel Computation
- aggregateMessages() function
- Pregel() (https://issues.apache.org/jira/browse/SPARK-5062)
§ pageRank()
- Influential papers in a citation network
- Influencer in retweet
§ staticPageRank()
- Runs a static number of iterations (numIter), whereas pageRank() runs until a dynamic tolerance (tol) is met – see the parameters (tol vs. numIter)
§ personalizedPageRank()
- Personalized PageRank is a variation on PageRank that gives a rank
relative to a specified “source” vertex in the graph – “People You May
Know”
§ shortestPaths, SVD++
§ LabelPropagation (LPA) as described by Raghavan et al.
in their 2007 paper “Near linear time algorithm to detect
community structures in large-scale networks”
- Computationally inexpensive way to identify communities
- Convergence not guaranteed
- Might end up with a trivial solution, i.e. a single community
§ SVDPlusPlus takes an RDD of Edges
§ The Global Clustering Coefficient is better in that it
always returns a number between 0 and 1
- For comparing connectedness between different sized communities
http://graphframes.github.io/user-guide.html
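
To make the tol vs. numIter distinction concrete, here is a small sketch of the three PageRank entry points on GraphOps. The edge list is the same hypothetical one used for the Giraffe graph, and the parameter values (0.001, 10, source vertex 4) are arbitrary choices for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pagerank").setMaster("local[*]"))

// ASSUMPTION: illustrative directed edges.
val edgeTuples = sc.parallelize(Seq(
  (1L, 2L), (1L, 3L), (2L, 3L), (2L, 4L), (4L, 5L),
  (4L, 6L), (4L, 7L), (5L, 6L), (6L, 7L)))
val graph = Graph.fromEdgeTuples(edgeTuples, defaultValue = 1)

// Dynamic: iterate until the rank change per vertex is below tol
val dynamicRanks = graph.pageRank(tol = 0.001).vertices.collect().toMap

// Static: run exactly numIter iterations
val staticRanks = graph.staticPageRank(numIter = 10).vertices.collect().toMap

// Personalized: ranks relative to a single source vertex
val personalRanks =
  graph.personalizedPageRank(src = 4L, tol = 0.001).vertices.collect().toMap

// Print vertices by descending static rank
staticRanks.toSeq.sortBy(-_._2).foreach { case (id, rank) =>
  println(f"$id%d -> $rank%.3f")
}
```

All three return a Graph[Double, Double] whose vertex attribute is the rank, so the results can be joined back to the original vertex properties in the same way.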

§ Versatile function, useful for implementing PageRank et al.
- Can be difficult at first, but easier to comprehend if treated
as a combined map-reduce function (it was called
mapReduceTriplets, with a slightly different signature)
o aggregateMessages[Msg](
o   sendMsg: EdgeContext[VD, ED, Msg] => Unit,  <- the map step; can send up or down the edge, i.e. sendToDst or sendToSrc
o   mergeMsg: (Msg, Msg) => Msg)  <- the reduce step
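
The map/reduce reading of aggregateMessages can be sketched with a small example: for every vertex, find the age of its oldest in-neighbour. The ages come from the vertex list shown earlier; the edges are an assumed, illustrative edge list, and the question being asked is a made-up one for demonstration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("aggregate-messages").setMaster("local[*]"))

// Vertex attribute = age. ASSUMPTION: illustrative edges.
val vertices = sc.parallelize(Seq(
  (1L, 18), (2L, 17), (3L, 15), (4L, 12), (5L, 15), (6L, 10), (7L, 854)))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(1L, 3L, 1), Edge(2L, 3L, 1), Edge(2L, 4L, 1),
  Edge(4L, 5L, 1), Edge(4L, 6L, 1), Edge(4L, 7L, 1), Edge(5L, 6L, 1),
  Edge(6L, 7L, 1)))
val graph: Graph[Int, Int] = Graph(vertices, edges)

val oldestInNeighbour: VertexRDD[Int] = graph.aggregateMessages[Int](
  // "map": runs once per edge; send the source's age down to the destination
  ctx => ctx.sendToDst(ctx.srcAttr),
  // "reduce": combine all messages arriving at the same vertex
  (a, b) => math.max(a, b),
  // only the source attribute is read, so GraphX need not ship the rest
  TripletFields.Src)

val result = oldestInNeighbour.collect().toMap
```

Vertices that receive no message (here, the one with no incoming edge) simply do not appear in the result, so a leftOuterJoin against graph.vertices is the usual way to bring them back with a default.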

If you really want to know what is
underneath the aggregateMessages()…

Pregel BSP API
§ Bulk Synchronous Parallel message-passing API
- Developed by Leslie Valiant [1]
- Synchronized distributed super steps
- Super step – local, in-memory computations
§ Computations on the vertex – “Think like a Vertex”
§ graphx.lib.LabelPropagation algorithm uses Pregel internally
§ graphx.lib.PageRank uses Pregel (for the runUntilConvergenceWithOptions version, while
it uses aggregateMessages for the runWithOptions (staticPageRank) version)
§ Pregel is implemented succinctly with the old mapReduceTriplets API
[1] http://delivery.acm.org/10.1145/80000/79181/p103-valiant.pdf
http://www.slideshare.net/riyadparvez/pregel-35504069
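
As one possible illustration of the "think like a vertex" style, the sketch below computes single-source shortest paths with the pregel operator, following the pattern in the GraphX programming guide. The edge list, the unit edge weights and the choice of vertex 1 as the source are assumptions made for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pregel-sssp").setMaster("local[*]"))

// ASSUMPTION: illustrative directed edges, each with weight 1.0.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1.0), Edge(1L, 3L, 1.0), Edge(2L, 3L, 1.0), Edge(2L, 4L, 1.0),
  Edge(4L, 5L, 1.0), Edge(4L, 6L, 1.0), Edge(4L, 7L, 1.0), Edge(5L, 6L, 1.0),
  Edge(6L, 7L, 1.0)))
val sourceId: VertexId = 1L

// Distance 0 at the source, infinity everywhere else
val initialGraph: Graph[Double, Double] = Graph.fromEdges(edges, 0.0)
  .mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  // vertex program: keep the shorter of the current and the incoming distance
  (_, dist, newDist) => math.min(dist, newDist),
  // send message: only along edges that would improve the destination
  triplet =>
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  // merge messages arriving at the same vertex within a super step
  (a, b) => math.min(a, b))

val distances = sssp.vertices.collect().toMap
```

Each loop of the operator is one super step: vertices update locally, messages are exchanged, and the computation stops when no vertex receives a message.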

Graphx Partitioning & Processing
§ Unlike a relational table, graph processing is contextual w.r.t. a neighborhood
- Maintain locality, equal size partitioning
§ Edge-cut : The communication & storage overhead of an
edge-cut is directly proportional to the number of edges
that are cut
§ Vertex-cut : The communication and storage overhead of a
vertex-cut is directly proportional to the sum of the
number of machines spanned by each vertex
§ Vertex-cut strategy by default (balance hot-spot issue due to
power law/Zipf’s Law) w/ min replication of edges
§ Batch processing/not streaming
§ org.apache.spark.graphx.Graph
- partitionBy(partitionStrategy: PartitionStrategy, numPartitions: Int)
- partitionBy(partitionStrategy: PartitionStrategy)
- 4 PartitionStrategies
1) RandomVertexCut (usually the best) : random vertex cut that colocates all same-
direction edges between two vertices (hashing the source and destination vertex IDs)
2) CanonicalRandomVertexCut : a random vertex cut that colocates all edges between
two vertices, regardless of direction (hashing the source and destination vertex IDs in
a canonical direction) … remember GraphX is a multigraph
3) EdgePartition1D : assigns edges to partitions using only the source vertex ID,
colocating edges with the same source
4) EdgePartition2D : 2D partitioning of the sparse edge adjacency matrix
http://www.istc-cc.cmu.edu/publications/papers/2013/grades-graphx_with_fonts.pdf
https://www.sics.se/~amir/files/download/papers/jabeja-vc.pdf
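
The two partitionBy overloads and the strategy objects can be tried out as below. This is only a sketch: the edge list is hypothetical, the partition count of 4 is arbitrary, and on a graph this small the strategies make no performance difference.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("partitioning").setMaster("local[*]"))

// ASSUMPTION: illustrative edges.
val edgeTuples = sc.parallelize(Seq(
  (1L, 2L), (1L, 3L), (2L, 3L), (2L, 4L), (4L, 5L),
  (4L, 6L), (4L, 7L), (5L, 6L), (6L, 7L)))
val graph = Graph.fromEdgeTuples(edgeTuples, defaultValue = 1)

// Keep the current number of edge partitions
val byRandom = graph.partitionBy(PartitionStrategy.RandomVertexCut)

// Repartition the edges into an explicit number of partitions
val by2D = graph.partitionBy(PartitionStrategy.EdgePartition2D, 4)

// How many edges landed in each partition
val edgesPerPartition = by2D.edges.mapPartitions(it => Iterator(it.size)).collect()

// A strategy is a function (src, dst, numParts) => partition id.
// EdgePartition1D looks only at the source, so edges 4->5 and 4->6 colocate.
val p45 = PartitionStrategy.EdgePartition1D.getPartition(4L, 5L, 4)
val p46 = PartitionStrategy.EdgePartition1D.getPartition(4L, 6L, 4)
```

Partitioning changes only where edges live, never the graph itself, so algorithm results are identical before and after partitionBy.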

AlphaGo Tweets Analytics Pipeline
Collect Data -> Transform & Extract Features -> GraphX Model -> GraphX Algorithms -> Store
§ Initially primed data (7 days from Twitter)
§ Then used the sinceID to get incremental tweets
§ Use application authentication for higher rate
§ Used tweepy, wait_on_rate_limit=True, wait_on_rate_limit_notify=True
§ ~330K tweets (see Download program), 2 GB
§ => MongoDB (~820 MB w/ compression)
§ Rich data (see MongoDB)
- Retweet, user ids, hash tags, locations et al.
§ This exercise only covers the retweet interest network

2-Pipeline screen shots
§ Get max tweet id
§ db.alphago.find().sort({id:-1}).limit(1).pretty()
- "id" : NumberLong("709550216467185664"),
§ db.alphago.find().sort({id:+1}).limit(1).pretty()
- "id" : NumberLong("709211132498714627")
§ /usr/local/mongo/bin/mongoimport --db admin --collection alphago --drop --file tweets-20160313.txt
- 232296
- Min : "id" : NumberLong("705845567537221632")
- Max : "id" : NumberLong("709211104719872000")
§ /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160314.txt
- +21368
- Min : "id" : NumberLong("705845567537221632"),
- Max : "id" : NumberLong("709550216467185664"),
- Count : 253664
- +38452
- Min: "id_str" : "705845567537221632",
- Max: "id" : NumberLong("709817136437403649")
- Count : 292116
- +17331
- Min: "id" : NumberLong("705845567537221632"),
- Max: "id" : NumberLong("710312094227300352"),
- Count : 309447
- +10797
- Count : 320244
- +11511
- Count : 331755
- +5166
- Count : 336921

3-Twitter gives lots of fields in Mongo

4-Extract Retweet Fields & 5-Store in csv

6-Read as dataframe
7-Create Vertices, Edges & Objects
8-Finally the graph & run algorithms
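
Steps 7 and 8 can be sketched roughly as follows. Everything about the data here is hypothetical: the four-column record layout (retweeter id, retweeter name, original author id, original author name), the sample rows and the user names stand in for the extracted CSV, which is replaced by an in-memory collection so the example is self-contained. It is not the notebook's actual code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("retweet-graph").setMaster("local[*]"))

// HYPOTHETICAL rows: (retweeterId, retweeterName, authorId, authorName)
val retweets = sc.parallelize(Seq(
  (1L, "ann", 10L, "go_fan"),
  (2L, "bob", 10L, "go_fan"),
  (3L, "cat", 10L, "go_fan"),
  (3L, "cat", 11L, "baduk_news"),
  (1L, "ann", 10L, "go_fan")))

// Vertices: every user seen on either side of a retweet
val vertices: RDD[(VertexId, String)] = retweets
  .flatMap { case (uid, uname, aid, aname) => Seq((uid, uname), (aid, aname)) }
  .distinct()

// Edges: retweeter -> original author, attribute = number of retweets
val edges: RDD[Edge[Int]] = retweets
  .map { case (uid, _, aid, _) => ((uid, aid), 1) }
  .reduceByKey(_ + _)
  .map { case ((src, dst), n) => Edge(src, dst, n) }

val graph: Graph[String, Int] = Graph(vertices, edges)

// Influencers: rank users, then join the names back on
val ranked = graph.staticPageRank(numIter = 10).vertices
  .join(vertices)
  .map { case (_, (rank, name)) => (name, rank) }
  .sortBy(_._2, ascending = false)
  .collect()
```

Pointing edges from retweeter to original author means rank flows toward the accounts being retweeted, which matches the "influencer in retweet" use of PageRank.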

The Art of an AlphaGo GraphX
- Vertex : Vertex(VertexId, VD); VD can be any object
- Edges : Edge(ED); ED can be any object
- Graph : Graph(VD, ED)
- Vertex, Edge, Vertex : [(VertexId, VD)] [(VertexId, VertexId, ED)] [(VertexId, VD)]
