Krishna Sankar
https://www.linkedin.com/in/ksankar
June 1, 2016
1. Paco Nathan, Scala Days 2015, https://www.youtube.com/watch?v=P_V71n-gtDs
2. Big Data Analytics with Spark, Apress, http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656
3. Apache Spark Graph Processing, Packt, https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing
4. Spark GraphX in Action, Manning
5. http://hortonworks.com/blog/introduction-to-data-science-with-apache-spark/
6. http://stanford.edu/~rezab/nips2014workshop/slides/ankur.pdf
7. Mining Massive Datasets book v2, http://infolab.stanford.edu/~ullman/mmds/ch10.pdf
- http://web.stanford.edu/class/cs246/handouts.html
8. http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
9. http://kukuruku.co/hub/algorithms/social-network-analysis-spark-graphx
10. Zeppelin Setup
- http://sparktutorials.net/setup-your-zeppelin-notebook-for-data-science-in-apache-spark
11. Data
- http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time
- http://openflights.org/data.html
12. https://adventuresincareerdevelopment.files.wordpress.com/2015/04/standing-on-the-shoulders-of-giants.png
Thanks to the Giants whose work helped to prepare this tutorial
Agenda (Topic, Time, Data, Description)
1. Graph Processing Is Introduced, 10 Min
- Description: Graph Processing vs GraphDB, Introduction to Spark GraphX
2. Basics Are Discussed & A Graph Is Built, 10 Min
- Data: Simple Graph
- Description: Edge, Vertex, Graph
3. GraphX API Landscape Is Discussed, 5 Min
- Description: Explain APIs
4. Graph Structures Are Examined, 5 Min
- Description: Indegree, outdegree et al
5. Community, Affiliation & Strengths Are Explored, 5 Min
- Description: Connected components, Triangle et al
6. Algorithms Are Developed, 10 Min
- Description: PartitionStrategy (is what makes GraphParallel work), PageRank, Connected Components; APIs to implement algorithms viz. aggregateMessages, Pregel; Type Signature of aggregateMessages; Algorithms with aggregateMessages
7. Case Study (AlphaGo Tweets) Is Conducted, 10 Min
- Data: AlphaGo Retweet Data
- Description: E2E pipeline w/ real data, map attributes to properties, vertices & edges, create graph and run algorithms
8. Questions Are Asked & Answered, 10 Min
AGENDA Titles Styled after Asimov’s ROBOT SERIES !!!
Graph Applications
§ Many applications, from social networks to understanding collaboration, diseases, routing, logistics and others
§ Came across 3 interesting applications as I was preparing the materials!
1) Project effectiveness by social network analysis on the projects' collaboration structures :
“EC is interested in the impact generated by those projects … analyzing this latent impact
of TEL projects by applying social network analysis on the projects' collaboration structures
… TEL projects funded under FP6, FP7 and eContentplus, and identifies central projects
and strong, sustained organizational collaboration ties.”
2) Weak ties in pathogen transmission : “structural motif … Giraffe social networks are
characterized by social cliques in which individuals associate more with members of their
own social clique than with those outside their clique … Individuals involved in weak,
between-clique social interactions are hypothesized to serve as bridges by which an
infection may enter a clique and, hence, may experience higher infection risk … individuals
who engaged in more between-clique associations, that is, those with multiple weak ties,
were more likely to be infected with gastrointestinal helminth parasites ...”
3) Panama papers : The leak presented them with a wealth of information, millions of
documents, but no guide to structure … (i.e. community detection)
Graph based systems
§ Graphs are everywhere – web pages, social networks, people’s web browsing behavior, logistics etc.
- Working w/ graphs has become more important, and harder due to the scale and complexity of algorithms
§ Two kinds – processing and querying
§ Processing – GraphX, Pregel BSP, GraphLab
§ Querying – AllegroGraph, Titan, Neo4j, RDF stores
- SPARQL query language
§ Graph DBs have queries, can process large datasets
§ Graph processing can run complex algorithms
§ For Graph Analytics systems both are required
- Process on GraphX, store in Neo4j is a good alternative
- E.g. Panama papers analytics
Graph Processing Frameworks
§ History – Graph based systems were specialized in nature
- Graph partitioning is hard
- Primitives of record based systems were different from what graph based systems needed
§ Rapid innovation in distributed data processing frameworks – MapReduce etc
- Disk based, with limited partitioning abilities
- Still an impedance mismatch
§ Graph-parallel over data-parallel RDD! – Best of both worlds?
- We will see GraphX’s PartitionStrategy later
Enter Spark
§ More powerful partitioning mechanism
§ In-memory system makes iterative processing easier
§ GraphX – Graph processing built on top of Spark
§ Graph processing at scale (distributed system)
§ Fast evolving project
GraphX …
§ Graph processing at scale – “Graph Parallel System”
§ Has a rich
- Computational Model
- Edges, Vertices, Property Graph
- Bipartite & tripartite graphs (Users-Tags-Web pages)
- Algorithms (PageRank, …)
§ Current focus is on computation rather than query
§ APIs include:
§ Attribute transformers,
§ Structure transformers,
§ Connection Mining & Algorithms
§ GraphFrames – interesting development
§ Exercises (lots of interesting graph data)
- Airline data, co-occurrence & co-citation from papers
- The AlphaGo Community – ReTweet network
- Wikipedia PageRank analysis using GraphX
GraphX paper: https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-gonzalez.pdf
Pregel paper: http://kowshik.github.io/JPregel/pregel_paper.pdf
Scala, Python support: https://issues.apache.org/jira/browse/SPARK-3789
Java API for GraphX: https://issues.apache.org/jira/browse/SPARK-3665
LDA: https://issues.apache.org/jira/browse/SPARK-1405
GraphX-Basics
Vertex: Vertex(VertexId, VD), VD can be any object
Edges: Edge(ED), ED can be any object
Graph: Graph(VD, ED)
Structural Queries: inDegrees, vertices, …
Attribute Transformers: mapVertices, …
Structure Transformers, Join: reverse, subgraph
Connection Mining: connectedComponents, triangles
Algorithms: aggregateMessages, PageRank
Interesting Objects
o EdgeTriplet
o EdgeContext
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
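The API categories listed on this slide can be tried on a tiny graph. The following is a minimal sketch rather than the notebook's code: the three vertices, their names and the edge weights are made up for illustration, and `SparkContext.getOrCreate` is used so that the snippet reuses the `sc` of a Zeppelin or spark-shell session when one exists.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Reuses the notebook's context when one exists, otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graphx-basics").setMaster("local[*]"))

// Vertices are (VertexId, VD) pairs; here VD is a String (hypothetical names).
val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
// Edges are Edge(srcId, dstId, ED); here ED is an Int weight (hypothetical values).
val edges: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 10), Edge(2L, 3L, 20), Edge(1L, 3L, 30)))
val graph: Graph[String, Int] = Graph(vertices, edges)

// Structural query: in-degree per vertex (vertices with no incoming edge are absent).
val inDeg = graph.inDegrees.collect().toMap

// Attribute transformer: change vertex attributes, keep the structure.
val upper = graph.mapVertices((id, name) => name.toUpperCase)

// Structure transformers: flip every edge, or keep only the edges passing a predicate.
val reversed = graph.reverse
val heavy = graph.subgraph(epred = t => t.attr >= 20)

// EdgeTriplet: an edge together with the attributes of both endpoints.
val described = graph.triplets
  .map(t => s"${t.srcAttr} -[${t.attr}]-> ${t.dstAttr}")
  .collect().sorted
```

Mapping each triplet to a string before `collect()` sidesteps any surprises from collecting the triplet objects themselves.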
GraphX-Basics
o Computational Model
o Directed MultiGraph
o Directed – so in-degree & out-degree
o MultiGraph – so multiple parallel edges between nodes, including loops!
o Algorithms beware – the graph can contain cycles and loops
o Property Graph
o VertexId is a 64-bit Long
o Need property, cannot ignore it
o Vertex, Edge parameterized over object types
o Vertex[(VertexId, VD)], Edge[ED]
o Attach user-defined objects to edges and vertices (ED/VD)
Graphs and Social Networks – Prof. Jeffrey D. Ullman, Stanford University
http://web.stanford.edu/class/cs246/handouts.html
G-N Algorithm for betweenness
Good exercises: https://www.sics.se/~amir/files/download/dic/answers6.pdf
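To make the "directed multigraph" point concrete, here is a small sketch with two parallel edges and a self-loop. The vertex IDs and edge labels are invented for illustration; the calls used (`Graph.fromEdges`, `inDegrees`, `outDegrees`, `partitionBy`, `groupEdges`) are part of the GraphX API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("multigraph").setMaster("local[*]"))

// Two parallel edges 1 -> 2 and a self-loop 2 -> 2 (hypothetical data).
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(1L, 2L, "retweets"),
  Edge(2L, 2L, "self")))

// fromEdges creates every vertex mentioned by an edge and gives it the default attribute.
val graph = Graph.fromEdges(edges, defaultValue = "n/a")

// Parallel edges and loops all count towards the degrees.
val out = graph.outDegrees.collect().toMap // vertex 1 has 2 outgoing edges, vertex 2 has 1
val in = graph.inDegrees.collect().toMap   // vertex 2 has 3 incoming edges

// groupEdges merges parallel edges, but only those that sit in the same partition,
// so the graph is repartitioned first.
val merged = graph
  .partitionBy(PartitionStrategy.RandomVertexCut)
  .groupEdges((a, b) => a + "+" + b)
```

An algorithm that assumes a simple graph would see vertex 2 with three incoming edges here, which is why the slide warns "algorithms beware".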
GraphX-Basics
§ For our hands-on we will run through 2 Zeppelin notebooks
(https://goo.gl/qCwZiq & https://goo.gl/EKHCFq):
a. First we will work with the following Giraffe graph
o … carefully chosen with interesting betweenness and properties
o … for understanding the APIs
b. Second, we will develop a retweet pipeline
o With the AlphaGo twitter topic
o 2 GB/330K tweets, 200K edges, …
[Figure: the Giraffe graph with vertices A–G; edge betweenness labels 12, 5, 5, 4.5, 4.5, 4, 1.5, 1.5, 1]
// Vertex attribute type; the field names are assumed (the slide shows only the constructor calls)
case class Person(name: String, age: Int)
// Vertex list for the Giraffe graph: (VertexId, Person)
val vertexList = List(
  (1L, Person("Alice", 18)),
  (2L, Person("Bernie", 17)),
  (3L, Person("Cruz", 15)),
  (4L, Person("Donald", 12)),
  (5L, Person("Ed", 15)),
  (6L, Person("Fran", 10)),
  (7L, Person("Genghis", 854)))
Fun facts about betweenness centrality: it shows how many shortest paths an
edge is part of, i.e. its relevance. High betweenness centrality is the sign of a
bottleneck or single point of failure – these edges need HA, probably need
alternate paths for rerouting, are susceptible to parasite infections and
are good candidates for a cut!
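The betweenness values in the figure can be reproduced with a short plain-Scala sketch (no Spark needed). Two things here are assumptions, not taken from the slides: the edge set is the seven-vertex Girvan-Newman example from Mining Massive Datasets, which the figure appears to follow, and the computation uses Brandes-style accumulation (a BFS from every vertex, then credit passed back towards the root) rather than whatever the notebook does.

```scala
import scala.collection.mutable

// Assumed edge set: the Girvan-Newman example graph from Mining Massive Datasets, ch. 10.
val edges = Seq("A" -> "B", "A" -> "C", "B" -> "C", "B" -> "D",
                "D" -> "E", "D" -> "F", "D" -> "G", "E" -> "F", "F" -> "G")

// Undirected adjacency lists.
val adj: Map[String, Seq[String]] =
  (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }

// Canonical key for an undirected edge.
def key(a: String, b: String): (String, String) = if (a < b) (a, b) else (b, a)

val betweenness = mutable.Map[(String, String), Double]().withDefaultValue(0.0)

for (s <- adj.keys) {
  // Breadth-first search from s, counting the shortest paths (sigma) reaching each vertex.
  val dist = mutable.Map(s -> 0)
  val sigma = mutable.Map(s -> 1.0).withDefaultValue(0.0)
  val preds = mutable.Map[String, List[String]]().withDefaultValue(Nil)
  val order = mutable.ArrayBuffer[String]()
  val queue = mutable.Queue(s)
  while (queue.nonEmpty) {
    val v = queue.dequeue()
    order += v
    for (w <- adj(v)) {
      if (!dist.contains(w)) { dist(w) = dist(v) + 1; queue.enqueue(w) }
      if (dist(w) == dist(v) + 1) { sigma(w) += sigma(v); preds(w) = v :: preds(w) }
    }
  }
  // Walk back from the farthest vertices, splitting each vertex's credit among its parents.
  val delta = mutable.Map[String, Double]().withDefaultValue(0.0)
  for (w <- order.reverse; v <- preds(w)) {
    val credit = sigma(v) / sigma(w) * (1.0 + delta(w))
    betweenness(key(v, w)) += credit
    delta(v) += credit
  }
}

// Every pair of vertices was counted once from each endpoint, so halve the totals.
val result = betweenness.map { case (e, b) => e -> b / 2 }.toMap
```

Under the assumed edge set, the B-D edge carries every shortest path between {A, B, C} and {D, E, F, G}, which is what makes it the obvious candidate for a cut.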
Build A Graph
§ There are 4 ways
- GraphLoader.edgeListFile(…)
- From RDDs (shown above)
- Graph.fromEdgeTuples <- ID tuples
- Graph.fromEdges <- edge list
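The four construction routes can be sketched side by side as follows. The vertex names, weights and the temporary edge-list file are illustrative assumptions; the file is written to the local filesystem, so this form only works where the executors can read that path (for example in local mode).

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, GraphLoader}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("build-a-graph").setMaster("local[*]"))

// 1) From a vertex RDD and an edge RDD (hypothetical names and weights).
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bernie"), (3L, "Cruz")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 5.0), Edge(2L, 3L, 5.0), Edge(1L, 3L, 1.0)))
val g1 = Graph(vertices, edges)

// 2) From edges only: vertices are derived and receive the default attribute.
val g2 = Graph.fromEdges(edges, defaultValue = "unknown")

// 3) From (srcId, dstId) tuples: every edge gets the attribute 1.
val g3 = Graph.fromEdgeTuples(sc.parallelize(Seq((1L, 2L), (2L, 3L))), defaultValue = 0)

// 4) From an edge-list text file with one "srcId dstId" pair per line.
val path = Files.createTempFile("edges", ".txt")
Files.write(path, "1 2\n2 3\n1 3\n".getBytes("UTF-8"))
val g4 = GraphLoader.edgeListFile(sc, path.toUri.toString)
```

Routes 2 to 4 trade control over vertex attributes for convenience; a `joinVertices` or `outerJoinVertices` call can attach real attributes afterwards.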
Graph API Landscape
[Figure: GraphX API landscape, with callouts “Separate algorithm from Graph Implementation” and “Implementation of analytic functions”]
Hint:
Many details are documented in the companion object, not in the class, e.g.
PartitionStrategy, Graph, …
GraphX Structure API
Community-Affiliation-Strengths
§ Applied in many ways
§ For example in Fraud & Security Applications
§ Triangle detection – for spam servers
§ The age of a community is related to the density of triangles
- A new community will have few triangles, then triangles start to form
§ Strong affiliation, i.e. heavy hitter = a vertex with degree of at least sqrt(m), m being the number of edges
- Heavy hitter triangle!
§ Connected Communities – structure
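Triangle counts and connected components can be tried on the Giraffe graph. This sketch assumes the same edge set as the Girvan-Newman example in Mining Massive Datasets, with A..G numbered 1..7; the edges are written with the smaller ID first and the graph is repartitioned before `triangleCount()`, which older Spark releases require.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, PartitionStrategy}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("community").setMaster("local[*]"))

// Assumed Giraffe edges, A=1 .. G=7, each written as (smaller id, larger id).
val rawEdges = sc.parallelize(Seq(
  (1L, 2L), (1L, 3L), (2L, 3L), (2L, 4L),
  (4L, 5L), (4L, 6L), (4L, 7L), (5L, 6L), (6L, 7L)))
val graph = Graph.fromEdgeTuples(rawEdges, 1)
  .partitionBy(PartitionStrategy.RandomVertexCut)

// Number of triangles each vertex takes part in.
val triangles = graph.triangleCount().vertices.collect().toMap

// Every vertex is labelled with the lowest vertex id of its component.
val labels = graph.connectedComponents().vertices.map(_._2).distinct().collect().sorted

// Cutting the B-D edge (2 -> 4) splits the graph into two communities.
val cut = graph.subgraph(epred = t => !(t.srcId == 2L && t.dstId == 4L))
val labelsAfterCut = cut.connectedComponents().vertices.map(_._2).distinct().collect().sorted
```

Vertex D (4) sits in two triangles (D-E-F and D-F-G), while the others sit in at most two; after the cut the components are labelled 1 and 4.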
Algorithms
§ Graph-Parallel Computation
- aggregateMessages() function
- Pregel() (https://issues.apache.org/jira/browse/SPARK-5062)
§ pageRank()
- Influential papers in a citation network
- Influencer in retweet
§ staticPageRank()
- Static number of iterations, versus the dynamic tolerance of pageRank() – see the parameters (tol vs. numIter)
§ personalizedPageRank()
- Personalized PageRank is a variation on PageRank that gives a rank
relative to a specified “source” vertex in the graph – “People You May
Know”
§ shortestPaths, SVD++
§ LabelPropagation (LPA) as described by Raghavan et al.
in their 2007 paper “Near linear time algorithm to detect
community structures in large-scale networks”
- Computationally inexpensive way to identify communities
- Convergence not guaranteed
- Might end up with a trivial solution, i.e. a single community
§ SVDPlusPlus takes an RDD of Edges
§ The Global Clustering Coefficient is better in that it
always returns a number between 0 and 1
- For comparing connectedness between different
sized communities
http://graphframes.github.io/user-guide.html
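The three PageRank entry points differ mainly in how they stop and where the rank originates. A hedged sketch on a made-up four-vertex retweet graph (the edges are illustrative, not from the AlphaGo data):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.Graph

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pagerank").setMaster("local[*]"))

// Hypothetical retweet edges: 2, 3 and 4 all point at 1, and 1 points back at 2.
val graph = Graph.fromEdgeTuples(
  sc.parallelize(Seq((2L, 1L), (3L, 1L), (4L, 1L), (1L, 2L))), 1)

// Dynamic version: iterates until no rank changes by more than the tolerance.
val dynamicRanks = graph.pageRank(0.001).vertices

// Static version: runs a fixed number of iterations.
val staticRanks = graph.staticPageRank(10).vertices

// Personalized version: ranks are relative to the source vertex 2.
val personalRanks = graph.personalizedPageRank(2L, 0.001).vertices

// The vertex with the highest rank is the "influencer".
val influencer = dynamicRanks.collect().maxBy(_._2)._1
```

The absolute rank values depend on the Spark version (later releases normalize the ranks), so the sketch only relies on their ordering.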
§ Versatile function useful for implementing PageRank et al
- Can be difficult at first, but easier to comprehend if treated
as a combined map-reduce function (it was called
mapReduceTriplets, with a slightly different signature)
o aggregateMessages[Msg](
o   sendMsg: EdgeContext => Unit, <- the “map”; it can send up or down, i.e. sendToDst or sendToSrc
o   mergeMsg: (Msg, Msg) => Msg) <- the “reduce”
If you really want to know what is
underneath the aggregateMessages()…
sendToDst vs sendToSrc
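Read as map plus reduce, aggregateMessages looks like this in practice. The sketch below is an illustration under assumptions: the three people and the "follows" edges are invented, and vertex attributes are plain (name, age) tuples to keep the snippet notebook-friendly.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexRDD}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("aggregate-messages").setMaster("local[*]"))

// Hypothetical "follows" graph; the vertex attribute is (name, age).
val vertices = sc.parallelize(Seq(
  (1L, ("Alice", 18)), (2L, ("Bernie", 17)), (3L, ("Cruz", 15))))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(3L, 2L, 1), Edge(2L, 1L, 1)))
val graph = Graph(vertices, edges)

// "Map": every edge sends the follower's age down to the followed vertex.
// "Reduce": each vertex keeps the largest age it received.
val oldestFollower: VertexRDD[Int] = graph.aggregateMessages[Int](
  ctx => ctx.sendToDst(ctx.srcAttr._2),
  (a, b) => math.max(a, b))

// Sending up instead of down: one message per edge to the source gives the out-degree.
val outDegree: VertexRDD[Int] = graph.aggregateMessages[Int](
  ctx => ctx.sendToSrc(1),
  _ + _)

val oldest = oldestFollower.collect().toMap
val degrees = outDegree.collect().toMap
```

Vertices that receive no message (Cruz has no followers here) simply do not appear in the result.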
Pregel BSP API
§ Bulk Synchronous Parallel Message Passing API
- Developed by Leslie Valiant [1]
- Synchronized distributed super steps
- Super step – local, in-memory computations
§ Computations on the vertex – “Think like a Vertex”
§ graphx.lib.LabelPropagation algorithm uses Pregel internally
§ graphx.lib.PageRank uses Pregel (for the runUntilConvergenceWithOptions version, while
it uses aggregateMessages for the runWithOptions (staticPageRank) version)
§ Pregel is implemented succinctly with the old mapReduceTriplets API
[1] http://delivery.acm.org/10.1145/80000/79181/p103-valiant.pdf
http://www.slideshare.net/riyadparvez/pregel-35504069
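"Think like a vertex" is easiest to see in single-source shortest paths, the example used in the GraphX programming guide. The sketch below follows that pattern on a made-up four-vertex graph; the edge weights are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pregel-sssp").setMaster("local[*]"))

// Hypothetical weighted edges.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0), Edge(3L, 4L, 1.0)))
val sourceId: VertexId = 1L

// Every vertex starts at infinity, except the source at distance 0.
val init = Graph.fromEdges(edges, 0.0)
  .mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = init.pregel(Double.PositiveInfinity)(
  // Vertex program: keep the shorter of the current and the received distance.
  (id, dist, newDist) => math.min(dist, newDist),
  // Send message: offer the neighbour a shorter path when this edge provides one.
  triplet =>
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  // Merge messages: several offers in one super step collapse to the best one.
  (a, b) => math.min(a, b))

val distances = sssp.vertices.collect().toMap
```

Each super step only involves vertices that received a message, so the computation stops on its own once no shorter path is found.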
GraphX Partitioning & Processing
§ Unlike a relational table, graph processing is contextual w.r.t.
a neighborhood
- Maintain locality, equal size partitioning
§ Edge-cut: the communication & storage overhead of an
edge-cut is directly proportional to the number of edges
that are cut
§ Vertex-cut: the communication and storage overhead of a
vertex-cut is directly proportional to the sum of the
number of machines spanned by each vertex
§ Vertex-cut strategy by default (balances the hot-spot issue due to
power law/Zipf’s Law) w/ min replication of edges
§ Batch processing, not streaming
§ org.apache.spark.graphx.Graph
- partitionBy(partitionStrategy: PartitionStrategy, numPartitions: Int)
- partitionBy(partitionStrategy: PartitionStrategy)
- 4 PartitionStrategies
1) RandomVertexCut (usually the best): random vertex cut that colocates all same-
direction edges between two vertices (hashing the source and destination vertex IDs)
2) CanonicalRandomVertexCut: a random vertex cut that colocates all edges between
two vertices, regardless of direction (hashing the source and destination vertex IDs in
a canonical direction) … remember GraphX is a multigraph
3) EdgePartition1D: assigns edges to partitions using only the source vertex ID,
colocating edges with the same source
4) EdgePartition2D: 2D partitioning of the sparse edge adjacency matrix
http://www.istc-cc.cmu.edu/publications/papers/2013/grades-graphx_with_fonts.pdf
https://www.sics.se/~amir/files/download/papers/jabeja-vc.pdf
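The colocation guarantees of the strategies can be checked directly, since each strategy exposes `getPartition(src, dst, numParts)`. The sketch below uses an invented six-edge graph and four partitions; both numbers are arbitrary choices for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, PartitionStrategy}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("partitioning").setMaster("local[*]"))

// Hypothetical directed edges, including the pair 1 -> 2 and 2 -> 1.
val graph = Graph.fromEdgeTuples(sc.parallelize(Seq(
  (1L, 2L), (2L, 1L), (1L, 3L), (3L, 4L), (4L, 1L), (2L, 4L))), 1)

// Repartition the edges with the 2D strategy into 4 partitions.
val by2D = graph.partitionBy(PartitionStrategy.EdgePartition2D, 4)
val edgesPerPartition = by2D.edges.mapPartitions(it => Iterator(it.size)).collect()

// EdgePartition1D looks only at the source, so edges leaving vertex 1 land together.
val sameSource =
  PartitionStrategy.EdgePartition1D.getPartition(1L, 2L, 4) ==
    PartitionStrategy.EdgePartition1D.getPartition(1L, 3L, 4)

// CanonicalRandomVertexCut ignores direction, so 1 -> 2 and 2 -> 1 land together.
val bothDirections =
  PartitionStrategy.CanonicalRandomVertexCut.getPartition(1L, 2L, 4) ==
    PartitionStrategy.CanonicalRandomVertexCut.getPartition(2L, 1L, 4)
```

Inspecting `edgesPerPartition` on real data is a quick way to see how evenly a strategy spreads the edges before running an expensive algorithm.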
AlphaGo Tweets Analytics Pipeline
Collect Data -> Transform & Extract Features -> GraphX Model -> GraphX Algorithms -> Store
1-Extract Tweets
§ Initially primed data (7 days from twitter)
§ Then used the sinceID to get incremental tweets
§ Use application authentication for higher rate
§ Used tweepy, wait_on_rate_limit=True, wait_on_rate_limit_notify=True
§ ~330K tweets (see Download program), 2 GB
§ => MongoDB (~820 MB w/ compression)
§ Rich data (see MongoDB)
- Retweet, user ids, hash tags, locations et al
§ This exercise only covers the retweet interest network
2-Pipeline screen shots
§ Get max tweet id
§ db.alphago.find().sort({id:-1}).limit(1).pretty()
- "id" : NumberLong("709550216467185664"),
§ Get min tweet id
§ db.alphago.find().sort({id:+1}).limit(1).pretty()
- "id" : NumberLong("709211132498714627")
§ /usr/local/mongo/bin/mongoimport --db admin --collection alphago --drop --file tweets-20160313.txt
- 232296
- Min : "id" : NumberLong("705845567537221632")
- Max : "id" : NumberLong("709211104719872000")
§ /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160314.txt
- +21368
- Min : "id" : NumberLong("705845567537221632"),
- Max : "id" : NumberLong("709550216467185664"),
- Count : 253664
§ /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160315.txt
- +38452
- Min: "id_str" : "705845567537221632",
- Max: "id" : NumberLong("709817136437403649")
- Count : 292116
§ /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160316.txt
- +17331
- Min: "id" : NumberLong("705845567537221632"),
- Max: "id" : NumberLong("710312094227300352"),
- Count : 309447
§ /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160318.txt
- +10797
- Min: "id" : NumberLong("705845567537221632"),
- Max: "id" : NumberLong("710883718903140353"),
- Count : 320244
§ /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160320.txt
- +11511
- Min: "id" : NumberLong("705845567537221632"),
- Max: "id" : NumberLong("711720090954108928"),
- Count : 331755
§ /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160322.txt
- +5166
- Min: "id" : NumberLong("705845567537221632"),
- Max: "id" : NumberLong("712410062820511744"),
- Count : 336921
3-Twitter gives lots of fields in Mongo
4-Extract Retweet Fields & 5-Store in csv
6-Read as dataframe
7-Create Vertices, Edges & Objects
8-Finally the graph & run algorithms
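Steps 6 to 8 can be sketched end to end as follows. This is not the notebook's code: it assumes Spark 2.x or later (`SparkSession`), stands in a small in-memory DataFrame for the CSV, and the column names (`userId`, `userName`, `rtUserId`, `rtUserName`) and rows are hypothetical.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("retweets").master("local[*]").getOrCreate()
import spark.implicits._

// Step 6: stand-in for the retweet CSV (who retweeted whom); columns are hypothetical.
val retweets = Seq(
  (11L, "ann", 22L, "bob"),
  (33L, "cat", 22L, "bob"),
  (11L, "ann", 22L, "bob"),
  (22L, "bob", 44L, "dan")
).toDF("userId", "userName", "rtUserId", "rtUserName")

// Step 7: vertices are all users on either side of a retweet, attribute = user name.
val vertices = retweets.select("userId", "userName")
  .union(retweets.select("rtUserId", "rtUserName"))
  .distinct()
  .rdd.map(r => (r.getLong(0), r.getString(1)))

// One edge per (retweeter, author) pair, attribute = number of retweets.
val edges = retweets.groupBy("userId", "rtUserId").count()
  .rdd.map(r => Edge(r.getLong(0), r.getLong(1), r.getLong(2)))

// Step 8: build the graph and run an algorithm, here the total retweets received.
val graph = Graph(vertices, edges)
val received = graph.aggregateMessages[Long](ctx => ctx.sendToDst(ctx.attr), _ + _)
val (topId, topCount) = received.collect().maxBy(_._2)
val topName = graph.vertices.filter(_._1 == topId).first()._2
```

With the real data, the in-memory `Seq` would be replaced by a CSV read, and the same graph can be handed to `pageRank` or `connectedComponents`.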
The Art of an AlphaGo GraphX
Vertex: Vertex(VertexId, VD), VD can be any object
Edges: Edge(ED), ED can be any object
Graph: Graph(VD, ED)
[Figure: Vertex [(VertexId, VD)] – Edge [(VertexId, VertexId, ED)] – Vertex [(VertexId, VD)], with steps 1–5 marked]
An excursion into Graph Analytics with Apache Spark GraphX

An excursion into Graph Analytics with Apache Spark GraphX

  • 2. 1. Paco Nathan, Scala Days 2015, https://www.youtube.com/watch?v=P_V71n-gtDs 2. Big Data Analytics withSpark-Apress http://www.amazon.com/Big-Data-Analytics-Spark- Practitioners/dp/1484209656 3. Apache Spark Graph Processing-Packt https://www.packtpub.com/big-data-and-business-intelligence/apache- spark-graph-processing 4. Spark GarphX in Action - Manning 5. http://hortonworks.com/blog/introduction-to-data-science-with-apache-spark/ 6. http://stanford.edu/~rezab/nips2014workshop/slides/ankur.pdf 7. Mining Massive Datasets book v2 http://infolab.stanford.edu/~ullman/mmds/ch10.pdf - http://web.stanford.edu/class/cs246/handouts.html 8. http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm 9. http://kukuruku.co/hub/algorithms/social-network-analysis-spark-graphx 10. Zeppelin Setup - http://sparktutorials.net/setup-your-zeppelin-notebook-for-data-science-in-apache-spark 11. Data - http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time - http://openflights.org/data.html 12. https://adventuresincareerdevelopment.files.wordpress.com/2015/04/standing-on-the-shoulders-of-giants.png Thanks to the Giants whose work helped to prepare this tutorial
  • 3. Agenda-Topic Time Data Description 1. Graph Processing Is Introduced 10 Min Graph Processing vs GraphDB, Introduction to Spark GraphX 2. Basics are Discussed & A Graph Is Built 10 Min Simple Graph Edge,Vertex, Graph 3. GraphX API Landscape Is discussed 5 Min Explain APIs 4. Graph Structures Are Examined 5 Min Indegree, outdegree et al 5. Community, Affiliation & Strengths Are Explored 5 Min Connected components, Triangle et al 6. Algorithms Are Developed 10 Min PartitionStrategy (is what makes GraphParallel work), PageRank, Connected Components; APIs to implement Algorithms viz. aggregateMessages, Pregel Type Signature - aggregateMessages Algorithms with aggregateMessages 7. Case Study (AlphaGo Tweets) Is Conducted 10 Min AlphaGo Retweet Data E2E Pipeline w/real data, Map attributes to properties, vertices & edges, create graph and run algorithms 8. Questions Are Asked & Answered 10 Min AGENDA Titles Styled after Asimov’s ROBOT SERIES !!!
  • 4. Graph Applications § Many applications from social network to understanding collaboration,diseases, routing logistics and others § Came across 3 interesting applications, as I was preparing the materials !! 1) Project effectiveness by social network analysis on the projects' collaboration structures : “EC is interested in the impact generated by those projects … analyzing this latent impact of TEL projects by applying social network analysis on the projects' collaboration structures … TEL projects funded under FP6, FP7 and eContentplus, and identifies central projects and strong, sustained organizational collaboration ties.” 2) Weak ties in pathogen transmission : “structural motif … Giraffe social networks are characterized by social cliques in which individuals associate more with members of their own social clique than with those outside their clique … Individuals involved in weak, between-clique social interactions are hypothesized to serve as bridges by which an infection may enter a clique and, hence, may experience higher infection risk … individuals who engaged in more between-clique associations, that is, those with multiple weak ties, were more likely to be infected with gastrointestinal helminth parasites ... “ 3) Panama papers : The leak presented them with a wealth of information, millions of documents, but no guide to structure … (ie community detection)
  • 5. Graph based systems § Graphs are everywhere – web pages, social networks, people’s web browsingbehavior, logisticsetc. - Workingw/ graphshas become more important. And harder due to the scaleand complexityof algorithms § Two kinds– processingandquerying § Processing– GraphX, Pregel BSP, GraphLab § Querying– AllegroGraph,Titan, Neo4j, RDF stores. - SPARQL query language § Graph DBs have queries, canprocess large dataset § Graph processingcanrun complex algorithms § For Graph Analyticssystemsboth are required - Processon graphx, store in neo4j is a good alternative. - E.g. Panamapapers analytics
  • 6. Graph Processing Frameworks § History – Graph based systems were specializedinnature - Graph partitioningis hard - Primitives of record based systemswere different than what graphbased systems needed § Rapid innovationin distributed data processingframeworks– MapReduce etc - Disk based, with limited partitioningabilities - Still an impedancemismatch § Graph-parallelover Data ParallelRDD ! – Best of both worlds ? ü We will see, later, GrapnX’s PartitionStrategy
  • 7. Enter Spark§ More powerful partitioning mechanism § In-memory system makes iterative processing easier § GraphX – Graph processing built ontop of Spark § Graph processing at scale (distributed system) § Fast evolving project § § - - § § §
  • 8. GraphX …§ Graph processing at Scale - “Graph Parallel System” § Has a rich - Computational Model - Edges, Vertices, Property Graph - Bipartite & tripartite graphs (Users-Tags-Web pages) - Algorithms (PageRank,…) § Current focus on computationthan query § APIs include : § Attribute Transformers, § Structure transformers, § Connection Mining & Algorithms § GraphFrames – interesting development § Exercises (lotsof interesting graph data) - Airline data,co-occurrence &co-citation from papers - The AlphaGo Community – ReTweet network - Wikipedia Page Rank Analysisusing GraphX Graphx Paper : https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-gonzalez.pdf Pragel paper : http://kowshik.github.io/JPregel/pregel_paper.pdf Scala, python support https://issues.apache.org/jira/browse/SPARK-3789 Java API for Graphxhttps://issues.apache.org/jira/browse/SPARK-3665 LDA https://issues.apache.org/jira/browse/SPARK-1405
  • 9. Vertex Vertex(VertexId,VD) VD can be any object Edges Edge(ED) ED can be any object Graph Graph(VD,ED) GraphX-Basics V StructuralQueries Indegrees, vertices,… AttributeTransformers mapVertices, … StructureTransformers, Join Reverse, subgraph Connection Mining connectedComponents, triangles Algorithms Aggregate messages, PageRank InterestingObjects o EdgeTriplet o EdgeContext http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
  • 10. GraphX-Basics o Computational Model o Directed MultiGraph o Directed - so in-degree & out-degree o MultiGraph – so multiple parallel edges between nodes including loops ! o Algorithms beware – can cyclic, loops o Property Graph o vertexID(64bit long) int o Need property, cannot ignore it o Vertex,Edge Parameterized over object types o Vertex[(VertexId,VD)], Edge[ED] o Attach user-defined objects to edges and vertices (ED/VD) Graphs and Social Networks-Prof. Jeffrey D. Ullman Stanford University http://web.stanford.edu/class/cs246/handouts.html G-N Algorithm for inbetweennessGood exercises https://www.sics.se/~amir/files/download/dic/answers6.pdf
  • 11. GraphX-Basics
    § For our hands-on we will run through 2 Zeppelin notebooks (https://goo.gl/qCwZiq & https://goo.gl/EKHCFq):
      a. First we will work with the following Giraffe graph
        o … carefully chosen, with interesting betweenness and properties
        o … for understanding the APIs
      b. Second, we will develop a retweet pipeline
        o With the AlphaGo twitter topic
        o 2 GB/330K tweets, 200K edges, …
    [Figure: the Giraffe graph, vertices A-G, with each edge labeled by its betweenness]
    Graphs and Social Networks - Prof. Jeffrey D. Ullman, Stanford University http://web.stanford.edu/class/cs246/handouts.html (G-N Algorithm for betweenness)
    val vertexList = List(
      (1L, Person("Alice", 18)),
      (2L, Person("Bernie", 17)),
      (3L, Person("Cruz", 15)),
      (4L, Person("Donald", 12)),
      (5L, Person("Ed", 15)),
      (6L, Person("Fran", 10)),
      (7L, Person("Genghis", 854)))
    Good exercises: https://www.sics.se/~amir/files/download/dic/answers6.pdf
    Fun facts about betweenness centrality: it shows how many shortest paths an edge is part of, i.e. its relevancy. High betweenness centrality is the sign of a bottleneck, a single point of failure – these edges need HA, probably need alternate paths for rerouting, are susceptible to parasite infections, and are good candidates for a cut!
  • 12. Build A Graph
    § There are 4 ways
      - GraphLoader.edgeListFile(…)
      - From RDDs (shown above)
      - fromEdgeTuples <- Id tuples
      - fromEdges <- edgeList
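The four construction routes can be sketched side by side on the same three-edge cycle. This assumes a local Spark installation with GraphX; the temporary edge-list file, the names, and the "?" default attribute are illustrative assumptions rather than the notebook's data.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuse the notebook's context if one exists; otherwise start a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-build-sketch"))

// 1) GraphLoader.edgeListFile: one "srcId dstId" pair per line.
val path = Files.createTempFile("edges", ".txt")
Files.write(path, "1 2\n2 3\n3 1\n".getBytes("UTF-8"))
val g1 = GraphLoader.edgeListFile(sc, path.toUri.toString)

// 2) From a vertex RDD and an edge RDD.
val vertexRDD = sc.parallelize(Seq((1L, "Alice"), (2L, "Bernie"), (3L, "Cruz")))
val edgeRDD = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(3L, 1L, 3.0)))
val g2 = Graph(vertexRDD, edgeRDD)

// 3) From (srcId, dstId) tuples; every vertex gets the default attribute.
val g3 = Graph.fromEdgeTuples(
  sc.parallelize(Seq((1L, 2L), (2L, 3L), (3L, 1L))), defaultValue = "?")

// 4) From an edge RDD only; vertices are inferred and get the default attribute.
val g4 = Graph.fromEdges(edgeRDD, defaultValue = "?")

val sizes = Seq(g1, g2, g3, g4).map(g => (g.vertices.count(), g.edges.count()))
```

Routes 1, 3, and 4 infer the vertex set from the edges, so they suit data that arrives as an edge list; route 2 is the one to use when vertices carry their own properties.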
  • 13. Graph API Landscape
    Separate algorithm from Graph Implementation
    Implementation of analytic functions
  • 14. Hint: Many details are documented in the object, not in the class, e.g. PartitionStrategy, Graph, …
  • 16. Community-Affiliation-Strengths
    § Applied in many ways
    § For example in Fraud & Security Applications
    § Triangle detection – for spam servers
    § The age of a community is related to the density of triangles
      - A new community will have few triangles, then triangles start to form
    § Strong affiliation, i.e. heavy hitter = a node with at least sqrt(m) degrees, where m is the number of edges!
      - Heavy hitter triangle!
    § Connected Communities – structure
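Triangle counts and connected components can be read off a small graph to see what the two calls return. A minimal sketch, assuming a local Spark installation with GraphX; the six-vertex edge list is hypothetical. Edges are written with srcId < dstId and the graph is partitioned first, which older GraphX releases required for triangleCount.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuse the notebook's context if one exists; otherwise start a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-community-sketch"))

// One triangle (1, 2, 3), a tail vertex 4, and a separate pair (5, 6).
val edges = sc.parallelize(Seq((1L, 2L), (2L, 3L), (1L, 3L), (3L, 4L), (5L, 6L)))
val graph = Graph.fromEdgeTuples(edges, defaultValue = 1)
  .partitionBy(PartitionStrategy.RandomVertexCut)

// Each vertex is labeled with the smallest vertex id in its component.
val component = graph.connectedComponents().vertices.collect().toMap

// Each vertex is labeled with the number of triangles it participates in.
val triangles = graph.triangleCount().vertices.collect().toMap
```

Here vertices 1 to 4 share component id 1 and vertices 5 and 6 share component id 5; only vertices 1, 2, and 3 have a non-zero triangle count.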
  • 17. Algorithms
    § Graph-parallel computation
      - aggregateMessages() function
      - Pregel() (https://issues.apache.org/jira/browse/SPARK-5062)
    § pageRank()
      - Influential papers in a citation network
      - Influencers in a retweet network
    § staticPageRank()
      - Runs a static number of iterations, whereas pageRank() iterates until a dynamic tolerance is met – see the parameters (tol vs. numIter)
    § personalizedPageRank()
      - Personalized PageRank is a variation on PageRank that gives a rank relative to a specified “source” vertex in the graph – “People You May Know”
    § shortestPaths, SVD++
    § LabelPropagation (LPA) as described by Raghavan et al. in their 2007 paper “Near linear time algorithm to detect community structures in large-scale networks”
      - Computationally inexpensive way to identify communities
      - Convergence not guaranteed
      - Might end up with a trivial solution, i.e. a single community
    § SVDPlusPlus takes an RDD of Edges
    § The Global Clustering Coefficient is better in that it always returns a number between 0 and 1
      - For comparing connectedness between different-sized communities
    http://graphframes.github.io/user-guide.html
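The tol vs. numIter distinction shows up directly in the two PageRank calls. A minimal sketch, assuming a local Spark installation with GraphX; the hub-and-spoke edge list and the parameter values (0.0001 and 20) are illustrative choices, not recommendations from the tutorial.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuse the notebook's context if one exists; otherwise start a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-pagerank-sketch"))

// Vertex 1 is a hub: it links to 2, 3, and 4, and each of them links back.
val edges = sc.parallelize(Seq(
  (1L, 2L), (1L, 3L), (1L, 4L),
  (2L, 1L), (3L, 1L), (4L, 1L)))
val graph = Graph.fromEdgeTuples(edges, defaultValue = 1)

// Dynamic version: iterate until the ranks change by less than tol.
val dynamicRanks = graph.pageRank(tol = 0.0001).vertices.collect().toMap

// Static version: run exactly numIter iterations.
val staticRanks = graph.staticPageRank(numIter = 20).vertices.collect().toMap

val topDynamic = dynamicRanks.maxBy(_._2)._1
val topStatic = staticRanks.maxBy(_._2)._1
```

Both calls return a graph whose vertex attribute is the rank, so the result can be joined back to the original vertex properties to name the influencers.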
  • 18. § Versatile function, useful for implementing PageRank et al.
      - Can be difficult at first, but easier to comprehend if treated as a combined map-reduce function (it was called mapReduceTriplets, with a slightly different signature!)
        o aggregateMessages[Msg](
        o   map(edgeContext => mapFun),   <- this can be up or down, i.e. sendToDst or sendToSrc!
        o   reduce(Msg, Msg) => reduceFun)
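The map-reduce reading of aggregateMessages can be made concrete by counting each vertex's followers and summing their ages. A minimal sketch, assuming a local Spark installation with GraphX; the ages and the "follows" edges are made-up data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuse the notebook's context if one exists; otherwise start a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-aggregate-sketch"))

// Vertex attribute = age; an edge src -> dst means "src follows dst".
val vertices = sc.parallelize(Seq((1L, 18), (2L, 17), (3L, 15)))
val edges = sc.parallelize(Seq(
  Edge(1L, 3L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
val graph = Graph(vertices, edges)

// Msg = (follower count, sum of follower ages)
val followerStats = graph.aggregateMessages[(Int, Int)](
  ctx => ctx.sendToDst((1, ctx.srcAttr)),  // "map": one message per edge, sent downstream
  (a, b) => (a._1 + b._1, a._2 + b._2))    // "reduce": merge the messages arriving at a vertex

val stats = followerStats.collect().toMap
```

Switching sendToDst to sendToSrc reverses the direction of aggregation, so the same two functions would then describe whom each vertex follows. Vertices that receive no message do not appear in the result.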
  • 19. If you really want to know what is underneath the aggregateMessages()…
  • 21. Pregel BSP API
    § Bulk Synchronous Parallel Message Passing API
      - Developed by Leslie Valiant [1]
      - Synchronized distributed super steps
      - Super step – local, in-memory computations
    § Computations on the vertex – “Think like a Vertex”
    § The graphx.lib.LabelPropagation algorithm uses Pregel internally
    § graphx.lib.PageRank uses Pregel for the runUntilConvergenceWithOptions version, while it uses aggregateMessages for the runWithOptions (staticPageRank) version
    § Pregel is implemented succinctly with the old mapReduceTriplets API
    [1] http://delivery.acm.org/10.1145/80000/79181/p103-valiant.pdf
    http://www.slideshare.net/riyadparvez/pregel-35504069
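“Think like a Vertex” is easiest to see in single-source shortest paths, where each super step lets a vertex keep the smallest distance it has heard of and pass improvements along its out-edges. A minimal sketch, assuming a local Spark installation with GraphX; the weighted edges and the choice of vertex 1 as the source are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuse the notebook's context if one exists; otherwise start a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-pregel-sketch"))

// Edge attribute = distance between the two vertices.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0), Edge(3L, 4L, 1.0)))
val sourceId: VertexId = 1L

// Vertex attribute = best known distance from the source (infinity until reached).
val initial = Graph.fromEdges(edges, defaultValue = 0.0)
  .mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)

val shortest = initial.pregel(Double.PositiveInfinity)(
  // Vertex program: keep the smaller of the current and the received distance.
  (id, dist, newDist) => math.min(dist, newDist),
  // Send a message only when it would improve the neighbor's distance.
  triplet =>
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  // Merge messages arriving at the same vertex.
  (a, b) => math.min(a, b))

val distance = shortest.vertices.collect().toMap
```

The computation stops when a super step produces no messages, which here happens once no edge can shorten any distance.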
  • 22. GraphX Partitioning & Processing
    § Unlike a relational table, graph processing is contextual w.r.t. a neighborhood
      - Maintain locality, equal size partitioning
    § Edge-cut: the communication & storage overhead of an edge-cut is directly proportional to the number of edges that are cut
    § Vertex-cut: the communication and storage overhead of a vertex-cut is directly proportional to the sum of the number of machines spanned by each vertex
    § Vertex-cut strategy by default (balances the hot-spot issue due to power law/Zipf’s Law) w/ min replication of edges
    § Batch processing/not streaming
    § org.apache.spark.graphx.Graph
      - partitionBy(partitionStrategy: PartitionStrategy, numPartitions: Int)
      - partitionBy(partitionStrategy: PartitionStrategy)
      - 4 PartitionStrategies
        1) RandomVertexCut (usually the best): random vertex cut that colocates all same-direction edges between two vertices (hashing the source and destination vertex IDs)
        2) CanonicalRandomVertexCut: a random vertex cut that colocates all edges between two vertices, regardless of direction (hashing the source and destination vertex IDs in a canonical direction) … remember GraphX is a multigraph
        3) EdgePartition1D: assigns edges to partitions using only the source vertex ID, colocating edges with the same source
        4) EdgePartition2D: 2D partitioning of the sparse edge adjacency matrix
    http://www.istc-cc.cmu.edu/publications/papers/2013/grades-graphx_with_fonts.pdf
    https://www.sics.se/~amir/files/download/papers/jabeja-vc.pdf
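The colocation guarantees of the strategies above can be probed by asking a strategy where it would place an edge, and by repartitioning a small graph. A minimal sketch, assuming a local Spark installation with GraphX; the edge list and the partition counts (4 and 8) are arbitrary illustrative values.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuse the notebook's context if one exists; otherwise start a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-partition-sketch"))

val graph = Graph.fromEdgeTuples(
  sc.parallelize(Seq((1L, 2L), (2L, 1L), (1L, 3L), (3L, 4L))), defaultValue = 1)

// Repartition the edges with an explicit strategy and partition count.
val partitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D, 4)
val edgePartitions = partitioned.edges.partitions.length

val numParts = 8
// CanonicalRandomVertexCut ignores direction: 1 -> 2 and 2 -> 1 land together.
val p12 = PartitionStrategy.CanonicalRandomVertexCut.getPartition(1L, 2L, numParts)
val p21 = PartitionStrategy.CanonicalRandomVertexCut.getPartition(2L, 1L, numParts)
// EdgePartition1D only looks at the source: 1 -> 2 and 1 -> 3 land together.
val s12 = PartitionStrategy.EdgePartition1D.getPartition(1L, 2L, numParts)
val s13 = PartitionStrategy.EdgePartition1D.getPartition(1L, 3L, numParts)
```

Repartitioning changes only where edges live, not the graph itself, so the edge count is the same before and after partitionBy.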
  • 23. AlphaGo Tweets Analytics Pipeline
    Collect Data -> Transform & Extract Features -> GraphX Model -> GraphX Algorithms -> Store
    § Initially primed data (7 days from twitter)
    § Then used the sinceID to get incremental tweets
    § Use application authentication for a higher rate
    § Used tweepy, wait_on_rate_limit=True, wait_on_rate_limit_notify=True
    § ~330K tweets (see Download program), 2 GB
    § => MongoDB (~820 MB w/ compression)
    § Rich data (see MongoDB)
      - Retweet, user ids, hash tags, locations et al.
    § This exercise only covers the retweet interest network
  • 25. 2-Pipeline screen shots
    § Get max tweet id
    § db.alphago.find().sort({id:-1}).limit(1).pretty()
      - "id" : NumberLong("709550216467185664")
    § db.alphago.find().sort({id:+1}).limit(1).pretty()
      - "id" : NumberLong("709211132498714627")
    § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --drop --file tweets-20160313.txt
      - 232296
      - Min : "id" : NumberLong("705845567537221632")
      - Max : "id" : NumberLong("709211104719872000")
    § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160314.txt
      - +21368
      - Min : "id" : NumberLong("705845567537221632")
      - Max : "id" : NumberLong("709550216467185664")
      - Count : 253664
    § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160315.txt
      - +38452
      - Min : "id_str" : "705845567537221632"
      - Max : "id" : NumberLong("709817136437403649")
      - Count : 292116
    § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160316.txt
      - +17331
      - Min : "id" : NumberLong("705845567537221632")
      - Max : "id" : NumberLong("710312094227300352")
      - Count : 309447
    § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160318.txt
      - +10797
      - Min : "id" : NumberLong("705845567537221632")
      - Max : "id" : NumberLong("710883718903140353")
      - Count : 320244
    § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160320.txt
      - +11511
      - Min : "id" : NumberLong("705845567537221632")
      - Max : "id" : NumberLong("711720090954108928")
      - Count : 331755
    § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160322.txt
      - +5166
      - Min : "id" : NumberLong("705845567537221632")
      - Max : "id" : NumberLong("712410062820511744")
      - Count : 336921
  • 26. 3-Twitter gives lots of fields in Mongo
  • 28. 6-Read as DataFrame
    7-Create Vertices, Edges & Objects
    8-Finally the graph & run algorithms
  • 29. The Art of an AlphaGo GraphX
    Vertex: Vertex(VertexId, VD) - VD can be any object
    Edges: Edge(ED) - ED can be any object
    Graph: Graph(VD, ED)
    [Diagram: Vertex [(VertexId, VD)] - Edge [(VertexId, VertexId, ED)] - Vertex [(VertexId, VD)]]