MILAN 20/21.11.2015
Graphs are everywhere!
Distributed graph computing with Spark GraphX
Andrea Iacono
MILAN 20/21.11.2015 - Andrea Iacono
Agenda:
● Graph definitions and usages
● GraphX introduction
● Pregel
● Code examples
The main focus will be the programming model
The code is available at:
https://github.com/andreaiacono/TalkGraphX
A graph is a set of vertices and edges that connect them:
Graphs are used for modeling very different domains.
[Figure: a sample graph, with the labels "Edge" and "Vertex"]
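As a minimal sketch of this definition, an undirected graph can be stored as an adjacency list. The vertex names and the `build_adjacency` helper are illustrative assumptions, not part of the talk's code.

```python
def build_adjacency(vertices, edges):
    """Return {vertex: set of neighbours} for an undirected graph."""
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        # An undirected edge is visible from both of its endpoints.
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


# Hypothetical sample data: four vertices and four edges.
vertices = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]

graph = build_adjacency(vertices, edges)
print(sorted(graph["C"]))  # ['A', 'B', 'D']
```

The same vertex and edge lists are all that is needed to model very different domains; only the meaning attached to them changes.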
Networks
Routing
Page Rank
Definitions
● Undirected / Directed
● Connected / Disconnected
● Complete (K5) / Bipartite and complete (K2,3)
● Cyclic / Acyclic
● Multigraph / Pseudograph
An undirected acyclic connected graph is a tree!
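The tree definition can be checked mechanically. The sketch below relies on a standard fact: an undirected graph with n vertices is a tree exactly when it is connected and has n - 1 edges. The `is_tree` helper and the sample data are illustrative assumptions; treating the empty graph as "not a tree" is a choice made here, not something stated in the talk.

```python
def is_tree(vertices, edges):
    """Return True if the undirected graph is connected and acyclic."""
    if not vertices:
        return False  # assumption: the empty graph is not considered a tree
    if len(edges) != len(vertices) - 1:
        return False  # too many edges means a cycle, too few means disconnected

    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    # Depth-first search from an arbitrary vertex to test connectivity.
    start = vertices[0]
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(vertices)


print(is_tree(["A", "B", "C"], [("A", "B"), ("B", "C")]))              # True
print(is_tree(["A", "B", "C"], [("A", "B"), ("B", "C"), ("C", "A")]))  # False
```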
What's wrong with MapReduce?
Every MapReduce run reads its input data from disk (e.g. HDFS),
computes the results and then writes them back to disk. Since most
graph algorithms are iterative, the whole dataset must be read from
and written to disk at every iteration.
It's better to use a distributed dataflow framework that can keep the data in memory across iterations.
GraphX is a graph processing system
built on top of Apache Spark
“Graph processing systems represent graph structured data as a property
graph, which associates user-defined properties with each vertex and edge.”
“The Spark storage abstraction called Resilient Distributed Datasets (RDDs)
enables applications to keep data in memory, which is essential for iterative
graph algorithms.”
“RDDs permit user-defined data partitioning, and the execution engine can
exploit this to co-partition RDDs and co-schedule tasks to avoid data
movement. This is essential for encoding partitioned graphs.”
Excerpt from GraphX: Graph Processing in a Distributed Dataflow Framework
https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf
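To make the property graph idea from the first excerpt concrete, here is a minimal pure-Python sketch: every vertex and every edge carries a dictionary of user-defined properties. It does not use the GraphX API; the sample data, the property names and the `triplets` helper are assumptions made for illustration.

```python
# Hypothetical sample data: the properties of vertices and edges are user-defined.
vertices = {
    1: {"name": "Alice", "role": "developer"},
    2: {"name": "Bob", "role": "tester"},
    3: {"name": "Carol", "role": "manager"},
}
edges = [
    (1, 2, {"relation": "works_with"}),
    (3, 1, {"relation": "manages"}),
    (3, 2, {"relation": "manages"}),
]


def triplets(vertices, edges):
    """Yield (source properties, edge properties, destination properties)."""
    for src, dst, props in edges:
        yield vertices[src], props, vertices[dst]


# Query the graph through its properties: who is managed by someone?
managed = [dst["name"] for src, edge, dst in triplets(vertices, edges)
           if edge["relation"] == "manages"]
print(managed)  # ['Alice', 'Bob']
```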
GraphX / Spark software stack
(image source: Spark site)
Graph Databases
● Storage
● Query Language
● Transactions
● Examples:
  ● Neo4j
  ● OrientDB
  ● Titan
Graph Processing Systems
● APIs for traversing and processing
● Better performance (in-memory data)
● Examples:
  ● GraphX
  ● Giraph
  ● GraphLab
Pregel is a computational model designed by Google
(https://kowshik.github.io/JPregel/pregel_paper.pdf)
It consists of a sequence of supersteps that runs until termination. In each superstep,
every vertex can:
● modify its state or that of its outgoing edges
● receive the messages sent to it during the previous superstep
● send messages to other vertices (they will be received in the next superstep)
● vote to halt
When a vertex votes to halt, it becomes inactive; if it receives a message in a later
superstep, the framework wakes it up and sets its state back to active.
The computation stops when all the vertices have voted to halt and no messages are in
transit; a maximum number of iterations can also be set.
Edges don't perform any computation.
When writing algorithms, you have to think like a vertex.
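The superstep loop can be simulated on a single machine. The sketch below is a pure-Python imitation of the model, not Google's or GraphX's API: it propagates the maximum vertex value through the graph, with explicit inboxes, vote-to-halt and wake-up on message. The function name `pregel_max`, its parameters and the sample graph are assumptions made for illustration.

```python
def pregel_max(values, out_edges, max_supersteps=50):
    """Propagate the maximum vertex value to every vertex.

    values:    {vertex: initial value}
    out_edges: {vertex: list of vertices it can send messages to}
    """
    values = dict(values)
    active = set(values)             # every vertex starts in the active state
    inbox = {v: [] for v in values}  # messages sent during the previous superstep
    superstep = 0

    while superstep < max_supersteps and (active or any(inbox.values())):
        outbox = {v: [] for v in values}
        for v in values:
            messages = inbox[v]
            if messages:
                active.add(v)        # an incoming message wakes up a halted vertex
            if v not in active:
                continue
            new_value = max([values[v]] + messages)
            changed = new_value > values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                # Tell the neighbours; they read this in the next superstep.
                for target in out_edges.get(v, []):
                    outbox[target].append(new_value)
            else:
                active.discard(v)    # nothing new to say: vote to halt
        inbox = outbox
        superstep += 1
    return values, superstep


# Hypothetical sample: an undirected chain A - B - C - D.
values = {"A": 3, "B": 6, "C": 2, "D": 1}
out_edges = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

result, supersteps = pregel_max(values, out_edges)
print(result)  # {'A': 6, 'B': 6, 'C': 6, 'D': 6}
```

Each vertex only looks at its own value and its own inbox, which is what "thinking like a vertex" means in practice.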
Pregel sample
Image source: Pregel paper
GraphX implementation of Pregel
GraphX uses three functions for implementing Pregel:
● vprog: the vertex program, run on each vertex that receives a message;
it takes the incoming message and computes the new vertex value
● sendMsg: the function used for sending messages to other vertices
● mergeMsg: a function that takes two incoming messages and merges
them into a single message (it must be commutative and associative)
Unlike Google's Pregel, the GraphX implementation of Pregel:
● leaves the message construction out of the vertex program, to allow
a more efficient distributed execution
● gives access to the attributes of both vertices of an edge while building the
messages
● constrains message sending to the graph structure (only to neighbours)
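The division of work among the three functions can be sketched in pure Python. This is an imitation of the programming model, not the GraphX API (in GraphX itself, sendMsg receives an edge triplet and returns an iterator of messages); the `pregel` and `shortest_paths` functions, their signatures and the sample graph are assumptions made for illustration. The example computes single-source shortest paths.

```python
import math


def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg,
           max_iterations=20):
    """Pure-Python imitation of a three-function Pregel loop.

    vertices: {vertex_id: attribute}
    edges:    list of (src_id, dst_id, edge_attribute)
    """
    # First superstep: every vertex receives the initial message.
    state = {vid: vprog(vid, attr, initial_msg) for vid, attr in vertices.items()}
    active = set(state)
    for _ in range(max_iterations):
        inbox = {}
        for src, dst, attr in edges:
            if src not in active and dst not in active:
                continue  # only edges touching an updated vertex can send
            # send_msg sees the attributes of both endpoints of the edge.
            for target, msg in send_msg(src, state[src], dst, state[dst], attr):
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break  # no messages in flight: the computation is over
        for vid, msg in inbox.items():
            state[vid] = vprog(vid, state[vid], msg)
        active = set(inbox)
    return state


def shortest_paths(vertex_ids, edges, source):
    """Distance from `source` to every vertex along directed, weighted edges."""
    def vprog(vid, dist, msg):
        return min(dist, msg)          # keep the best distance seen so far

    def send_msg(src, src_dist, dst, dst_dist, weight):
        if src_dist + weight < dst_dist:
            return [(dst, src_dist + weight)]  # offer a shorter path to dst
        return []

    start = {vid: (0.0 if vid == source else math.inf) for vid in vertex_ids}
    return pregel(start, edges, math.inf, vprog, send_msg, merge_msg=min)


# Hypothetical sample graph: (source, destination, weight).
edges = [(1, 2, 7.0), (1, 3, 2.0), (3, 2, 3.0), (2, 4, 1.0)]
distances = shortest_paths([1, 2, 3, 4], edges, source=1)
print(distances)  # {1: 0.0, 2: 5.0, 3: 2.0, 4: 6.0}
```

Here `min` works as the merge function because it is commutative and associative, so messages addressed to the same vertex can be combined in any order.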
GraphX Pregel communication diagram
When to use GraphX
GraphX is well suited for algorithms that:
● respect the neighborhood structure
GraphX is NOT well suited for algorithms that:
● need interaction among distant vertices
● change the structure of the graph
Algorithms out of the box (as of Spark v1.5.1):
- Connected Components
- Label Propagation
- PageRank
- SVD++
- Shortest Paths
- Strongly Connected Components
- Triangle Count
Now some code!
Questions & Answers
Leave your feedback on Joind.in!
https://m.joind.in/event/codemotion-milan-2015



Editor's Notes

  • #3: Questions for the audience: - Who knows what a graph is? - Who has ever used one? - Who knows the most common algorithms (BFS, DFS, Dijkstra)? - Who knows Scala?
  • #4: Vertices and edges.
  • #5: Triangle counting for grouping. Commercial interest: targeted offers for groups with the same interests.
  • #6: Vertices = intersections. Edges = roads. Shortest-path algorithm (Dijkstra), where the edges have several weights: typically distance, traffic, toll payment, etc.
  • #7: Pages = vertices. Edges = incoming links. Each outgoing edge has a weight tied to that of its vertex; the larger the sum of the values of the incoming edges, the larger the weight of the vertex. Iterative algorithm.
  • #9: Directed / undirected.
  • #10: Connected / disconnected.
  • #11: K is the standard notation for this kind of graph. A bipartite graph is useful for e-commerce, where you have all the user nodes that can buy any of the product nodes.
  • #12: Cyclic / acyclic (i.e. without cycles).
  • #13: Multigraph: when there can be several edges with the same source and the same destination. Pseudograph: when an edge can have the same vertex as both source and destination.
  • #14: When I said that graphs are everywhere, this is mainly why!
  • #15: Here we are talking about large graphs, which do not fit in the RAM of a single PC.
  • #16: The graph shown is a multi-pseudograph. ????? internal representation?
  • #17: Unlike Spark, which offers APIs in Scala, Java and Python, GraphX offers them only in Scala; however, the others should become available in the near future.
  • #19: Gremlin graph query language (TinkerPop). Gremlin is a DSL for traversing property graphs. Neo4j uses the (proprietary) Cypher as its native query language. Titan is a graph database that supports the following storage backends: - Cassandra (column) - HBase (column) - BerkeleyDB (key-value)
  • #21: Let's imagine we have a value for each vertex and want to find the maximum value in the whole graph. With this computational model, the idea is that we have to propagate information between the nodes. In each superstep, every vertex that has received a value higher than its own sends it to all its neighbours. When no vertex changes any more, the algorithm has terminated.
  • #22: Commutative: 2 + 3 == 3 + 2. Associative: (2 + 3) + 4 == 2 + (3 + 4)
  • #32: JetBrains prize draw