Introducing Apache Giraph for
 Large Scale Graph Processing

 Sebastian Schelter

 PhD student at the Database Systems and Information
 Management Group of TU Berlin

 Committer and PMC member at Apache Mahout and Apache Giraph

 mail ssc@apache.org blog http://ssc.io
Graph recap
graph: abstract representation of a set of objects
(vertices), where some pairs of these objects are
connected by links (edges), which can be directed or
undirected

Graphs can be used to model arbitrary things like
road networks, social networks, flows of goods, etc.

The majority of graph algorithms are iterative and traverse
the graph in some way.

[Figure: example graph with the vertices A, B, C and D]
Real world graphs are really large!
• the World Wide Web has several billion pages
  with several billion links
• Facebook's social graph had more than 700
  million users and more than 68 billion
  friendships in 2011
• Twitter's social graph has billions of follower
  relationships
Why not use MapReduce/Hadoop?
• Example: PageRank, Google's famous algorithm for
  measuring the authority of a webpage based on the
  underlying network of hyperlinks
• defined recursively: each vertex distributes its
  authority to its neighbors in equal proportions


  $$ p_i = \sum_{j \in \{(j,i)\}} \frac{p_j}{d_j} $$
Textbook approach to PageRank in MapReduce
• PageRank p is the principal eigenvector of the Markov matrix
  M defined by the transition probabilities between web pages
• it can be obtained by iteratively multiplying an initial
  PageRank vector by M (power iteration)

  $$ p = M^k p_0 $$

[Figure: rows 1 to n of M are each multiplied with p_i, yielding p_{i+1}]
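The power iteration above can be sketched in a few lines of plain Java. This is a minimal sketch, not the MapReduce formulation: the class name `PowerIteration`, the 3-page matrix and the iteration count are hypothetical choices made for illustration, and the matrix is assumed to be column-stochastic (entry `m[i][j]` is the probability of moving from page j to page i).

```java
import java.util.Arrays;

final class PowerIteration {
    // One step p_{i+1} = M p_i: entry i of the result is "row i of M" times p.
    static double[] multiply(double[][] m, double[] p) {
        double[] next = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < p.length; j++) {
                sum += m[i][j] * p[j];
            }
            next[i] = sum;
        }
        return next;
    }

    // Applies M to p0 exactly k times, i.e. computes M^k p0.
    static double[] iterate(double[][] m, double[] p0, int k) {
        double[] p = p0.clone();
        for (int step = 0; step < k; step++) {
            p = multiply(m, p);
        }
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical 3-page web; every column sums to 1.
        double[][] m = {
            {0.0, 0.5, 0.0},
            {0.5, 0.0, 1.0},
            {0.5, 0.5, 0.0}
        };
        double[] p0 = {1.0 / 3, 1.0 / 3, 1.0 / 3};
        System.out.println(Arrays.toString(iterate(m, p0, 50)));
    }
}
```

For this particular matrix the vector approaches (2/9, 4/9, 3/9), which is the eigenvector of M for eigenvalue 1, normalized to sum to 1.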
Drawbacks
• Not intuitive: only crazy scientists
  think in matrices and eigenvectors
• Unnecessarily slow: each iteration is scheduled
  as a separate MapReduce job with lots of overhead
  – the graph structure is read from disk
  – the map output is spilled to disk
  – the intermediary result is written to HDFS
• Hard to implement: a join has to be implemented
  by hand, which is lots of work, and the best strategy
  is data dependent
Google Pregel
• distributed system especially developed for
  large scale graph processing
• intuitive API that lets you 'think like a vertex'
• Bulk Synchronous Parallel (BSP) as execution
  model
• fault tolerance by checkpointing
Bulk Synchronous Parallel (BSP)
[Figure: several processors running in parallel; a superstep consists of local computation, communication and barrier synchronization]
Vertex-centric BSP
• each vertex has an id, a value, a list of its adjacent vertex ids and the
  corresponding edge values
• each vertex is invoked in each superstep, can recompute its value and
  send messages to other vertices, which are delivered over superstep
  barriers
• advanced features: termination votes, combiners, aggregators, topology
  mutations
[Figure: vertex1, vertex2 and vertex3 shown in superstep i, superstep i + 1 and superstep i + 2]
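The vertex-centric model can be simulated sequentially to see how messages cross superstep barriers and how termination votes end the computation. The following is a minimal single-process sketch, not Pregel's or Giraph's API: the class `BspMaxValue`, the choice of maximum-value propagation as the vertex program, and the ring-shaped example graph are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BspMaxValue {
    static List<List<Integer>> emptyInboxes(int n) {
        List<List<Integer>> inboxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            inboxes.add(new ArrayList<>());
        }
        return inboxes;
    }

    // Runs supersteps until every vertex has voted to halt and no messages are pending.
    // outEdges[v] lists the target vertices of v. Returns the final vertex values.
    static int[] run(int[] initialValues, int[][] outEdges) {
        int n = initialValues.length;
        int[] value = initialValues.clone();
        boolean[] halted = new boolean[n];
        List<List<Integer>> inbox = emptyInboxes(n);
        int superstep = 0;
        while (true) {
            List<List<Integer>> nextInbox = emptyInboxes(n);
            boolean anyActive = false;
            for (int v = 0; v < n; v++) {
                List<Integer> messages = inbox.get(v);
                if (halted[v] && messages.isEmpty()) {
                    continue; // halted vertices are skipped unless a message arrives
                }
                halted[v] = false; // an incoming message reactivates the vertex
                anyActive = true;
                // "compute": adopt the largest value seen so far
                boolean changed = superstep == 0;
                for (int message : messages) {
                    if (message > value[v]) {
                        value[v] = message;
                        changed = true;
                    }
                }
                if (changed) {
                    for (int target : outEdges[v]) {
                        nextInbox.get(target).add(value[v]); // delivered in the next superstep
                    }
                } else {
                    halted[v] = true; // termination vote
                }
            }
            if (!anyActive) {
                break;
            }
            inbox = nextInbox; // barrier: messages become visible only now
            superstep++;
        }
        return value;
    }

    public static void main(String[] args) {
        // Hypothetical directed ring 0 -> 1 -> 2 -> 3 -> 0.
        int[][] ring = {{1}, {2}, {3}, {0}};
        System.out.println(Arrays.toString(run(new int[] {3, 6, 2, 1}, ring)));
    }
}
```

Swapping `inbox` for `nextInbox` only after all vertices have run is what plays the role of the barrier: no vertex ever sees a message sent in the same superstep.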
Master-slave architecture
• vertices are partitioned and
  assigned to workers
   – default: hash-partitioning
   – custom partitioning possible



• master assigns and coordinates,
  while workers execute vertices
  and communicate with each
  other

[Figure: a Master and three workers: Worker 1, Worker 2, Worker 3]
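Hash-partitioning can be illustrated with a one-line mapping from vertex id to worker index. This is a sketch under assumptions: the class `HashPartitioning`, the use of `long` vertex ids and the method name `workerFor` are hypothetical and not taken from Pregel or Giraph.

```java
final class HashPartitioning {
    // Maps a vertex id to a worker index in the range [0, numWorkers).
    // Math.floorMod keeps the result non-negative even for negative hash codes.
    static int workerFor(long vertexId, int numWorkers) {
        return Math.floorMod(Long.hashCode(vertexId), numWorkers);
    }

    public static void main(String[] args) {
        int numWorkers = 3;
        for (long vertexId = 0; vertexId < 6; vertexId++) {
            System.out.println("vertex " + vertexId + " -> worker " + workerFor(vertexId, numWorkers));
        }
    }
}
```

Because the mapping depends only on the id, every worker can compute the destination of a message locally, without a lookup table.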
PageRank in Pregel
class PageRankVertex {
  void compute(Iterator messages) {
    if (getSuperstep() > 0) {
      // recompute own PageRank from the neighbors' messages:
      // p_i = sum over all edges (j, i) of p_j / d_j
      pageRank = sum(messages);
      setVertexValue(pageRank);
    }
    if (getSuperstep() < k) {
      // send updated PageRank to each neighbor
      sendMessageToAllNeighbors(pageRank / getNumOutEdges());
    } else {
      voteToHalt(); // terminate
    }
  }
}
PageRank toy example
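[Figure: input graph with the vertices A, B and C, annotated with vertex values and messages for each superstep]

Vertex values (left to right in the figure):
Superstep 0: .33  .33  .33
Superstep 1: .17  .50  .34
Superstep 2: .25  .43  .34

Messages sent in each superstep:
Superstep 0: .17  .33  .17  .17  .17
Superstep 1: .25  .34  .09  .25  .09
Superstep 2: .22  .34  .13  .22  .13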
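The toy example can be replayed with a small simulation of the compute() logic from the previous slide. The edge set used here (A→B, A→C, B→A, B→C, C→B) is an assumption: it is not stated in the text but is consistent with the numbers of the example. The class `PageRankToy` is hypothetical, and messages to the same vertex are summed directly instead of being kept in a list.

```java
import java.util.Arrays;

final class PageRankToy {
    // Runs supersteps 0..k and returns the vertex values after superstep k.
    // outEdges[v] lists the target vertices of v.
    static double[] run(int[][] outEdges, int k) {
        int n = outEdges.length;
        double[] pageRank = new double[n];
        Arrays.fill(pageRank, 1.0 / n); // every vertex starts with the same value
        double[] inbox = new double[n]; // sum of the messages received per vertex
        for (int superstep = 0; superstep <= k; superstep++) {
            double[] nextInbox = new double[n];
            for (int v = 0; v < n; v++) {
                if (superstep > 0) {
                    pageRank[v] = inbox[v]; // recompute from the neighbors' messages
                }
                if (superstep < k) {
                    for (int target : outEdges[v]) {
                        nextInbox[target] += pageRank[v] / outEdges[v].length;
                    }
                } // else: the vertex would vote to halt
            }
            inbox = nextInbox; // superstep barrier
        }
        return pageRank;
    }

    public static void main(String[] args) {
        // Assumed edges: A = 0, B = 1, C = 2.
        int[][] outEdges = {{1, 2}, {0, 2}, {1}};
        for (int k = 0; k <= 2; k++) {
            System.out.println("after superstep " + k + ": " + Arrays.toString(run(outEdges, k)));
        }
    }
}
```

The simulation works with exact fractions (for example 1/6 and 5/12), so its results can differ from the figure in the last digit: the figure's values appear to be sums of messages that were already rounded to two decimals.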
Cool, where can I download it?
• Pregel is proprietary, but:
  – Apache Giraph is an open source
    implementation of Pregel
  – runs on standard Hadoop infrastructure
  – computation is executed in memory
  – can be a job in a pipeline (MapReduce, Hive)
  – uses Apache ZooKeeper for synchronization
Giraph's Hadoop usage

[Figure: a Hadoop cluster with JobTracker, NameNode and ZooKeeper; three TaskTrackers run two workers each, and a fourth TaskTracker runs the master and one worker]
Anatomy of an execution
Setup
• load the graph from disk
• assign vertices to workers
• validate workers' health

Compute
• assign messages to workers
• iterate on active vertices
• call the vertices' compute()

Synchronize
• send messages to workers
• compute aggregators
• checkpoint

Teardown
• write back result
• write back aggregators
Who is doing what?
• ZooKeeper: responsible for computation state
    – partition/worker mapping
    – global state: #superstep
    – checkpoint paths, aggregator values, statistics

• Master: responsible for coordination
    –   assigns partitions to workers
    –   coordinates synchronization
    –   requests checkpoints
    –   aggregates aggregator values
    –   collects health statuses

• Worker: responsible for vertices
    – invokes the compute() function of active vertices
    – sends, receives and assigns messages
    – computes local aggregation values
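The split between local aggregation on the workers and global aggregation on the master can be sketched for a simple sum. This is an illustration only: the class `SumAggregation` and its method names are hypothetical, and the sum is just one example of an aggregator.

```java
final class SumAggregation {
    // Worker side: fold the values of the locally held vertices into one partial sum.
    static double localSum(double[] localVertexValues) {
        double sum = 0.0;
        for (double value : localVertexValues) {
            sum += value;
        }
        return sum;
    }

    // Master side: combine the partial sums reported by all workers.
    static double globalSum(double[] partialSums) {
        double sum = 0.0;
        for (double partial : partialSums) {
            sum += partial;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical vertex values held by three workers.
        double[] partials = {
            localSum(new double[] {0.25, 0.25}),
            localSum(new double[] {0.5}),
            localSum(new double[] {})
        };
        System.out.println(globalSum(partials));
    }
}
```

Only one number per worker travels to the master, independent of how many vertices each worker holds.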
What do you have to implement?
• your algorithm as a Vertex
  – Subclass one of the many existing implementations:
    BasicVertex, MutableVertex, EdgeListVertex,
    HashMapVertex, LongDoubleFloatDoubleVertex,...
• a VertexInputFormat to read your graph
  – e.g. from a text file with adjacency lists like
    <vertex> <neighbor1> <neighbor2> ...
• a VertexOutputFormat to write back the result
  – e.g. <vertex> <pageRank>
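Reading one line of the adjacency-list format above comes down to splitting on whitespace. The sketch below shows only that parsing step in plain Java; it does not implement Giraph's VertexInputFormat, and the class `AdjacencyListParser`, the nested `Entry` type and the use of `long` ids are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class AdjacencyListParser {
    // Parsed form of one line: the vertex id followed by the ids of its neighbors.
    static final class Entry {
        final long vertexId;
        final List<Long> neighbors;

        Entry(long vertexId, List<Long> neighbors) {
            this.vertexId = vertexId;
            this.neighbors = neighbors;
        }
    }

    // Parses "<vertex> <neighbor1> <neighbor2> ..." separated by whitespace.
    // Throws IllegalArgumentException for blank lines and
    // NumberFormatException for tokens that are not numbers.
    static Entry parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("empty line");
        }
        String[] tokens = trimmed.split("\\s+");
        long vertexId = Long.parseLong(tokens[0]);
        List<Long> neighbors = new ArrayList<>();
        for (int i = 1; i < tokens.length; i++) {
            neighbors.add(Long.parseLong(tokens[i]));
        }
        return new Entry(vertexId, neighbors);
    }

    public static void main(String[] args) {
        Entry entry = parseLine("1 2 3");
        System.out.println(entry.vertexId + " -> " + entry.neighbors);
    }
}
```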
Starting a Giraph job
• no different from starting a Hadoop job:

  $ hadoop jar giraph-0.1-jar-with-dependencies.jar \
    o.a.g.GiraphRunner o.a.g.examples.ConnectedComponentsVertex \
    --inputFormat o.a.g.examples.IntIntNullIntTextInputFormat \
    --inputPath hdfs:///wikipedia/pagelinks.txt \
    --outputFormat o.a.g.examples.ComponentOutputFormat \
    --outputPath hdfs:///wikipedia/results/ \
    --workers 89 \
    --combiner o.a.g.examples.MinimumIntCombiner
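The job above is configured with a combiner, one of the advanced features mentioned earlier. The idea can be sketched as follows, assuming that a minimum combiner keeps only the smallest value among the messages addressed to the same vertex. That assumption is based on the combiner's name, not on Giraph's source, and the class `MinimumCombinerSketch` and the message layout are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class MinimumCombinerSketch {
    // Folds all messages addressed to the same target vertex into their minimum.
    // Each message is an int pair {targetVertexId, value}.
    static Map<Integer, Integer> combine(int[][] messages) {
        Map<Integer, Integer> combined = new HashMap<>();
        for (int[] message : messages) {
            combined.merge(message[0], message[1], Integer::min);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Three messages, two of them for vertex 1.
        int[][] messages = {{1, 5}, {1, 3}, {2, 7}};
        System.out.println(combine(messages));
    }
}
```

Combining is only safe when the vertex program needs just the combined value, as with a minimum or a sum, because the individual messages are no longer available afterwards.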
What's to come?
• Current and future work in Giraph
  – graduation from the incubator
  – out-of-core messaging
  – algorithms library

• 2-day workshop after Berlin Buzzwords
  – topic: 'Parallel Processing beyond MapReduce'
  – meet the developers of Giraph and Stratosphere
  http://berlinbuzzwords.de/content/workshops-berlin-buzzwords
Everything is a network!
Further resources
• Apache Giraph homepage
  http://incubator.apache.org/giraph

• Claudio Martella: “Apache Giraph: Distributed
  Graph Processing in the Cloud”
  http://prezi.com/9ake_klzwrga/apache-giraph-distributed-graph-processing-in-the-cloud/

• Malewicz et al.: “Pregel – a system for large scale
  graph processing”, PODC 09
  http://dl.acm.org/citation.cfm?id=1582723
Thank you.
Questions?
