Algorithmic optimizations for Dynamic Levelwise
PageRank (from STICD)
Techniques to optimize the pagerank algorithm usually fall into two categories. One tries to
reduce the work per iteration, and the other tries to reduce the number of iterations.
These goals are often at odds with one another. Skipping computation on vertices which
have already converged has the potential to save iteration time. Skipping in-identical
vertices, with the same in-links, helps reduce duplicate computations and thus could help
reduce iteration time. Road networks often have chains which can be short-circuited before
pagerank computation to improve performance. Final ranks of chain nodes can be easily
calculated. This could reduce both the iteration time, and the number of iterations. If a graph
has no dangling nodes, pagerank of each strongly connected component can be
computed in topological order. This could help reduce the iteration time, no. of iterations,
and also enable multi-iteration concurrency in pagerank computation. The combination of all
of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged
components whose ranks are unaffected can be skipped altogether.
Before starting any algorithmic optimization, a good monolithic pagerank implementation
needs to be set up. There are two ways (algorithmically) to think of the pagerank calculation.
One approach (push) is to find pagerank by pushing contributions to out-vertices. The push
method is somewhat easier to implement, and is described in this lecture. With this
approach, in each iteration, every vertex adds p×rn/dn to the rank of each vertex on its
outgoing edges, where p is the damping factor (0.85), rn is the rank of the (source) vertex in
the previous iteration, and dn is its out-degree. But, if a vertex has no out-going edges, it is
considered to have out-going edges to all vertices in the graph (including itself). This is
because a random surfer jumps to a random page upon visiting a page with no links, in order
to avoid the rank-sink effect. However, it requires multiple writes per source vertex, due to
the cumulation (+=) operation. The other approach (pull) is to pull contributions from
in-vertices. Here, the rank of a vertex in an iteration is calculated as c0 + pΣrn/dn, where c0 is
the common teleport contribution, p is the damping factor (0.85), rn is the previous rank of
vertex with an incoming edge, dn is the out-degree of the incoming-edge vertex, and N is the
total number of vertices in the graph. The common teleport contribution c0, calculated as
(1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex in the graph due
to the damping factor (1-p)/N, and teleport from dangling vertices (with no outgoing edges) in
the graph pΣrn/N (to avoid the rank-sink effect). This approach requires 2 additional
calculations per-vertex, i.e., non-teleport contribution of each vertex, and total teleport
contribution (to all vertices). However, it requires only 1 write per destination vertex. For this
experiment both of these approaches are assessed on a number of different graphs.
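As a concrete reference, the pull formulation above can be sketched as a single iteration step. The graph representation (in-edge lists plus an out-degree array) and all names here are illustrative, not the experiment's actual code:

```cpp
#include <cassert>
#include <vector>
using namespace std;

// One pull-based PageRank iteration: each vertex pulls p*r_n/d_n from its
// in-neighbours, plus the common teleport contribution c0. Only one write
// is performed per destination vertex.
vector<double> pagerankPullIteration(const vector<vector<int>>& inEdges,
                                     const vector<int>& outDegree,
                                     const vector<double>& rank,
                                     double p) {
  int N = (int)rank.size();
  // c0 = (1-p)/N + p * (sum of ranks of dangling vertices)/N.
  double c0 = (1 - p) / N;
  for (int u = 0; u < N; ++u)
    if (outDegree[u] == 0) c0 += p * rank[u] / N;
  vector<double> next(N);
  for (int v = 0; v < N; ++v) {
    double sum = 0;
    for (int u : inEdges[v]) sum += rank[u] / outDegree[u];
    next[v] = c0 + p * sum;
  }
  return next;
}
```

Iterating this to a fixed point preserves the total rank at 1, since dangling mass is redistributed to all vertices through c0.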
All graphs used are stored in the MatrixMarket (.mtx) file format, and obtained from the
SuiteSparse Matrix Collection. The experiment is implemented in C++, and compiled using
GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740 Rack
server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4 Synchronous
Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux release
7.9.2009 (Core). The execution time of each test case is measured using
std::chrono::high_resolution_clock. This is done 5 times for each test case, and timings
are averaged. Statistics of each test case are printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and necessary tables are set up with the help of the FILTER function to create the
charts.
While it might seem that the pull method would be a clear winner, the results indicate that
although pull is always faster than the push approach, the difference between the two depends
on the nature of the graph. The next step is to compare the performance between finding
pagerank using C++ DiGraph class directly (using arrays of edge-lists) vs its CSR
(Compressed Sparse Row) representation (contiguous). Using a CSR representation has
the potential for performance improvement due to information on vertices and edges being
stored contiguously.
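A minimal sketch of what the CSR layout looks like, assuming a simple offsets-plus-edges pair; the actual DiGraph and CSR classes may differ:

```cpp
#include <cassert>
#include <vector>
using namespace std;

// CSR: edge lists flattened into one contiguous array, with a per-vertex
// offsets array. Edges of vertex v are edges[offsets[v] .. offsets[v+1]).
struct Csr {
  vector<int> offsets;
  vector<int> edges;
};

Csr toCsr(const vector<vector<int>>& adj) {
  Csr g;
  g.offsets.reserve(adj.size() + 1);
  g.offsets.push_back(0);
  for (const auto& nbrs : adj) {
    g.edges.insert(g.edges.end(), nbrs.begin(), nbrs.end());
    g.offsets.push_back((int)g.edges.size());
  }
  return g;
}
```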
Adjusting Monolithic (Sequential) approach
Push Pull Class CSR
1. Performance of contribution-push based vs contribution-pull based PageRank.
2. Performance of C++ DiGraph class based vs CSR based PageRank (pull).
Next an experiment is conducted to assess the performance benefit of each algorithmic
optimization separately. For splitting graph by components optimization, the following
approaches are compared: pagerank without optimization, pagerank with vertices split by
components, and finally pagerank with components sorted in topological order. Components
of the graph are obtained using Kosaraju’s algorithm. Topological ordering is done by
representing the graph as a block-graph, where each component is represented as a vertex,
and cross-edges between components are represented as edges. This block-graph is then
topologically sorted, and this vertex-order in block-graph is used to reorder the components
in topological order. Vertices, and their respective edges, are simply reordered accordingly
before computing pagerank (no graph partitioning is done). Each approach was attempted
on a number of graphs. On a few graphs, splitting vertices by components provides a
speedup, but sorting components in topological order provides no additional speedup. For
road networks like germany_osm, which have only one component, the speedup is possibly
due to the vertex reordering caused by the dfs() required for splitting by
components.

For skipping in-identicals optimization, comparison is done with unoptimized
pagerank. In-identical vertices are obtained by hashing vertices by their in-vertex lists and
scanning for matches. Except the first in-identical vertex of an in-identicals-group, remaining
vertices are skipped during pagerank computation. After each iteration ends, rank of the first
in-identical vertex is copied to the remaining vertices of the in-identicals group. The vertices
to be skipped are marked with negative source-offset in CSR. On indochina-2004 graph,
skipping in-identicals provides a speedup of ~1.8, but on average provides no speedup for
other graphs. This is likely because indochina-2004 has a large number of in-identicals and
in-identical groups, although it has neither the highest in-identicals % nor the highest avg.
in-identical group size.

For skipping chains optimization, comparison is done with unoptimized
pagerank. It is important to note that a chain here means a set of
unidirectional links connecting one vertex to the next, without any additional edges.
Bi-directional links are not considered as chains. Chain vertices are obtained by traversing
2-degree vertices in both directions and marking visited ones. Except the first chain vertex of
a chains-group, remaining vertices are skipped during pagerank computation. After each
iteration ends, ranks of the remaining vertices in each chains-group are updated using the
(GP) formula c0×(1-p^n)/(1-p) + p^n×r, where c0 is the common teleport contribution, p is the
damping factor, n is the distance from the first chain vertex, and r is the rank of the first chain
vertex in the previous iteration. The vertices to be skipped are marked with negative
source-offset in CSR. On average, skipping chain vertices provides no speedup. This is
likely because most graphs don't have enough chains to provide an advantage. Road
networks do have chains, but they are bi-directional, and thus not considered here.

For skipping converged vertices optimization, the following approaches are compared:
pagerank without optimization, pagerank skipping converged vertices with re-check (in 2-16
turns), and pagerank skipping converged vertices after several turns (in 2-64 turns). Skip
with re-check (skip-check) approach skips the current iteration for a vertex if its rank for the
last two iterations match, and the current turn (iteration) is not a “check” turn. The check turn
is adjusted between 2-16 turns. Skip after turns (skip-after) skips all future iterations of a
vertex after its rank does not change for “after” turns. The after turns are adjusted between
2-64 turns. On average, neither skip-check, nor skip-after gives better speed than the default
(unoptimized) approach. This could be due to the unnecessary iterations added by
skip-check (mistakenly skipped), and increased memory accesses performed by skip-after
(tracking converged count).
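For the chain optimization, the (GP) formula follows from unrolling the per-vertex update r ← c0 + p×r, n times. A small sanity check, with hypothetical names:

```cpp
#include <cassert>
#include <cmath>

// Closed-form rank of a vertex n links down a unidirectional chain:
// applying r <- c0 + p*r n times telescopes into a geometric-progression
// sum, which is the formula used to update skipped chain vertices.
double chainRank(double c0, double p, int n, double r0) {
  return c0 * (1 - pow(p, n)) / (1 - p) + pow(p, n) * r0;
}
```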
Adjusting Monolithic optimizations (from STICD)
Split components Skip in-identicals Skip chains Skip converged
1. Performance benefit of PageRank with vertices split by components (pull, CSR).
2. Performance benefit of skipping in-identical vertices for PageRank (pull, CSR).
3. Performance benefit of skipping chain vertices for PageRank (pull, CSR).
4. Performance benefit of skipping converged vertices for PageRank (pull, CSR).
This experiment was for comparing performance of levelwise pagerank with various
min. compute sizes, ranging from 1 to 1E+7. Here, min. compute size is the minimum number
of nodes of each pagerank compute using the standard algorithm (monolithic). Each min. compute
size was attempted on different types of graphs, running each size 5 times per graph to get a
good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD optimizations
(using single-thread). Although there is no clear winner, it appears a min. compute size of 10
would be a good choice. Note that the levelwise approach does not make use of SIMD
instructions which are available on all modern hardware.
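The min. compute size merging might look as follows. This is a sketch under the assumption that consecutive components in topological order are simply concatenated until each group reaches the minimum size; the original implementation may differ:

```cpp
#include <cassert>
#include <vector>
using namespace std;

// Merge topologically ordered components into compute groups of at least
// minCompute vertices, so each monolithic pagerank call processes a
// reasonably large block (the last group may be smaller).
vector<vector<int>> groupComponents(const vector<vector<int>>& comps,
                                    size_t minCompute) {
  vector<vector<int>> groups;
  for (const auto& c : comps) {
    if (groups.empty() || groups.back().size() >= minCompute)
      groups.push_back({});
    groups.back().insert(groups.back().end(), c.begin(), c.end());
  }
  return groups;
}
```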
This experiment was for comparing performance between: monolithic pagerank, monolithic
pagerank skipping teleport, levelwise pagerank, levelwise pagerank skipping teleport. Each
approach was attempted on different types of graphs, running each approach 5 times per
graph to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD
optimizations (using single-thread).
Except for soc-LiveJournal1 and coPapersCiteseer, in all cases skipping teleport calculations
is slightly faster (the two exceptions could be measurement noise). The improvement is most
prominent in the case of road networks and certain web graphs.
Adjusting Levelwise (STICD) approach
Min. component size Min. compute size Skip teleport calculation
1. Comparing various min. component sizes for topologically-ordered components (levelwise...).
2. Comparing various min. compute sizes for topologically-ordered components (levelwise...).
3. Checking performance benefit of levelwise PageRank when teleport calculation is skipped.
Note: min. component size merges small components even before generating the block-graph /
topological ordering, while min. compute size does it before pagerank computation.
This experiment was for comparing performance between: pagerank with standard
algorithm (monolithic), pagerank in topologically-ordered components fashion (levelwise).
Both approaches were attempted on different types of graphs, running each approach 5
times per graph to get a good time measure. Levelwise pagerank is the STIC-D algorithm,
without ICD optimizations (using single-thread).
On average, levelwise pagerank is faster than the monolithic approach. Note that neither
approach makes use of SIMD instructions which are available on all modern hardware.
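The levelwise computation above can be sketched as monolithic pagerank applied per component in topological order. This assumes no dangling vertices (so the teleport term is just (1-p)/N, as levelwise requires) and uses illustrative names:

```cpp
#include <cassert>
#include <cmath>
#include <vector>
using namespace std;

// Levelwise sketch: components are processed in topological order, so ranks
// of earlier components are already final when a later component is
// computed, and each component can be iterated to convergence on its own.
void pagerankLevelwise(const vector<vector<int>>& inEdges,
                       const vector<int>& outDegree,
                       const vector<vector<int>>& comps,  // topological order
                       vector<double>& rank, double p, double tol) {
  int N = (int)rank.size();
  for (const auto& comp : comps) {
    double change;
    do {
      change = 0;
      for (int v : comp) {
        double sum = 0;
        for (int u : inEdges[v]) sum += rank[u] / outDegree[u];
        double r = (1 - p) / N + p * sum;  // no dangling-vertex teleport term
        change += fabs(r - rank[v]);
        rank[v] = r;
      }
    } while (change > tol);
  }
}
```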
Comparing Levelwise (STICD) approach
Monolithic nvGraph
Levelwise (STICD) vs
1. Performance of monolithic vs topologically-ordered components (levelwise) PageRank.
This experiment was for comparing performance between: static levelwise pagerank,
dynamic levelwise pagerank (process all components), dynamic levelwise pagerank skipping
unchanged components. Each approach was attempted on a number of graphs (fixed and
temporal), running each with multiple batch sizes (1, 5, 10, 50, ...). Levelwise pagerank is the
STIC-D algorithm, without ICD optimizations (using single-thread).
On average, skipping unchanged components is barely faster than not skipping.
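Skipping unchanged components might be implemented by marking components touched by the batch and propagating the mark through the block-graph; a sketch with hypothetical names and representation:

```cpp
#include <cassert>
#include <utility>
#include <vector>
using namespace std;

// After a batch of edge updates, components containing an endpoint of any
// changed edge are marked affected, and the mark is propagated to all
// downstream components through the block-graph (components indexed in
// topological order), since their ranks may change transitively.
vector<bool> affectedComponents(const vector<vector<int>>& blockAdj,
                                const vector<int>& compOf,
                                const vector<pair<int,int>>& batch) {
  vector<bool> affected(blockAdj.size(), false);
  for (const auto& e : batch) {
    affected[compOf[e.first]]  = true;
    affected[compOf[e.second]] = true;
  }
  // One forward pass suffices because components are in topological order.
  for (size_t c = 0; c < blockAdj.size(); ++c)
    if (affected[c])
      for (int d : blockAdj[c]) affected[d] = true;
  return affected;
}
```

Components left unmarked keep their previous ranks and are skipped altogether.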
Adjusting Levelwise (STICD) dynamic approach
Skip unaffected components For fixed graphs For temporal graphs
1. Checking for correctness of levelwise PageRank when unchanged components are skipped.
2. Perf. benefit of levelwise PageRank when unchanged components are skipped (fixed).
3. Perf. benefit of levelwise PageRank when unchanged components are skipped (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
This experiment was for comparing performance between: static pagerank using standard
algorithm (monolithic), static pagerank using levelwise algorithm, dynamic pagerank using
levelwise algorithm. Each approach was attempted on a number of graphs, running each
with multiple batch sizes (1, 5, 10, 50, ...). Each pagerank computation was run 5 times for
each approach to get a good time measure. Levelwise pagerank is the STIC-D algorithm,
without ICD optimizations (using single-thread).
Clearly, dynamic levelwise pagerank is faster than the static approach for many batch sizes.
Comparing dynamic approach with static
nvGraph dynamic Monolithic dynamic Levelwise dynamic
nvGraph static vs: temporal
Monolithic static vs: fixed, temporal vs: fixed, temporal
Levelwise static vs: fixed vs: fixed, temporal
1. Performance of nvGraph based static vs dynamic PageRank (temporal).
2. Performance of static vs dynamic PageRank (temporal).
3. Performance of static vs dynamic levelwise PageRank (fixed).
4. Performance of levelwise based static vs dynamic PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
This experiment was for comparing performance of levelwise CUDA pagerank with
various min. compute sizes, ranging from 1E+3 to 1E+7. Here, min. compute size is the
minimum number of nodes of each pagerank compute using the standard algorithm (monolithic CUDA).
Each min. compute size was attempted on different types of graphs, running each size 5
times per graph to get a good time measure. Levelwise pagerank is the STIC-D algorithm,
without ICD optimizations (using single-thread).
Although there is no clear winner, it appears a min. compute size of 5E+6 would be a good
choice.
Adjusting Levelwise (STICD) CUDA approach
Min. component size Min. compute size Skip teleport calculation
1. Min. component sizes for topologically-ordered components (levelwise, CUDA) PageRank.
2. Min. compute sizes for topologically-ordered components (levelwise CUDA) PageRank.
Note: min. component size merges small components even before generating the block-graph /
topological ordering, while min. compute size does it before pagerank computation.
This experiment was for comparing performance between: CUDA based pagerank with
standard algorithm (monolithic), CUDA based pagerank in topologically-ordered components
fashion (levelwise). Both approaches were attempted on different types of graphs, running
each approach 5 times per graph to get a good time measure. Levelwise pagerank is the
STIC-D algorithm, without ICD optimizations (using single-thread).
On average, levelwise pagerank performs about the same as the monolithic approach.
Comparing Levelwise (STICD) CUDA approach
nvGraph Monolithic CUDA
Monolithic vs vs
Monolithic CUDA vs
Levelwise CUDA vs vs
1. Performance of sequential execution based vs CUDA based PageRank (pull, CSR).
2. Performance of nvGraph vs CUDA based PageRank (pull, CSR).
3. Performance of Monolithic CUDA vs Levelwise CUDA PageRank (pull, CSR, ...).
This experiment was for comparing the performance between: static pagerank of updated
graph, dynamic pagerank of updated graph. Both techniques were attempted on different
temporal graphs, updating each graph with multiple batch sizes (1, 5, 10, 50, ...). New edges
are incrementally added to the graph batch-by-batch until the entire graph is complete.
Dynamic pagerank is clearly faster than the static approach for many batch sizes.
Comparing dynamic CUDA approach with static
nvGraph dynamic Monolithic dynamic Levelwise dynamic
nvGraph static vs: fixed, temporal vs: fixed, temporal vs: fixed, temporal
Monolithic static vs: fixed, temporal vs: fixed, temporal vs: fixed, temporal
Levelwise static vs: fixed, temporal vs: fixed, temporal vs: fixed, temporal
1. Performance of static vs dynamic CUDA based PageRank (fixed).
2. Performance of static vs dynamic CUDA based PageRank (temporal).
3. Performance of CUDA based static vs dynamic levelwise PageRank (fixed).
4. Performance of static vs dynamic CUDA based levelwise PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
This experiment was for comparing performance between: static pagerank of updated graph
using nvGraph, dynamic pagerank of updated graph using nvGraph, static monolithic CUDA
based pagerank of updated graph, dynamic monolithic CUDA based pagerank of updated
graph, static levelwise CUDA based pagerank of updated graph, dynamic levelwise CUDA
based pagerank of updated graph. Each approach was attempted on a number of graphs,
running each with multiple batch sizes (1, 5, 10, 50, ...). Each batch size was run with 5
different updates to the graph, and each specific update was run 5 times for each approach
to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD
optimizations.
Indeed, dynamic levelwise pagerank is faster than the static approach for many batch sizes.
In order to measure error, nvGraph pagerank is taken as a reference.
Comparing dynamic optimized CUDA approach with static
nvGraph dynamic Monolithic dynamic Levelwise dynamic
nvGraph static vs: fixed vs: fixed vs: fixed
Monolithic static vs: fixed vs: fixed vs: fixed
Levelwise static vs: fixed vs: fixed vs: fixed
1. Performance of CUDA based optimized dynamic monolithic vs levelwise PageRank (fixed).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.

More Related Content

PDF
Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...
PDF
Exploring optimizations for dynamic pagerank algorithm based on CUDA : V3
PDF
Rank adjustment strategies for Dynamic PageRank : REPORT
PDF
Exploring optimizations for dynamic PageRank algorithm based on GPU : V4
PDF
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
PDF
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
PDF
PageRank Experiments : SHORT REPORT / NOTES
PDF
STIC-D: algorithmic techniques for efficient parallel pagerank computation on...
Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...
Exploring optimizations for dynamic pagerank algorithm based on CUDA : V3
Rank adjustment strategies for Dynamic PageRank : REPORT
Exploring optimizations for dynamic PageRank algorithm based on GPU : V4
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
PageRank Experiments : SHORT REPORT / NOTES
STIC-D: algorithmic techniques for efficient parallel pagerank computation on...

Similar to Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT REPORT / NOTES (20)

PDF
Incremental Page Rank Computation on Evolving Graphs : NOTES
PDF
Benchmarking tool for graph algorithms
PDF
I/O-Efficient Techniques for Computing Pagerank : NOTES
PDF
Benchmarking Tool for Graph Algorithms
PDF
Adjusting PageRank parameters and comparing results : REPORT
PDF
Adjusting Bitset for graph : SHORT REPORT / NOTES
PPTX
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
PDF
Adjusting PageRank parameters and comparing results : REPORT
PDF
50120140506002
PDF
Incremental Page Rank Computation on Evolving Graphs
PPTX
ppt 1.pptx
PDF
Adjusting PageRank parameters and comparing results : REPORT
PPTX
UNIT III.pptx
PDF
Graph Data Structure
PPTX
Data Structures and Agorithm: DS 21 Graph Theory.pptx
PDF
graph representation.pdf
PPTX
GRAPH - DISCRETE STRUCTURE AND ALGORITHM
PPT
Data Structures-Non Linear DataStructures-Graphs
Incremental Page Rank Computation on Evolving Graphs : NOTES
Benchmarking tool for graph algorithms
I/O-Efficient Techniques for Computing Pagerank : NOTES
Benchmarking Tool for Graph Algorithms
Adjusting PageRank parameters and comparing results : REPORT
Adjusting Bitset for graph : SHORT REPORT / NOTES
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
Adjusting PageRank parameters and comparing results : REPORT
50120140506002
Incremental Page Rank Computation on Evolving Graphs
ppt 1.pptx
Adjusting PageRank parameters and comparing results : REPORT
UNIT III.pptx
Graph Data Structure
Data Structures and Agorithm: DS 21 Graph Theory.pptx
graph representation.pdf
GRAPH - DISCRETE STRUCTURE AND ALGORITHM
Data Structures-Non Linear DataStructures-Graphs
Ad

More from Subhajit Sahu (20)

PDF
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...
PDF
Adjusting primitives for graph : SHORT REPORT / NOTES
PDF
Experiments with Primitive operations : SHORT REPORT / NOTES
PDF
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
PDF
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
PDF
Shared memory Parallelism (NOTES)
PDF
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
PDF
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
PDF
Application Areas of Community Detection: A Review : NOTES
PDF
Community Detection on the GPU : NOTES
PDF
Survey for extra-child-process package : NOTES
PDF
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
PDF
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
PDF
Fast Incremental Community Detection on Dynamic Graphs : NOTES
PDF
Can you fix farming by going back 8000 years : NOTES
PDF
HITS algorithm : NOTES
PDF
Basic Computer Architecture and the Case for GPUs : NOTES
PDF
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
PDF
Are Satellites Covered in Gold Foil : NOTES
PDF
Taxation for Traders < Markets and Taxation : NOTES
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...
Adjusting primitives for graph : SHORT REPORT / NOTES
Experiments with Primitive operations : SHORT REPORT / NOTES
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
Shared memory Parallelism (NOTES)
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
Application Areas of Community Detection: A Review : NOTES
Community Detection on the GPU : NOTES
Survey for extra-child-process package : NOTES
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Fast Incremental Community Detection on Dynamic Graphs : NOTES
Can you fix farming by going back 8000 years : NOTES
HITS algorithm : NOTES
Basic Computer Architecture and the Case for GPUs : NOTES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Are Satellites Covered in Gold Foil : NOTES
Taxation for Traders < Markets and Taxation : NOTES
Ad

Recently uploaded (20)

PPTX
Introduction to Knowledge Engineering Part 1
PDF
Clinical guidelines as a resource for EBP(1).pdf
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PPT
Reliability_Chapter_ presentation 1221.5784
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PPTX
1_Introduction to advance data techniques.pptx
PPTX
climate analysis of Dhaka ,Banglades.pptx
PDF
Lecture1 pattern recognition............
PPTX
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
PPTX
Supervised vs unsupervised machine learning algorithms
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PDF
Fluorescence-microscope_Botany_detailed content
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PPT
Quality review (1)_presentation of this 21
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPTX
Introduction to machine learning and Linear Models
Introduction to Knowledge Engineering Part 1
Clinical guidelines as a resource for EBP(1).pdf
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
Reliability_Chapter_ presentation 1221.5784
STUDY DESIGN details- Lt Col Maksud (21).pptx
1_Introduction to advance data techniques.pptx
climate analysis of Dhaka ,Banglades.pptx
Lecture1 pattern recognition............
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
Supervised vs unsupervised machine learning algorithms
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
Fluorescence-microscope_Botany_detailed content
Galatica Smart Energy Infrastructure Startup Pitch Deck
Quality review (1)_presentation of this 21
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Introduction to machine learning and Linear Models

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT REPORT / NOTES

  • 1. Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether. Before starting any algorithmic optimization, a good monolithic pagerank implementation needs to be set up. There are two ways (algorithmically) to think of the pagerank calculation. One approach (push) is to find pagerank by pushing contributions to out-vertices. The push method is somewhat easier to implement, and is described in this lecture. With this approach, in an iteration for each vertex, the ranks of vertices connected to its outgoing edge are cumulated with p×rn, where p is the damping factor (0.85), and rn is the rank of the (source) vertex in the previous iteration. But, if a vertex has no out-going edges, it is considered to have out-going edges to all vertices in the graph (including itself). 
This is because a random surfer jumps to a random page upon visiting a page with no links, in order to avoid the rank-sink effect. However, it requires multiple writes per source vertex, due to the cumulation (+=) operation. The other approach (pull) is to pull contributions from in-vertices. Here, the rank of a vertex in an iteration is calculated as c0 + pΣrn/dn, where c0 is the common teleport contribution, p is the damping factor (0.85), rn is the previous rank of vertex with an incoming edge, dn is the out-degree of the incoming-edge vertex, and N is the total number of vertices in the graph. The common teleport contribution c0, calculated as (1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex in the graph due to the damping factor (1-p)/N, and teleport from dangling vertices (with no outgoing edges) in the graph pΣrn/N (to avoid the rank-sink effect). This approach requires 2 additional calculations per-vertex, i.e., non-teleport contribution of each vertex, and total teleport contribution (to all vertices). However, it requires only 1 write per destination vertex. For this experiment both of these approaches are assessed on a number of different graphs. All graphs used are stored in the MatrixMarket (.mtx) file format, and obtained from the SuiteSparse Matrix Collection. The experiment is implemented in C++, and compiled using GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740 Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux release 7.9.2009 (Core). The execution time of each test case is measured using std::chrono::high_performance_timer. This is done 5 times for each test case, and timings
  • 2. are averaged. Statistics of each test case is printed to standard output (stdout), and redirected to a log file, which is then processed with a script to generate a CSV file, with each row representing the details of a single test case. This CSV file is imported into Google Sheets, and necessary tables are set up with the help of the FILTER function to create the charts. While it might seem that the pull method would be a clear winner, the results indicate that although pull is always faster than push approach, the difference between the two depends on the nature of the graph. The next step is to compare the performance between finding pagerank using C++ DiGraph class directly (using arrays of edge-lists) vs its CSR (Compressed Sparse Row) representation (contiguous). Using a CSR representation has the potential for performance improvement due to information on vertices and edges being stored contiguously. Adjusting Monolithic (Sequential) approach Push Pull Class CSR 1. Performance of contribution-push based vs contribution-pull based PageRank. 2. Performance of C++ DiGraph class based vs CSR based PageRank (pull). Next an experiment is conducted to assess the performance benefit of each algorithmic optimization separately. For splitting graph by components optimization, the following approaches are compared: pagerank without optimization, pagerank with vertices split by components, and finally pagerank with components sorted in topological order. Components of the graph are obtained using Kosaraju’s algorithm. Topological ordering is done by representing the graph as a block-graph, where each component is represented as a vertex, and cross-edges between components are represented as edges. This block-graph is then topologically sorted, and this vertex-order in block-graph is used to reorder the components in topological order. Vertices, and their respective edges are accordingly simply reordered before computing pagerank (no graph partitioning is done). 
Each approach was attempted on a number of graphs. On a few graphs, splitting vertices by components provides a speedup, but sorting components in topological order provides no additional speedup. For road networks, like germany_osm which only have one component, the speedup is possibly because of the vertex reordering caused by dfs() which is required for splitting by components. For skipping in-identicals optimization, comparison is done with unoptimized pagerank. In-identical vertices are obtained by scanning matching edges of a vertex by in-vertices hash. Except the first in-identical vertex of an in-identicals-group, remaining vertices are skipped during pagerank computation. After each iteration ends, rank of the first in-identical vertex is copied to the remaining vertices of the in-identicals group. The vertices to be skipped are marked with negative source-offset in CSR. On indochina-2004 graph, skipping in-identicals provides a speedup of ~1.8, but on average provides no speedup for other graphs. This is likely due to the fact that the graph indochina-2004 has a large number of in-identicals and in-identical groups. Although, it doesn't have the highest in-identicals % or the highest avg. in-identical group size. For skipping chains optimization, comparison is
  • 3. done with unoptimized pagerank. It is important to note that a chain here means a set of unidirectional links connecting one vertex to the next, without any additional edges; bi-directional links are not considered chains. Chain vertices are obtained by traversing 2-degree vertices in both directions and marking visited ones. Except for the first chain vertex of a chains group, the remaining vertices are skipped during pagerank computation. After each iteration ends, the ranks of the remaining vertices in each chains group are updated using the (GP) formula c0×(1−p^n)/(1−p) + p^n×r, where c0 is the common teleport contribution, p is the damping factor, n is the distance from the first chain vertex, and r is the rank of the first chain vertex in the previous iteration. The vertices to be skipped are marked with a negative source-offset in the CSR. On average, skipping chain vertices provides no speedup. This is likely because most graphs don't have enough chains to provide an advantage; road networks do have chains, but they are bi-directional, and thus not considered here. For the skipping converged vertices optimization, the following approaches are compared: pagerank without optimization, pagerank skipping converged vertices with re-check (in 2-16 turns), and pagerank skipping converged vertices after several turns (in 2-64 turns). The skip with re-check (skip-check) approach skips the current iteration for a vertex if its rank did not change over the last two iterations and the current turn (iteration) is not a "check" turn; the check turn is adjusted between 2-16 turns. Skip after turns (skip-after) skips all future iterations of a vertex after its rank does not change for "after" turns; the after turns are adjusted between 2-64 turns. On average, neither skip-check nor skip-after gives better speed than the default (unoptimized) approach.
This could be due to the unnecessary iterations added by skip-check (vertices mistakenly skipped), and the increased memory accesses performed by skip-after (tracking the converged count).

Adjusting Monolithic optimizations (from STICD): Split components, Skip in-identicals, Skip chains, Skip converged
1. Performance benefit of PageRank with vertices split by components (pull, CSR).
2. Performance benefit of skipping in-identical vertices for PageRank (pull, CSR).
3. Performance benefit of skipping chain vertices for PageRank (pull, CSR).
4. Performance benefit of skipping converged vertices for PageRank (pull, CSR).

This experiment was for comparing the performance of levelwise pagerank with various min. compute sizes, ranging from 1 to 1E+7. Here, min. compute size is the minimum number of nodes in each pagerank compute using the standard (monolithic) algorithm. Each min. compute size was attempted on different types of graphs, running each size 5 times per graph to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD optimizations (single-threaded). Although there is no clear winner, it appears a min. compute size of 10 would be a good choice. Note that the levelwise approach does not make use of SIMD instructions, which are available on all modern hardware.

This experiment was for comparing the performance of: monolithic pagerank, monolithic pagerank skipping teleport, levelwise pagerank, and levelwise pagerank skipping teleport. Each approach was attempted on different types of graphs, running each approach 5 times per
  • 4. graph to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD optimizations (single-threaded). Except for soc-LiveJournal1 and coPapersCiteseer, in all cases skipping teleport calculations is slightly faster (the two exceptions could be random variation). The improvement is most prominent in the case of road networks and certain web graphs.

Adjusting Levelwise (STICD) approach: Min. component size, Min. compute size, Skip teleport calculation
1. Comparing various min. component sizes for topologically-ordered components (levelwise...).
2. Comparing various min. compute sizes for topologically-ordered components (levelwise...).
3. Checking performance benefit of levelwise PageRank when teleport calculation is skipped.
Note: min. component size merges small components even before generating the block-graph / topological ordering, but min. compute size does it just before pagerank computation.

This experiment was for comparing the performance of: pagerank with the standard algorithm (monolithic), and pagerank in topologically-ordered components fashion (levelwise). Both approaches were attempted on different types of graphs, running each approach 5 times per graph to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD optimizations (single-threaded). On average, levelwise pagerank is faster than the monolithic approach. Note that neither approach makes use of SIMD instructions, which are available on all modern hardware.

Comparing Levelwise (STICD) approach: Monolithic vs nvGraph vs Levelwise (STICD)
1. Performance of monolithic vs topologically-ordered components (levelwise) PageRank.

This experiment was for comparing the performance of: static levelwise pagerank, dynamic levelwise pagerank (processing all components), and dynamic levelwise pagerank skipping unchanged components. Each approach was attempted on a number of graphs (fixed and temporal), running each with multiple batch sizes (1, 5, 10, 50, ...).
Levelwise pagerank is the STIC-D algorithm, without ICD optimizations (single-threaded). On average, skipping unchanged components is only barely faster than not skipping.
  • 5. Adjusting Levelwise (STICD) dynamic approach: Skip unaffected components (for fixed graphs, for temporal graphs)
1. Checking for correctness of levelwise PageRank when unchanged components are skipped.
2. Perf. benefit of levelwise PageRank when unchanged components are skipped (fixed).
3. Perf. benefit of levelwise PageRank when unchanged components are skipped (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates; temporal ⇒ batches of edge updates from temporal graphs.

This experiment was for comparing the performance of: static pagerank using the standard algorithm (monolithic), static pagerank using the levelwise algorithm, and dynamic pagerank using the levelwise algorithm. Each approach was attempted on a number of graphs, running each with multiple batch sizes (1, 5, 10, 50, ...). Each pagerank computation was run 5 times per approach to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD optimizations (single-threaded). Clearly, dynamic levelwise pagerank is faster than the static approach for many batch sizes.

Comparing dynamic approach with static (vs nvGraph / Monolithic / Levelwise dynamic):
nvGraph static: vs: temporal
Monolithic static: vs: fixed, temporal; vs: fixed, temporal
Levelwise static: vs: fixed; vs: fixed, temporal
1. Performance of nvGraph based static vs dynamic PageRank (temporal).
2. Performance of static vs dynamic PageRank (temporal).
3. Performance of static vs dynamic levelwise PageRank (fixed).
4. Performance of levelwise based static vs dynamic PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates; temporal ⇒ batches of edge updates from temporal graphs.

This experiment was for comparing the performance of levelwise CUDA pagerank with various min. compute sizes, ranging from 1E+3 to 1E+7. Here, min. compute size is the minimum number of nodes in each pagerank compute using the standard algorithm (monolithic CUDA). Each min.
compute size was attempted on different types of graphs, running each size 5 times per graph to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD optimizations (single-threaded).
  • 6. Although there is no clear winner, it appears a min. compute size of 5E+6 would be a good choice.

Adjusting Levelwise (STICD) CUDA approach: Min. component size, Min. compute size, Skip teleport calculation
1. Min. component sizes for topologically-ordered components (levelwise, CUDA) PageRank.
2. Min. compute sizes for topologically-ordered components (levelwise CUDA) PageRank.
Note: min. component size merges small components even before generating the block-graph / topological ordering, but min. compute size does it just before pagerank computation.

This experiment was for comparing the performance of: CUDA based pagerank with the standard algorithm (monolithic), and CUDA based pagerank in topologically-ordered components fashion (levelwise). Both approaches were attempted on different types of graphs, running each approach 5 times per graph to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD optimizations (single-threaded). On average, levelwise pagerank performs the same as the monolithic approach.

Comparing Levelwise (STICD) CUDA approach (Monolithic, nvGraph, Monolithic CUDA, Levelwise CUDA):
1. Performance of sequential execution based vs CUDA based PageRank (pull, CSR).
2. Performance of nvGraph vs CUDA based PageRank (pull, CSR).
3. Performance of Monolithic CUDA vs Levelwise CUDA PageRank (pull, CSR, ...).

This experiment was for comparing the performance of: static pagerank of the updated graph, and dynamic pagerank of the updated graph. Both techniques were attempted on different temporal graphs, updating each graph with multiple batch sizes (1, 5, 10, 50, ...). New edges are incrementally added to the graph batch-by-batch until the entire graph is complete. Dynamic pagerank is clearly faster than the static approach for many batch sizes.
  • 7. Comparing dynamic CUDA approach with static: each of nvGraph static, Monolithic static, and Levelwise static is compared against each of nvGraph dynamic, Monolithic dynamic, and Levelwise dynamic, on both fixed and temporal graphs.
1. Performance of static vs dynamic CUDA based PageRank (fixed).
2. Performance of static vs dynamic CUDA based PageRank (temporal).
3. Performance of CUDA based static vs dynamic levelwise PageRank (fixed).
4. Performance of static vs dynamic CUDA based levelwise PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates; temporal ⇒ batches of edge updates from temporal graphs.

This experiment was for comparing the performance of: static pagerank of the updated graph using nvGraph, dynamic pagerank of the updated graph using nvGraph, static monolithic CUDA based pagerank, dynamic monolithic CUDA based pagerank, static levelwise CUDA based pagerank, and dynamic levelwise CUDA based pagerank of the updated graph. Each approach was attempted on a number of graphs, running each with multiple batch sizes (1, 5, 10, 50, ...). Each batch size was run with 5 different updates to the graph, and each specific update was run 5 times per approach to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD optimizations. Indeed, dynamic levelwise pagerank is faster than the static approach for many batch sizes. In order to measure error, nvGraph pagerank is taken as the reference.

Comparing dynamic optimized CUDA approach with static: each of nvGraph static, Monolithic static, and Levelwise static is compared against each of nvGraph dynamic, Monolithic dynamic, and Levelwise dynamic (fixed graphs).
1. Performance of CUDA based optimized dynamic monolithic vs levelwise PageRank (fixed).
Note: fixed ⇒ static graphs with batches of random edge updates; temporal ⇒ batches of edge updates from temporal graphs.