Algorithmic optimizations for Dynamic Levelwise
PageRank (from STICD)
Techniques to optimize the pagerank algorithm usually fall into two categories. One tries to
reduce the work per iteration, and the other tries to reduce the number of iterations.
These goals are often at odds with one another. Skipping computation on vertices which
have already converged has the potential to save iteration time. Skipping in-identical
vertices, with the same in-links, helps reduce duplicate computations and thus could help
reduce iteration time. Road networks often have chains which can be short-circuited before
pagerank computation to improve performance. Final ranks of chain nodes can be easily
calculated. This could reduce both the iteration time, and the number of iterations. If a graph
has no dangling nodes, pagerank of each strongly connected component can be
computed in topological order. This could help reduce the iteration time, no. of iterations,
and also enable multi-iteration concurrency in pagerank computation. The combination of all
of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged
components whose ranks are unaffected can be skipped altogether.
Before starting any algorithmic optimization, a good monolithic pagerank implementation
needs to be set up. There are two ways (algorithmically) to think of the pagerank calculation.
One approach (push) is to find pagerank by pushing contributions to out-vertices. The push
method is somewhat easier to implement, and is described in this lecture. With this
approach, in each iteration, every vertex adds p×rn/dn to the rank of each vertex on its
outgoing edges, where p is the damping factor (0.85), rn is the rank of the (source) vertex in
the previous iteration, and dn is its out-degree. But, if a vertex has no out-going edges, it is
considered to have out-going edges to all vertices in the graph (including itself). This is
because a random surfer jumps to a random page upon visiting a page with no links, in order
to avoid the rank-sink effect. However, it requires multiple writes per source vertex, due to
the cumulation (+=) operation. The other approach (pull) is to pull contributions from
in-vertices. Here, the rank of a vertex in an iteration is calculated as c0 + pΣrn/dn, where c0 is
the common teleport contribution, p is the damping factor (0.85), rn is the previous rank of
vertex with an incoming edge, dn is the out-degree of the incoming-edge vertex, and N is the
total number of vertices in the graph. The common teleport contribution c0, calculated as
(1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex in the graph due
to the damping factor (1-p)/N, and teleport from dangling vertices (with no outgoing edges) in
the graph pΣrn/N (to avoid the rank-sink effect). This approach requires 2 additional
calculations per-vertex, i.e., non-teleport contribution of each vertex, and total teleport
contribution (to all vertices). However, it requires only 1 write per destination vertex. For this
experiment both of these approaches are assessed on a number of different graphs.
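As a concrete reference, the pull formulation above can be sketched as a single iteration step. The graph representation (in-edge lists plus an out-degree array) and all names here are illustrative, not the experiment's actual code:

```cpp
#include <cassert>
#include <vector>
using namespace std;

// One pull-based PageRank iteration: each vertex pulls p*r_n/d_n from its
// in-neighbours, plus the common teleport contribution c0. Only one write
// is performed per destination vertex.
vector<double> pagerankPullIteration(const vector<vector<int>>& inEdges,
                                     const vector<int>& outDegree,
                                     const vector<double>& rank,
                                     double p) {
  int N = (int)rank.size();
  // c0 = (1-p)/N + p * (sum of ranks of dangling vertices)/N.
  double c0 = (1 - p) / N;
  for (int u = 0; u < N; ++u)
    if (outDegree[u] == 0) c0 += p * rank[u] / N;
  vector<double> next(N);
  for (int v = 0; v < N; ++v) {
    double sum = 0;
    for (int u : inEdges[v]) sum += rank[u] / outDegree[u];
    next[v] = c0 + p * sum;
  }
  return next;
}
```

Iterating this to a fixed point preserves the total rank at 1, since dangling mass is redistributed to all vertices through c0.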
All graphs used are stored in the MatrixMarket (.mtx) file format, and obtained from the
SuiteSparse Matrix Collection. The experiment is implemented in C++, and compiled using
GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740 Rack
server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4 Synchronous
Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux release
7.9.2009 (Core). The execution time of each test case is measured using
std::chrono::high_resolution_clock. This is done 5 times for each test case, and timings
are averaged. Statistics of each test case are printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and necessary tables are set up with the help of the FILTER function to create the
charts.
While it might seem that the pull method would be a clear winner, the results indicate that
although pull is always faster than the push approach, the difference between the two depends
on the nature of the graph. The next step is to compare the performance between finding
pagerank using C++ DiGraph class directly (using arrays of edge-lists) vs its CSR
(Compressed Sparse Row) representation (contiguous). Using a CSR representation has
the potential for performance improvement due to information on vertices and edges being
stored contiguously.
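A minimal sketch of what the CSR layout looks like, assuming a simple offsets-plus-edges pair; the actual DiGraph and CSR classes may differ:

```cpp
#include <cassert>
#include <vector>
using namespace std;

// CSR: edge lists flattened into one contiguous array, with a per-vertex
// offsets array. Edges of vertex v are edges[offsets[v] .. offsets[v+1]).
struct Csr {
  vector<int> offsets;
  vector<int> edges;
};

Csr toCsr(const vector<vector<int>>& adj) {
  Csr g;
  g.offsets.reserve(adj.size() + 1);
  g.offsets.push_back(0);
  for (const auto& nbrs : adj) {
    g.edges.insert(g.edges.end(), nbrs.begin(), nbrs.end());
    g.offsets.push_back((int)g.edges.size());
  }
  return g;
}
```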
Adjusting Monolithic (Sequential) approach
Push Pull Class CSR
1. Performance of contribution-push based vs contribution-pull based PageRank.
2. Performance of C++ DiGraph class based vs CSR based PageRank (pull).
Next an experiment is conducted to assess the performance benefit of each algorithmic
optimization separately. For splitting graph by components optimization, the following
approaches are compared: pagerank without optimization, pagerank with vertices split by
components, and finally pagerank with components sorted in topological order. Components
of the graph are obtained using Kosaraju’s algorithm. Topological ordering is done by
representing the graph as a block-graph, where each component is represented as a vertex,
and cross-edges between components are represented as edges. This block-graph is then
topologically sorted, and this vertex-order in block-graph is used to reorder the components
in topological order. Vertices, and their respective edges, are simply reordered accordingly
before computing pagerank (no graph partitioning is done). Each approach was attempted
on a number of graphs. On a few graphs, splitting vertices by components provides a
speedup, but sorting components in topological order provides no additional speedup. For
road networks like germany_osm, which have only one component, the speedup is possibly
due to the vertex reordering caused by the dfs() required for splitting by
components.

For skipping in-identicals optimization, comparison is done with unoptimized
pagerank. In-identical vertices are obtained by hashing vertices by their in-vertex lists and
scanning for matches. Except the first in-identical vertex of an in-identicals-group, remaining
vertices are skipped during pagerank computation. After each iteration ends, rank of the first
in-identical vertex is copied to the remaining vertices of the in-identicals group. The vertices
to be skipped are marked with negative source-offset in CSR. On indochina-2004 graph,
skipping in-identicals provides a speedup of ~1.8, but on average provides no speedup for
other graphs. This is likely because indochina-2004 has a large number of in-identicals and
in-identical groups, although it has neither the highest in-identicals % nor the highest avg.
in-identical group size.

For skipping chains optimization, comparison is done with unoptimized
pagerank. It is important to note that a chain here means a set of
unidirectional links connecting one vertex to the next, without any additional edges.
Bi-directional links are not considered as chains. Chain vertices are obtained by traversing
2-degree vertices in both directions and marking visited ones. Except the first chain vertex of
a chains-group, remaining vertices are skipped during pagerank computation. After each
iteration ends, ranks of the remaining vertices in each chains-group are updated using the
(GP) formula c0×(1-p^n)/(1-p) + p^n×r, where c0 is the common teleport contribution, p is the
damping factor, n is the distance from the first chain vertex, and r is the rank of the first chain
vertex in the previous iteration. The vertices to be skipped are marked with negative
source-offset in CSR. On average, skipping chain vertices provides no speedup. This is
likely because most graphs don't have enough chains to provide an advantage. Road
networks do have chains, but they are bi-directional, and thus not considered here.

For skipping converged vertices optimization, the following approaches are compared:
pagerank without optimization, pagerank skipping converged vertices with re-check (in 2-16
turns), and pagerank skipping converged vertices after several turns (in 2-64 turns). Skip
with re-check (skip-check) approach skips the current iteration for a vertex if its rank for the
last two iterations match, and the current turn (iteration) is not a “check” turn. The check turn
is adjusted between 2-16 turns. Skip after turns (skip-after) skips all future iterations of a
vertex after its rank does not change for “after” turns. The after turns are adjusted between
2-64 turns. On average, neither skip-check, nor skip-after gives better speed than the default
(unoptimized) approach. This could be due to the unnecessary iterations added by
skip-check (mistakenly skipped), and increased memory accesses performed by skip-after
(tracking converged count).
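For the chain optimization, the (GP) formula follows from unrolling the per-vertex update r ← c0 + p×r, n times. A small sanity check, with hypothetical names:

```cpp
#include <cassert>
#include <cmath>

// Closed-form rank of a vertex n links down a unidirectional chain:
// applying r <- c0 + p*r n times telescopes into a geometric-progression
// sum, which is the formula used to update skipped chain vertices.
double chainRank(double c0, double p, int n, double r0) {
  return c0 * (1 - pow(p, n)) / (1 - p) + pow(p, n) * r0;
}
```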
Adjusting Monolithic optimizations (from STICD)
Split components Skip in-identicals Skip chains Skip converged
1. Performance benefit of PageRank with vertices split by components (pull, CSR).
2. Performance benefit of skipping in-identical vertices for PageRank (pull, CSR).
3. Performance benefit of skipping chain vertices for PageRank (pull, CSR).
4. Performance benefit of skipping converged vertices for PageRank (pull, CSR).
This experiment was for comparing performance of levelwise pagerank with various
min. compute sizes, ranging from 1 to 1E+7. Here, min. compute size is the minimum number
of nodes of each pagerank compute using the standard algorithm (monolithic). Each min. compute
size was attempted on different types of graphs, running each size 5 times per graph to get a
good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD optimizations
(using single-thread). Although there is no clear winner, it appears a min. compute size of 10
would be a good choice. Note that the levelwise approach does not make use of SIMD
instructions which are available on all modern hardware.
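The min. compute size merging might look as follows. This is a sketch under the assumption that consecutive components in topological order are simply concatenated until each group reaches the minimum size; the original implementation may differ:

```cpp
#include <cassert>
#include <vector>
using namespace std;

// Merge topologically ordered components into compute groups of at least
// minCompute vertices, so each monolithic pagerank call processes a
// reasonably large block (the last group may be smaller).
vector<vector<int>> groupComponents(const vector<vector<int>>& comps,
                                    size_t minCompute) {
  vector<vector<int>> groups;
  for (const auto& c : comps) {
    if (groups.empty() || groups.back().size() >= minCompute)
      groups.push_back({});
    groups.back().insert(groups.back().end(), c.begin(), c.end());
  }
  return groups;
}
```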
This experiment was for comparing performance between: monolithic pagerank, monolithic
pagerank skipping teleport, levelwise pagerank, levelwise pagerank skipping teleport. Each
approach was attempted on different types of graphs, running each approach 5 times per
graph to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD
optimizations (using single-thread).
Except for soc-LiveJournal1 and coPapersCiteseer, in all cases skipping teleport calculations
is slightly faster (the two exceptions could be measurement noise). The improvement is most
prominent in the case of road networks and certain web graphs.
Adjusting Levelwise (STICD) approach
Min. component size Min. compute size Skip teleport calculation
1. Comparing various min. component sizes for topologically-ordered components (levelwise...).
2. Comparing various min. compute sizes for topologically-ordered components (levelwise...).
3. Checking performance benefit of levelwise PageRank when teleport calculation is skipped.
Note: min. component size merges small components even before generating the block-graph /
topological ordering, while min. compute size does it before pagerank computation.
This experiment was for comparing performance between: pagerank with standard
algorithm (monolithic), pagerank in topologically-ordered components fashion (levelwise).
Both approaches were attempted on different types of graphs, running each approach 5
times per graph to get a good time measure. Levelwise pagerank is the STIC-D algorithm,
without ICD optimizations (using single-thread).
On average, levelwise pagerank is faster than the monolithic approach. Note that neither
approach makes use of SIMD instructions which are available on all modern hardware.
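The levelwise computation above can be sketched as monolithic pagerank applied per component in topological order. This assumes no dangling vertices (so the teleport term is just (1-p)/N, as levelwise requires) and uses illustrative names:

```cpp
#include <cassert>
#include <cmath>
#include <vector>
using namespace std;

// Levelwise sketch: components are processed in topological order, so ranks
// of earlier components are already final when a later component is
// computed, and each component can be iterated to convergence on its own.
void pagerankLevelwise(const vector<vector<int>>& inEdges,
                       const vector<int>& outDegree,
                       const vector<vector<int>>& comps,  // topological order
                       vector<double>& rank, double p, double tol) {
  int N = (int)rank.size();
  for (const auto& comp : comps) {
    double change;
    do {
      change = 0;
      for (int v : comp) {
        double sum = 0;
        for (int u : inEdges[v]) sum += rank[u] / outDegree[u];
        double r = (1 - p) / N + p * sum;  // no dangling-vertex teleport term
        change += fabs(r - rank[v]);
        rank[v] = r;
      }
    } while (change > tol);
  }
}
```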
Comparing Levelwise (STICD) approach
Monolithic nvGraph
Levelwise (STICD) vs
1. Performance of monolithic vs topologically-ordered components (levelwise) PageRank.
This experiment was for comparing performance between: static levelwise pagerank,
dynamic levelwise pagerank (process all components), dynamic levelwise pagerank skipping
unchanged components. Each approach was attempted on a number of graphs (fixed and
temporal), running each with multiple batch sizes (1, 5, 10, 50, ...). Levelwise pagerank is the
STIC-D algorithm, without ICD optimizations (using single-thread).
On average, skipping unchanged components is barely faster than not skipping.
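Skipping unchanged components might be implemented by marking components touched by the batch and propagating the mark through the block-graph; a sketch with hypothetical names and representation:

```cpp
#include <cassert>
#include <utility>
#include <vector>
using namespace std;

// After a batch of edge updates, components containing an endpoint of any
// changed edge are marked affected, and the mark is propagated to all
// downstream components through the block-graph (components indexed in
// topological order), since their ranks may change transitively.
vector<bool> affectedComponents(const vector<vector<int>>& blockAdj,
                                const vector<int>& compOf,
                                const vector<pair<int,int>>& batch) {
  vector<bool> affected(blockAdj.size(), false);
  for (const auto& e : batch) {
    affected[compOf[e.first]]  = true;
    affected[compOf[e.second]] = true;
  }
  // One forward pass suffices because components are in topological order.
  for (size_t c = 0; c < blockAdj.size(); ++c)
    if (affected[c])
      for (int d : blockAdj[c]) affected[d] = true;
  return affected;
}
```

Components left unmarked keep their previous ranks and are skipped altogether.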
Adjusting Levelwise (STICD) dynamic approach
Skip unaffected components For fixed graphs For temporal graphs
1. Checking for correctness of levelwise PageRank when unchanged components are skipped.
2. Perf. benefit of levelwise PageRank when unchanged components are skipped (fixed).
3. Perf. benefit of levelwise PageRank when unchanged components are skipped (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
This experiment was for comparing performance between: static pagerank using standard
algorithm (monolithic), static pagerank using levelwise algorithm, dynamic pagerank using
levelwise algorithm. Each approach was attempted on a number of graphs, running each
with multiple batch sizes (1, 5, 10, 50, ...). Each pagerank computation was run 5 times for
each approach to get a good time measure. Levelwise pagerank is the STIC-D algorithm,
without ICD optimizations (using single-thread).
Clearly, dynamic levelwise pagerank is faster than the static approach for many batch sizes.
Comparing dynamic approach with static
nvGraph dynamic Monolithic dynamic Levelwise dynamic
nvGraph static vs: temporal
Monolithic static vs: fixed, temporal vs: fixed, temporal
Levelwise static vs: fixed vs: fixed, temporal
1. Performance of nvGraph based static vs dynamic PageRank (temporal).
2. Performance of static vs dynamic PageRank (temporal).
3. Performance of static vs dynamic levelwise PageRank (fixed).
4. Performance of levelwise based static vs dynamic PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
This experiment was for comparing performance of levelwise CUDA pagerank with
various min. compute sizes, ranging from 1E+3 to 1E+7. Here, min. compute size is the
minimum number of nodes of each pagerank compute using the standard algorithm (monolithic CUDA).
Each min. compute size was attempted on different types of graphs, running each size 5
times per graph to get a good time measure. Levelwise pagerank is the STIC-D algorithm,
without ICD optimizations (using single-thread).
Although there is no clear winner, it appears a min. compute size of 5E+6 would be a good
choice.
Adjusting Levelwise (STICD) CUDA approach
Min. component size Min. compute size Skip teleport calculation
1. Min. component sizes for topologically-ordered components (levelwise, CUDA) PageRank.
2. Min. compute sizes for topologically-ordered components (levelwise CUDA) PageRank.
Note: min. component size merges small components even before generating the block-graph /
topological ordering, while min. compute size does it before pagerank computation.
This experiment was for comparing performance between: CUDA based pagerank with
standard algorithm (monolithic), CUDA based pagerank in topologically-ordered components
fashion (levelwise). Both approaches were attempted on different types of graphs, running
each approach 5 times per graph to get a good time measure. Levelwise pagerank is the
STIC-D algorithm, without ICD optimizations (using single-thread).
On average, levelwise pagerank performs about the same as the monolithic approach.
Comparing Levelwise (STICD) CUDA approach
nvGraph Monolithic CUDA
Monolithic vs vs
Monolithic CUDA vs
Levelwise CUDA vs vs
1. Performance of sequential execution based vs CUDA based PageRank (pull, CSR).
2. Performance of nvGraph vs CUDA based PageRank (pull, CSR).
3. Performance of Monolithic CUDA vs Levelwise CUDA PageRank (pull, CSR, ...).
This experiment was for comparing the performance between: static pagerank of updated
graph, dynamic pagerank of updated graph. Both techniques were attempted on different
temporal graphs, updating each graph with multiple batch sizes (1, 5, 10, 50, ...). New edges
are incrementally added to the graph batch-by-batch until the entire graph is complete.
Dynamic pagerank is clearly faster than the static approach for many batch sizes.
Comparing dynamic CUDA approach with static
nvGraph dynamic Monolithic dynamic Levelwise dynamic
nvGraph static vs: fixed, temporal vs: fixed, temporal vs: fixed, temporal
Monolithic static vs: fixed, temporal vs: fixed, temporal vs: fixed, temporal
Levelwise static vs: fixed, temporal vs: fixed, temporal vs: fixed, temporal
1. Performance of static vs dynamic CUDA based PageRank (fixed).
2. Performance of static vs dynamic CUDA based PageRank (temporal).
3. Performance of CUDA based static vs dynamic levelwise PageRank (fixed).
4. Performance of static vs dynamic CUDA based levelwise PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
This experiment was for comparing performance between: static pagerank of updated graph
using nvGraph, dynamic pagerank of updated graph using nvGraph, static monolithic CUDA
based pagerank of updated graph, dynamic monolithic CUDA based pagerank of updated
graph, static levelwise CUDA based pagerank of updated graph, dynamic levelwise CUDA
based pagerank of updated graph. Each approach was attempted on a number of graphs,
running each with multiple batch sizes (1, 5, 10, 50, ...). Each batch size was run with 5
different updates to the graph, and each specific update was run 5 times for each approach
to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD
optimizations.
Indeed, dynamic levelwise pagerank is faster than the static approach for many batch sizes.
In order to measure error, nvGraph pagerank is taken as a reference.
Comparing dynamic optimized CUDA approach with static
nvGraph dynamic Monolithic dynamic Levelwise dynamic
nvGraph static vs: fixed vs: fixed vs: fixed
Monolithic static vs: fixed vs: fixed vs: fixed
Levelwise static vs: fixed vs: fixed vs: fixed
1. Performance of CUDA based optimized dynamic monolithic vs levelwise PageRank (fixed).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.

More Related Content

PDF
Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...
PDF
Exploring optimizations for dynamic pagerank algorithm based on CUDA : V3
PDF
Rank adjustment strategies for Dynamic PageRank : REPORT
PDF
Exploring optimizations for dynamic PageRank algorithm based on GPU : V4
PDF
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
PDF
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
PDF
PageRank Experiments : SHORT REPORT / NOTES
PDF
STIC-D: algorithmic techniques for efficient parallel pagerank computation on...
Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...
Exploring optimizations for dynamic pagerank algorithm based on CUDA : V3
Rank adjustment strategies for Dynamic PageRank : REPORT
Exploring optimizations for dynamic PageRank algorithm based on GPU : V4
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
PageRank Experiments : SHORT REPORT / NOTES
STIC-D: algorithmic techniques for efficient parallel pagerank computation on...

Similar to Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT REPORT / NOTES (20)

PDF
Incremental Page Rank Computation on Evolving Graphs : NOTES
PDF
Benchmarking tool for graph algorithms
PDF
I/O-Efficient Techniques for Computing Pagerank : NOTES
PDF
Benchmarking Tool for Graph Algorithms
PDF
Adjusting PageRank parameters and comparing results : REPORT
PDF
Adjusting Bitset for graph : SHORT REPORT / NOTES
PPTX
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
PDF
Adjusting PageRank parameters and comparing results : REPORT
PDF
50120140506002
PDF
Incremental Page Rank Computation on Evolving Graphs
PPTX
ppt 1.pptx
PDF
Adjusting PageRank parameters and comparing results : REPORT
PPTX
UNIT III.pptx
PDF
Graph Data Structure
PPTX
Data Structures and Agorithm: DS 21 Graph Theory.pptx
PDF
graph representation.pdf
PPTX
GRAPH - DISCRETE STRUCTURE AND ALGORITHM
PPT
Data Structures-Non Linear DataStructures-Graphs
Incremental Page Rank Computation on Evolving Graphs : NOTES
Benchmarking tool for graph algorithms
I/O-Efficient Techniques for Computing Pagerank : NOTES
Benchmarking Tool for Graph Algorithms
Adjusting PageRank parameters and comparing results : REPORT
Adjusting Bitset for graph : SHORT REPORT / NOTES
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
Adjusting PageRank parameters and comparing results : REPORT
50120140506002
Incremental Page Rank Computation on Evolving Graphs
ppt 1.pptx
Adjusting PageRank parameters and comparing results : REPORT
UNIT III.pptx
Graph Data Structure
Data Structures and Agorithm: DS 21 Graph Theory.pptx
graph representation.pdf
GRAPH - DISCRETE STRUCTURE AND ALGORITHM
Data Structures-Non Linear DataStructures-Graphs
Ad

More from Subhajit Sahu (20)

PDF
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...
PDF
Adjusting primitives for graph : SHORT REPORT / NOTES
PDF
Experiments with Primitive operations : SHORT REPORT / NOTES
PDF
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
PDF
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
PDF
Shared memory Parallelism (NOTES)
PDF
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
PDF
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
PDF
Application Areas of Community Detection: A Review : NOTES
PDF
Community Detection on the GPU : NOTES
PDF
Survey for extra-child-process package : NOTES
PDF
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
PDF
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
PDF
Fast Incremental Community Detection on Dynamic Graphs : NOTES
PDF
Can you fix farming by going back 8000 years : NOTES
PDF
HITS algorithm : NOTES
PDF
Basic Computer Architecture and the Case for GPUs : NOTES
PDF
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
PDF
Are Satellites Covered in Gold Foil : NOTES
PDF
Taxation for Traders < Markets and Taxation : NOTES
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...
Adjusting primitives for graph : SHORT REPORT / NOTES
Experiments with Primitive operations : SHORT REPORT / NOTES
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
Shared memory Parallelism (NOTES)
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
Application Areas of Community Detection: A Review : NOTES
Community Detection on the GPU : NOTES
Survey for extra-child-process package : NOTES
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Fast Incremental Community Detection on Dynamic Graphs : NOTES
Can you fix farming by going back 8000 years : NOTES
HITS algorithm : NOTES
Basic Computer Architecture and the Case for GPUs : NOTES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Are Satellites Covered in Gold Foil : NOTES
Taxation for Traders < Markets and Taxation : NOTES
Ad

Recently uploaded (20)

PPTX
Introduction to Knowledge Engineering Part 1
PDF
Clinical guidelines as a resource for EBP(1).pdf
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PPT
Reliability_Chapter_ presentation 1221.5784
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PPTX
1_Introduction to advance data techniques.pptx
PPTX
climate analysis of Dhaka ,Banglades.pptx
PDF
Lecture1 pattern recognition............
PPTX
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
PPTX
Supervised vs unsupervised machine learning algorithms
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PDF
Fluorescence-microscope_Botany_detailed content
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PPT
Quality review (1)_presentation of this 21
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPTX
Introduction to machine learning and Linear Models
Introduction to Knowledge Engineering Part 1
Clinical guidelines as a resource for EBP(1).pdf
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
Reliability_Chapter_ presentation 1221.5784
STUDY DESIGN details- Lt Col Maksud (21).pptx
1_Introduction to advance data techniques.pptx
climate analysis of Dhaka ,Banglades.pptx
Lecture1 pattern recognition............
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
Supervised vs unsupervised machine learning algorithms
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
Fluorescence-microscope_Botany_detailed content
Galatica Smart Energy Infrastructure Startup Pitch Deck
Quality review (1)_presentation of this 21
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Introduction to machine learning and Linear Models

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT REPORT / NOTES

  • 1. Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether. Before starting any algorithmic optimization, a good monolithic pagerank implementation needs to be set up. There are two ways (algorithmically) to think of the pagerank calculation. One approach (push) is to find pagerank by pushing contributions to out-vertices. The push method is somewhat easier to implement, and is described in this lecture. With this approach, in an iteration for each vertex, the ranks of vertices connected to its outgoing edge are cumulated with p×rn, where p is the damping factor (0.85), and rn is the rank of the (source) vertex in the previous iteration. But, if a vertex has no out-going edges, it is considered to have out-going edges to all vertices in the graph (including itself). 
This is because a random surfer jumps to a random page upon visiting a page with no links, in order to avoid the rank-sink effect. However, it requires multiple writes per source vertex, due to the cumulation (+=) operation. The other approach (pull) is to pull contributions from in-vertices. Here, the rank of a vertex in an iteration is calculated as c0 + pΣrn/dn, where c0 is the common teleport contribution, p is the damping factor (0.85), rn is the previous rank of vertex with an incoming edge, dn is the out-degree of the incoming-edge vertex, and N is the total number of vertices in the graph. The common teleport contribution c0, calculated as (1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex in the graph due to the damping factor (1-p)/N, and teleport from dangling vertices (with no outgoing edges) in the graph pΣrn/N (to avoid the rank-sink effect). This approach requires 2 additional calculations per-vertex, i.e., non-teleport contribution of each vertex, and total teleport contribution (to all vertices). However, it requires only 1 write per destination vertex. For this experiment both of these approaches are assessed on a number of different graphs. All graphs used are stored in the MatrixMarket (.mtx) file format, and obtained from the SuiteSparse Matrix Collection. The experiment is implemented in C++, and compiled using GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740 Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux release 7.9.2009 (Core). The execution time of each test case is measured using std::chrono::high_performance_timer. This is done 5 times for each test case, and timings
  • 2. are averaged. Statistics of each test case is printed to standard output (stdout), and redirected to a log file, which is then processed with a script to generate a CSV file, with each row representing the details of a single test case. This CSV file is imported into Google Sheets, and necessary tables are set up with the help of the FILTER function to create the charts. While it might seem that the pull method would be a clear winner, the results indicate that although pull is always faster than push approach, the difference between the two depends on the nature of the graph. The next step is to compare the performance between finding pagerank using C++ DiGraph class directly (using arrays of edge-lists) vs its CSR (Compressed Sparse Row) representation (contiguous). Using a CSR representation has the potential for performance improvement due to information on vertices and edges being stored contiguously. Adjusting Monolithic (Sequential) approach Push Pull Class CSR 1. Performance of contribution-push based vs contribution-pull based PageRank. 2. Performance of C++ DiGraph class based vs CSR based PageRank (pull). Next an experiment is conducted to assess the performance benefit of each algorithmic optimization separately. For splitting graph by components optimization, the following approaches are compared: pagerank without optimization, pagerank with vertices split by components, and finally pagerank with components sorted in topological order. Components of the graph are obtained using Kosaraju’s algorithm. Topological ordering is done by representing the graph as a block-graph, where each component is represented as a vertex, and cross-edges between components are represented as edges. This block-graph is then topologically sorted, and this vertex-order in block-graph is used to reorder the components in topological order. Vertices, and their respective edges are accordingly simply reordered before computing pagerank (no graph partitioning is done). 
Each approach was attempted on a number of graphs. On a few graphs, splitting vertices by components provides a speedup, but sorting components in topological order provides no additional speedup. For road networks, like germany_osm which only have one component, the speedup is possibly because of the vertex reordering caused by dfs() which is required for splitting by components. For skipping in-identicals optimization, comparison is done with unoptimized pagerank. In-identical vertices are obtained by scanning matching edges of a vertex by in-vertices hash. Except the first in-identical vertex of an in-identicals-group, remaining vertices are skipped during pagerank computation. After each iteration ends, rank of the first in-identical vertex is copied to the remaining vertices of the in-identicals group. The vertices to be skipped are marked with negative source-offset in CSR. On indochina-2004 graph, skipping in-identicals provides a speedup of ~1.8, but on average provides no speedup for other graphs. This is likely due to the fact that the graph indochina-2004 has a large number of in-identicals and in-identical groups. Although, it doesn't have the highest in-identicals % or the highest avg. in-identical group size. For skipping chains optimization, comparison is
  • 3. done with unoptimized pagerank. It is important to note that a chain here means a set of unidirectional links connecting one vertex to the next, without any additional edges; bi-directional links are not considered chains. Chain vertices are obtained by traversing 2-degree vertices in both directions and marking visited ones. Except for the first chain vertex of a chains group, the remaining vertices are skipped during pagerank computation. After each iteration ends, the ranks of the remaining vertices in each chains group are updated using the (GP) formula c0×(1−p^n)/(1−p) + p^n×r, where c0 is the common teleport contribution, p is the damping factor, n is the distance from the first chain vertex, and r is the rank of the first chain vertex in the previous iteration. The vertices to be skipped are marked with a negative source-offset in the CSR. On average, skipping chain vertices provides no speedup. This is likely because most graphs don't have enough chains to provide an advantage; road networks do have chains, but they are bi-directional, and thus not considered here. For the skipping converged vertices optimization, the following approaches are compared: pagerank without optimization, pagerank skipping converged vertices with re-check (in 2-16 turns), and pagerank skipping converged vertices after several turns (in 2-64 turns). The skip with re-check (skip-check) approach skips the current iteration for a vertex if its rank did not change over the last two iterations and the current turn (iteration) is not a "check" turn; the check turn is adjusted between 2-16 turns. Skip after turns (skip-after) skips all future iterations of a vertex after its rank does not change for "after" turns; the after turns are adjusted between 2-64 turns. On average, neither skip-check nor skip-after gives better speed than the default (unoptimized) approach.
This could be due to the unnecessary iterations added by skip-check (vertices mistakenly skipped), and the increased memory accesses performed by skip-after (tracking the converged count).

Adjusting Monolithic optimizations (from STICD): Split components, Skip in-identicals, Skip chains, Skip converged
1. Performance benefit of PageRank with vertices split by components (pull, CSR).
2. Performance benefit of skipping in-identical vertices for PageRank (pull, CSR).
3. Performance benefit of skipping chain vertices for PageRank (pull, CSR).
4. Performance benefit of skipping converged vertices for PageRank (pull, CSR).

This experiment was for comparing the performance of levelwise pagerank with various min. compute sizes, ranging from 1 to 1E+7. Here, min. compute size is the minimum number of nodes in each pagerank compute using the standard (monolithic) algorithm. Each min. compute size was attempted on different types of graphs, running each size 5 times per graph to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD optimizations (single-threaded). Although there is no clear winner, it appears a min. compute size of 10 would be a good choice. Note that the levelwise approach does not make use of SIMD instructions, which are available on all modern hardware.

This experiment was for comparing the performance of: monolithic pagerank, monolithic pagerank skipping teleport, levelwise pagerank, and levelwise pagerank skipping teleport. Each approach was attempted on different types of graphs, running each approach 5 times per
  • 4. graph to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD optimizations (single-threaded). Except for soc-LiveJournal1 and coPapersCiteseer, in all cases skipping teleport calculations is slightly faster (the two exceptions could be random variation). The improvement is most prominent in the case of road networks and certain web graphs.

Adjusting Levelwise (STICD) approach: Min. component size, Min. compute size, Skip teleport calculation
1. Comparing various min. component sizes for topologically-ordered components (levelwise...).
2. Comparing various min. compute sizes for topologically-ordered components (levelwise...).
3. Checking performance benefit of levelwise PageRank when teleport calculation is skipped.
Note: min. component size merges small components even before generating the block-graph / topological ordering, but min. compute size does it just before pagerank computation.

This experiment was for comparing the performance of: pagerank with the standard algorithm (monolithic), and pagerank in topologically-ordered components fashion (levelwise). Both approaches were attempted on different types of graphs, running each approach 5 times per graph to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD optimizations (single-threaded). On average, levelwise pagerank is faster than the monolithic approach. Note that neither approach makes use of SIMD instructions, which are available on all modern hardware.

Comparing Levelwise (STICD) approach: Monolithic vs nvGraph vs Levelwise (STICD)
1. Performance of monolithic vs topologically-ordered components (levelwise) PageRank.

This experiment was for comparing the performance of: static levelwise pagerank, dynamic levelwise pagerank (processing all components), and dynamic levelwise pagerank skipping unchanged components. Each approach was attempted on a number of graphs (fixed and temporal), running each with multiple batch sizes (1, 5, 10, 50, ...).
Levelwise pagerank is the STIC-D algorithm, without ICD optimizations (single-threaded). On average, skipping unchanged components is only barely faster than not skipping.
  • 5. Adjusting Levelwise (STICD) dynamic approach: Skip unaffected components (for fixed graphs, for temporal graphs)
1. Checking for correctness of levelwise PageRank when unchanged components are skipped.
2. Perf. benefit of levelwise PageRank when unchanged components are skipped (fixed).
3. Perf. benefit of levelwise PageRank when unchanged components are skipped (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates; temporal ⇒ batches of edge updates from temporal graphs.

This experiment was for comparing the performance of: static pagerank using the standard algorithm (monolithic), static pagerank using the levelwise algorithm, and dynamic pagerank using the levelwise algorithm. Each approach was attempted on a number of graphs, running each with multiple batch sizes (1, 5, 10, 50, ...). Each pagerank computation was run 5 times per approach to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD optimizations (single-threaded). Clearly, dynamic levelwise pagerank is faster than the static approach for many batch sizes.

Comparing dynamic approach with static (vs nvGraph / Monolithic / Levelwise dynamic):
nvGraph static: vs: temporal
Monolithic static: vs: fixed, temporal; vs: fixed, temporal
Levelwise static: vs: fixed; vs: fixed, temporal
1. Performance of nvGraph based static vs dynamic PageRank (temporal).
2. Performance of static vs dynamic PageRank (temporal).
3. Performance of static vs dynamic levelwise PageRank (fixed).
4. Performance of levelwise based static vs dynamic PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates; temporal ⇒ batches of edge updates from temporal graphs.

This experiment was for comparing the performance of levelwise CUDA pagerank with various min. compute sizes, ranging from 1E+3 to 1E+7. Here, min. compute size is the minimum number of nodes in each pagerank compute using the standard algorithm (monolithic CUDA). Each min.
compute size was attempted on different types of graphs, running each size 5 times per graph to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD optimizations (single-threaded).
  • 6. Although there is no clear winner, it appears a min. compute size of 5E+6 would be a good choice.

Adjusting Levelwise (STICD) CUDA approach: Min. component size, Min. compute size, Skip teleport calculation
1. Min. component sizes for topologically-ordered components (levelwise, CUDA) PageRank.
2. Min. compute sizes for topologically-ordered components (levelwise CUDA) PageRank.
Note: min. component size merges small components even before generating the block-graph / topological ordering, but min. compute size does it just before pagerank computation.

This experiment was for comparing the performance of: CUDA based pagerank with the standard algorithm (monolithic), and CUDA based pagerank in topologically-ordered components fashion (levelwise). Both approaches were attempted on different types of graphs, running each approach 5 times per graph to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD optimizations (single-threaded). On average, levelwise pagerank performs the same as the monolithic approach.

Comparing Levelwise (STICD) CUDA approach (Monolithic, nvGraph, Monolithic CUDA, Levelwise CUDA):
1. Performance of sequential execution based vs CUDA based PageRank (pull, CSR).
2. Performance of nvGraph vs CUDA based PageRank (pull, CSR).
3. Performance of Monolithic CUDA vs Levelwise CUDA PageRank (pull, CSR, ...).

This experiment was for comparing the performance of: static pagerank of the updated graph, and dynamic pagerank of the updated graph. Both techniques were attempted on different temporal graphs, updating each graph with multiple batch sizes (1, 5, 10, 50, ...). New edges are incrementally added to the graph batch-by-batch until the entire graph is complete. Dynamic pagerank is clearly faster than the static approach for many batch sizes.
  • 7. Comparing dynamic CUDA approach with static: each of nvGraph static, Monolithic static, and Levelwise static is compared against each of nvGraph dynamic, Monolithic dynamic, and Levelwise dynamic, on both fixed and temporal graphs.
1. Performance of static vs dynamic CUDA based PageRank (fixed).
2. Performance of static vs dynamic CUDA based PageRank (temporal).
3. Performance of CUDA based static vs dynamic levelwise PageRank (fixed).
4. Performance of static vs dynamic CUDA based levelwise PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates; temporal ⇒ batches of edge updates from temporal graphs.

This experiment was for comparing the performance of: static pagerank of the updated graph using nvGraph, dynamic pagerank of the updated graph using nvGraph, static monolithic CUDA based pagerank, dynamic monolithic CUDA based pagerank, static levelwise CUDA based pagerank, and dynamic levelwise CUDA based pagerank of the updated graph. Each approach was attempted on a number of graphs, running each with multiple batch sizes (1, 5, 10, 50, ...). Each batch size was run with 5 different updates to the graph, and each specific update was run 5 times per approach to get a good time measure. Levelwise pagerank is the STIC-D algorithm, without ICD optimizations. Indeed, dynamic levelwise pagerank is faster than the static approach for many batch sizes. In order to measure error, nvGraph pagerank is taken as the reference.

Comparing dynamic optimized CUDA approach with static: each of nvGraph static, Monolithic static, and Levelwise static is compared against each of nvGraph dynamic, Monolithic dynamic, and Levelwise dynamic (fixed graphs).
1. Performance of CUDA based optimized dynamic monolithic vs levelwise PageRank (fixed).
Note: fixed ⇒ static graphs with batches of random edge updates; temporal ⇒ batches of edge updates from temporal graphs.