ISWC 2012 "Efficient execution of top-k SPARQL queries"

Efficient Execution of top-k SPARQL queries

Sara Magliacane (VU University Amsterdam)
Alessandro Bozzon (Politecnico di Milano)
Emanuele Della Valle (Politecnico di Milano)

Outline

•  Introduction
   •  What are top-k queries?
   •  Why do we need to optimize them?
•  Our approach:
   •  A rank-aware SPARQL algebra
   •  A rank-aware execution model
   •  Three planning strategies
•  Evaluation

What is a top-k query?

•  A query that returns
   1.  a limited number of results k
   2.  ordered by a scoring function that combines several criteria

Rankings, rankings everywhere…

Why do we need to optimize them?

A very intuitive and simplified example:

•  Top 3 largest countries (by both area and population)

The standard way: materialize-then-sort scheme

1.  Compute all the 242 join combinations of the two inputs (242 countries by area, 242 countries by population)
2.  Sort all the 242 join combinations
3.  Fetch the 3 best results
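A minimal Python sketch of the materialize-then-sort scheme described on this slide. The country names and normalized values are hypothetical, and the additive scoring function is an assumption made for illustration; the point is only that every join combination is computed and sorted before the k best are returned.

```python
from itertools import product


def materialize_then_sort(left, right, join_key, score, k):
    """Join everything, sort everything, then keep only the k best rows."""
    # 1. Materialize all join combinations.
    joined = [
        {**l, **r}
        for l, r in product(left, right)
        if l[join_key] == r[join_key]
    ]
    # 2. Sort all of them by the full scoring function.
    joined.sort(key=score, reverse=True)
    # 3. Fetch the k best results.
    return joined[:k]


# Hypothetical normalized values in [0, 1], for illustration only.
by_area = [
    {"country": "A", "area": 1.0},
    {"country": "B", "area": 0.6},
    {"country": "C", "area": 0.5},
    {"country": "D", "area": 0.1},
]
by_population = [
    {"country": "C", "population": 1.0},
    {"country": "B", "population": 0.8},
    {"country": "A", "population": 0.1},
    {"country": "D", "population": 0.05},
]

top = materialize_then_sort(
    by_area,
    by_population,
    "country",
    score=lambda row: row["area"] + row["population"],
    k=3,
)
print([row["country"] for row in top])
```

The cost is dominated by steps 1 and 2, which do not depend on k at all.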

Can we make it more efficient?

Can we exploit the available sorted access by area and by population?

•  Order the combinations incrementally using partial orders
•  Fetch the 3 best results

[Figure: the inputs "Countries by area" and "Countries by population", labelled 7 and 9 respectively]

The split-and-interleave scheme

•  The intuition of the previous example can be formalized with the scheme from RDBMS [Li2005, Hwang2007, Ilyas2004, Ilyas2008]:
   1.  Split the evaluation of the scoring function into single criteria
   2.  Interleave them with other operators
   3.  Use partial orders to construct the final order incrementally
•  Standard assumptions:
   •  Monotone scoring function
   •  Each criterion is evaluated as a number in [0,1] (normalization)
   •  Optimized for the case of fast sorted access for each criterion

No free lunch…

[Figure: time of the standard and split-and-interleave schemes against the number of desired results k, for a fixed number of all results n. Annotations: "Improvement" (orders of magnitude) and "Too much overhead".]

•  Users are interested in 1 <= k <= 100 (search engines)

Top-k queries in SPARQL 1.1

Example query on BSBM [Bizer2009]:

•  The top 10 offers ordered by the product ratings and offer price:

    SELECT ?product ?offer
           (norm1(?avgRat1) + norm2(?avgRat2) + norm3(?price) AS ?score)
    WHERE {
      ?product hasAvgRat1 ?avgRat1 .
      ?product hasAvgRat2 ?avgRat2 .
      ?product hasName ?name .
      ?product hasOffers ?offer .
      ?offer hasPrice ?price
    }
    ORDER BY DESC(?score)
    LIMIT 10

Tens of seconds on 5M triples (could be improved to milliseconds)

Split-and-interleave in SPARQL? Related work

•  A possible solution [Straccia2010, Bozzon2011]:
   •  Rewrite SPARQL into SQL
   •  Use existing optimized RDBMS (e.g. RankSQL [Li2005])
•  Disadvantages:
   •  Works only if the data are already in an RDBMS
   •  What about native SPARQL optimizations?
•  Federated queries over Linked Data [Wagner2012]: complementary to our approach

Challenges for native SPARQL solutions

[Figure: query processing pipeline: query, algebra generator, algebraic tree, planner, query plan; the pipeline relies on the algebra, the physical operators and the planning strategies]

Differences with SQL and RDBMS, each followed by the proposed solution:

•  Different algebra
   STEP 1: New algebra (algebraic operators and algebraic equivalences)
•  Different cost of data access in native RDF triplestores (sorted access is slow)
   STEP 2: New algorithms for physical operators, possibly using less sorted access
•  Additional optimization dimensions
   STEP 3: New planning strategies

Step 1: a rank-aware algebra

•  SPARQL-Rank algebra [Bozzon2011]
   •  Extends the standard SPARQL algebra [Perez2009]
   •  Ranked set of mappings: set of mappings augmented with an order relation
•  Extended OPERATORS
•  New EQUIVALENCES

The SPARQL-Rank algebraic operators

•  New operator: rank

[Figure: three query trees (a), (b), (c) for the example query. Each projects ?pr, ?of, ?score and ends with SLICE [0,10]; the trees apply the rank operators g1(?a1) and g3(?p1) over seqScan and orderScan_a1 accesses, and tree (c) adds a Sequence join on ?pr = ?pr with ?pr hasN ?n.]

The Rank Operator

Ω:
      ?x   ?y   ?p1   ?p2
µ1    1    8    0.8   0.8
µ2    3    3    0.3   0.6
µ3    3    4    0.4   0.6

ρp1(Ω):
      ?x   ?y   ?p1   Fp1
µ1    1    8    0.8   1.8
µ3    3    4    0.4   1.4
µ2    3    3    0.3   1.3
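The effect of the rank operator on this slide's example can be reproduced with a short Python sketch. It assumes an additive scoring function F = p1 + p2, which matches the Fp1 column shown but is not stated on the slide; the function name `rank` and the dictionary layout are illustrative only.

```python
def rank(mappings, criterion, evaluated, criteria):
    """Sketch of the rank operator: evaluate one more criterion and reorder.

    Criteria that are not evaluated yet contribute their maximum value 1,
    so each score is an upper bound on the final score of the mapping.
    """
    evaluated = evaluated | {criterion}

    def upper_bound(m):
        return sum(m[c] if c in evaluated else 1.0 for c in criteria)

    ordered = sorted(mappings, key=upper_bound, reverse=True)
    return [(m["id"], round(upper_bound(m), 2)) for m in ordered], evaluated


# The set of mappings from the slide (ids written in ASCII).
omega = [
    {"id": "mu1", "x": 1, "y": 8, "p1": 0.8, "p2": 0.8},
    {"id": "mu2", "x": 3, "y": 3, "p1": 0.3, "p2": 0.6},
    {"id": "mu3", "x": 3, "y": 4, "p1": 0.4, "p2": 0.6},
]

ranked, evaluated = rank(omega, "p1", frozenset(), ["p1", "p2"])
print(ranked)
```

Only ?p1 has been evaluated at this point, so mu3 moves ahead of mu2 even though the final order may still change once ?p2 is evaluated.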

The SPARQL-Rank algebraic operators

•  Redefined standard operators

The Join Operator

Ωp1:
      ?x   ?y   ?p1   Fp1
µ1    1    8    0.8   1.8
µ3    3    4    0.4   1.4
µ2    3    3    0.3   1.3

Ω’p2:
      ?x   ?z   ?p2   Fp2
µ4    1    9    0.8   1.8
µ5    3    0    0.6   1.6

Join of Ωp1 and Ω’p2:
           ?x   ?y   ?z   ?p1   ?p2   Fp1∪p2
µ1 ∪ µ4    1    8    9    0.8   0.8   1.6
µ3 ∪ µ5    3    4    0    0.4   0.6   1.0
µ2 ∪ µ5    3    3    0    0.3   0.6   0.9
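A Python sketch of the logical behaviour of the redefined join on this slide's data. It again assumes F = p1 + p2 (consistent with the score columns, but an assumption), and it materializes the whole join for clarity, so it illustrates the semantics of the operator rather than an efficient physical rank-join.

```python
def rank_aware_join(left, right, key, criteria):
    """Join two ranked sets; the result is ordered by the score over P1 ∪ P2."""
    joined = [{**l, **r} for l in left for r in right if l[key] == r[key]]

    def score(m):
        # A criterion that is still unbound would count as its maximum, 1.
        return sum(m.get(c, 1.0) for c in criteria)

    ordered = sorted(joined, key=score, reverse=True)
    return [(m["x"], m["y"], m["z"], round(score(m), 2)) for m in ordered]


omega_p1 = [
    {"x": 1, "y": 8, "p1": 0.8},
    {"x": 3, "y": 4, "p1": 0.4},
    {"x": 3, "y": 3, "p1": 0.3},
]
omega_p2 = [
    {"x": 1, "z": 9, "p2": 0.8},
    {"x": 3, "z": 0, "p2": 0.6},
]

result = rank_aware_join(omega_p1, omega_p2, "x", ["p1", "p2"])
print(result)
```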

SPARQL-Rank algebraic equivalences

•  Split: allows the splitting of a monolithic scoring function into several rank operators
•  Interleave: allows ordering the results incrementally by pushing the rank operator inside the query tree

From algebra to execution

Image from: http://de-timekeeper.com/yahoo_site_admin/assets/images/benzinger20gold20gears200291.17120724_std.jpg

Step 2: physical operators (top-k algorithms)

•  Rank operator
   •  If there is a sorted access index on the ranking criterion, we use it
   •  Otherwise: rank aggregation algorithms, e.g. [Hwang2007]
•  Join operator
   •  If the right operand does not influence the ranking: streaming index join
   •  Otherwise: a rank-join algorithm [see next slides]
•  Other operators are straightforward:
   •  E.g. the standard FILTER conserves the ordering of its input
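The streaming index join case can be sketched in Python as a generator. The assumption here is that the right operand fits in an in-memory hash index; the function name and the sample data are hypothetical. Because the right side carries no ranking criterion, the output simply inherits the order of the ranked left input, and a consumer that needs only k results stops pulling early.

```python
from collections import defaultdict
from itertools import islice


def streaming_index_join(ranked_left, right, key):
    """Join a ranked left input with an unranked right input, lazily.

    The left input is consumed in rank order and never reordered, so the
    output preserves the ranking of the left operand.
    """
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    for l in ranked_left:
        for r in index.get(l[key], []):
            yield {**l, **r}


# Hypothetical ranked products and their (rank-irrelevant) names.
ranked_products = [
    {"product": "p2", "score": 1.7},
    {"product": "p1", "score": 1.0},
    {"product": "p3", "score": 0.9},
]
names = [
    {"product": "p1", "name": "Alpha"},
    {"product": "p2", "name": "Beta"},
    {"product": "p3", "name": "Gamma"},
]

top2 = list(islice(streaming_index_join(ranked_products, names, "product"), 2))
print([row["name"] for row in top2])
```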

Rank-Join algorithms

•  Different algorithms based on the access available in the inputs:
   •  (a) Hash Rank-Join (RankJoin), e.g. HRJN [Ilyas2004], from the literature: sortedAccess on both inputs
   •  (b) RankSequence (e.g. RSEQ), new: sortedAccess on one input, randomAccess on the other
      •  Minimum sorted access
      •  Leverages random access
   •  (c) Random Access Rank-Join (RA-RankJoin), e.g. RA-HRJN [Ilyas2004], from the literature: randomAccess on both inputs
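To make the hash rank-join idea concrete, here is a simplified HRJN-style sketch in Python. It is not the ARQ-Rank implementation: it assumes two in-memory inputs already sorted by descending score, an additive scoring function, and a plain alternating pull strategy. A joined result is emitted only once its score reaches the threshold, which bounds the score of any result that still involves an unseen tuple.

```python
import heapq
from collections import defaultdict

_END = object()  # sentinel marking an exhausted input


def hrjn(left, right, key, left_score, right_score):
    """Yield (score, row) pairs in descending order of combined score."""
    inputs = [iter(left), iter(right)]
    score_of = [left_score, right_score]
    tables = [defaultdict(list), defaultdict(list)]  # seen tuples per side
    top = [None, None]    # first (best) score seen on each side
    last = [None, None]   # latest (lowest) score seen on each side
    done = [False, False]
    heap, counter, side = [], 0, 0

    def threshold():
        bounds = []
        for s in (0, 1):
            o = 1 - s
            if done[s]:
                continue              # no unseen tuples left on this side
            if top[o] is None:
                if done[o]:
                    continue          # other side is empty: nothing can join
                return float("inf")   # other side not read yet: no bound
            if last[s] is None:
                return float("inf")   # this side not read yet: no bound
            bounds.append(last[s] + top[o])
        return max(bounds, default=float("-inf"))

    while True:
        t = threshold()
        while heap and -heap[0][0] >= t:
            neg, _, row = heapq.heappop(heap)
            yield -neg, row
        if done[0] and done[1]:
            return
        if done[side]:
            side = 1 - side
        row = next(inputs[side], _END)
        if row is _END:
            done[side] = True
        else:
            s = score_of[side](row)
            if top[side] is None:
                top[side] = s
            last[side] = s
            tables[side][row[key]].append((s, row))
            # Probe the tuples already seen on the other side.
            for other_score, other in tables[1 - side].get(row[key], []):
                heapq.heappush(heap, (-(s + other_score), counter, {**other, **row}))
                counter += 1
        side = 1 - side


# Hypothetical inputs, each sorted by descending score.
by_rating = [{"product": "p1", "a": 0.9}, {"product": "p2", "a": 0.8},
             {"product": "p3", "a": 0.2}]
by_price = [{"product": "p2", "b": 0.9}, {"product": "p3", "b": 0.7},
            {"product": "p1", "b": 0.1}]

results = list(hrjn(by_rating, by_price, "product",
                    lambda r: r["a"], lambda r: r["b"]))
print([(row["product"], round(score, 2)) for score, row in results])
```

Since `hrjn` is a generator, a caller that wants only the top k results can stop iterating after k items, leaving the tails of both inputs unread.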

Step 3: planning strategies

•  Using the algebraic equivalences we can produce several equivalent algebraic trees
•  The planner can use them to implement several planning strategies:
   1.  Rank of BGPs
   2.  Interleaved
   3.  Rank Join

[Figure: example query trees for the three strategies]

1. Rank of BGPs (ROB)

•  Split the monolithic scoring function into several incremental rank operators (rho)

[Figure: two plans for the example query. Materialize-then-sort: the whole BGP, then EXTEND [?score = norm1(?avgRat1)+norm2(?avgRat2)+norm3(?price)], ORDER [?score] and SLICE [0,10]. Rank of BGPs: the same BGP, then the rank operators norm1(?avgRat1), norm2(?avgRat2) and norm3(?price), then SLICE [0,10].]

2. Interleaved (INTER)

•  Separate the pattern into two groups:
   •  Triple patterns that influence the ranking
   •  Triple patterns that don’t influence the ranking

[Figure: materialize-then-sort plan and Interleaved plan for the example query. In the Interleaved plan, the rank operators norm1(?avgRat1), norm2(?avgRat2) and norm3(?price) are applied to the triple patterns that influence the ranking; the result is joined on ?product = ?product with {?product hasName ?name}, followed by SLICE [0,10].]

3. Rank-Join (RJ)

•  Split into one triple pattern for each ranking criterion
•  Most appropriate join algorithm based on available access

[Figure: materialize-then-sort plan and Rank-Join plan for the example query. In the Rank-Join plan, norm1(?avgRat1) is applied to {?product hasAvgRat1 ?avgRat1}, norm2(?avgRat2) to {?product hasAvgRat2 ?avgRat2} and norm3(?price) to {?product hasOffer ?offer . ?offer hasPrice ?price}; the ranked inputs are combined by RankJoin operators on ?product = ?product, joined with {?product hasName ?name}, and followed by SLICE [0,10].]

Experimental evaluation

•  Prototype implementation of our system:
   •  ARQ-Rank (extends Jena ARQ 2.8.9)
•  Extended version of Berlin SPARQL Benchmark [Bizer2009]
   •  Added ranking attributes
   •  Added top-k queries
•  Jena TDB 0.8.11 as storage
•  Code and experiments: sparqlrank.search-computing.org

Experiment 1: compare planning strategies

•  Example query, 5M triples dataset
•  Worst-case scenario: no sorted access indexes (slow sorted access)
   •  One to two orders of magnitude better
•  Standard scenario: sorted access indexes (fast sorted access)
   •  Two orders of magnitude better

Experiment 2: Small Benchmark (8 queries)

[Figure: plots of time (log scale) against dataset size (100k to 5M) for the benchmark queries]

Conclusions and Future Work

•  A system that speeds up the execution of top-k queries in SPARQL by orders of magnitude:
   •  STEP 1: A rank-aware SPARQL algebra (SPARQL-Rank algebra)
   •  STEP 2: A rank-join algorithm (RSEQ)
   •  STEP 3: Three planning strategies (ROB, INTER, RJ)
•  ARQ-Rank, a rank-aware extension of Jena ARQ
•  A small benchmark for top-k queries, based on BSBM [Bizer2009]
•  All available at
•  Future work:
   •  More advanced, cost-based optimization techniques
   •  Extension to federated top-k query processing
   •  Top-k queries under OWL2QL entailment regime

Bibliography

•  [Bozzon2011] A. Bozzon et al. Towards and efficient SPARQL top-k query execution in virtual RDF stores. In DBRANK workshop at VLDB ’11, 2011.
•  [Wagner2012] A. Wagner et al. Top-k Linked Data Query Processing. In ESWC ’12. Springer, 2012.
•  [Bizer2009] C. Bizer and A. Schultz. The Berlin SPARQL Benchmark. Int. J. Semantic Web Inf. Syst., 5(2), 2009.
•  [Li2005] C. Li et al. RankSQL: query algebra and optimization for relational top-k queries. In SIGMOD ’05. ACM, 2005.
•  [DellaValle2012] E. Della Valle et al. Order matters! Harnessing a world of orderings for reasoning over massive data. Semantic Web Journal, 2012.
•  [Hwang2007] S.-w. Hwang and K. Chang. Probe minimization by schedule optimization: Supporting top-k queries with expensive predicates. IEEE TKDE, 19(5), 2007.
•  [Ilyas2004] I. F. Ilyas et al. Rank-aware Query Optimization. In SIGMOD ’04. ACM, 2004.
•  [Ilyas2008] I. F. Ilyas et al. A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv., 40(4), 2008.
•  [Perez2009] J. Perez et al. Semantics and complexity of SPARQL. ACM Trans. Database Syst., 34(3), 2009.
•  [Schmidt2010] M. Schmidt et al. Foundations of SPARQL query optimization. In ICDT ’10, ACM, 2010.
•  [Straccia2010] U. Straccia. SoftFacts: A top-k retrieval engine for ontology mediated access to relational databases. In SMC ’10. IEEE, 2010.

BACK-UP SLIDES

Why do we need to optimize them?

An additional less intuitive and less simplified example:

•  Top 2 pairs of most populated cities and largest countries

[Figure: Moscow, Shanghai]

The materialize-then-sort scheme

1.  Materialize all 14K combinations
2.  Sort all 14K join combinations
3.  Fetch 2 best results

[Figure: the two inputs, "Countries by area" and "Cities by population" (Shanghai, Istanbul, Karachi, Mumbai, Moscow, …), labelled 249 and 14K* respectively, with normalized values ranging from 1 down to 2e-08 (Vatican); Moscow and Shanghai appear at the top of the figure]

* According to DBPedia, but probably more

Can we make it more efficient?

Can we exploit the sorted access by area and by population?

•  Order the combinations incrementally using partial orders
•  Fetch 2 best results

[Figure: the inputs "Countries by area" and "Cities by population" (Shanghai, Istanbul, Karachi, Mumbai, Moscow, …), labelled 9 and 13 respectively; Moscow and Shanghai appear at the top of the figure]

SPARQL-Rank algebra Definitions

Mapping µ: an intermediate SPARQL solution, equivalent to a SQL tuple

Set of mappings:
      ?x   ?y   ?p1   ?p2
µ1    1    8    0.8   0.8
µ2    3    3    0.3   0.6

Maximal possible score
Given a scoring function F(p1, …, pn) and a set of predicates P = {p1, …, pj}, the maximal possible score for a mapping µ is defined as:

\[
F_P(p_1, \dots, p_n)[\mu] = F\left( p_i = \begin{cases} p_i[\mu] & \text{if } p_i \in P \\ 1 & \text{otherwise} \end{cases} \quad \forall i \right)
\]
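As a worked instance of this definition, assume an additive scoring function F(p1, p2) = p1 + p2. This is an assumption: it is consistent with the score columns of the operator examples in this deck but is not stated on the slides. For the mapping µ1 above, with p1[µ1] = 0.8 and p2[µ1] = 0.8:

```latex
\[
F_{\{p_1\}}[\mu_1] = p_1[\mu_1] + 1 = 0.8 + 1 = 1.8,
\qquad
F_{\{p_1, p_2\}}[\mu_1] = p_1[\mu_1] + p_2[\mu_1] = 0.8 + 0.8 = 1.6
\]
```

Evaluating more predicates can only lower (or keep) the maximal possible score, which is why it can serve as an upper bound while the query is still being processed.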

SPARQL-Rank algebra Definitions

Ranking principle
Given two mappings µ1 and µ2 with FP[µ1] > FP[µ2], if we process µ2 we also need to process µ1.

Ranked set of mappings
Given a set of predicates P, a ranked set of mappings ΩP is a set of mappings Ω augmented with the following properties:
•  Score: for each mapping µ, the maximal possible score FP[µ]
•  Order: the order relation <ΩP is defined on ΩP based on the scores of the single mappings

The SPARQL-Rank algebraic operators

SPARQL-Rank algebraic equivalences

Allows ordering the results incrementally by pushing the rank operator inside the query execution tree.

The RSEQ algorithm

Evaluation: additional technical information

•  Experimental setting:
   •  AMD 64 bit processor 2.66 GHz
   •  4 GB RAM
   •  Debian kernel 2.6.26-2
   •  Sun Java 1.6.0
   •  Maximum heap size 2GB
•  8 queries available at

More experimental results: the RankJoin operators

•  Example query, 5M triples dataset
•  Worst-case scenario: no sorted access indexes (left)
   •  RSEQ is the best, especially for k < 1000
•  Standard scenario: sorted access indexes (right)
   •  All three are comparable, RA-HRJN is best for k > 1000

ARQ-Rank architecture
