Building High-Performance
Language Implementations
With Low Effort
Stefan Marr
FOSDEM 2015, Brussels, Belgium
January 31st, 2015
@smarr
http://stefan-marr.de
Why should you care about how
Programming Languages work?
2
SMBC: http://www.smbc-comics.com/?id=2088
Why should you care about how
Programming Languages work?
• Performance isn’t magic
• Domain-specific languages
• More concise
• More productive
• It’s easier than it looks
• Often open source
• Contributions welcome
What’s “High-Performance”?
4
Based on latest data from http://benchmarksgame.alioth.debian.org/
Geometric mean over available benchmarks.
Disclaimer: Not indicative of application performance!
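The geometric mean mentioned above can be illustrated with a minimal sketch. The slowdown factors below are made-up numbers for illustration, not data from the benchmarks game.

```python
from statistics import geometric_mean

# Hypothetical slowdown factors of one language relative to a baseline,
# one value per benchmark (illustrative only, not measured data).
slowdowns = [2.0, 8.0, 4.0]

# Geometric mean: the n-th root of the product of the n ratios.
# It is the usual way to average normalized runtimes.
overall = geometric_mean(slowdowns)
print(round(overall, 6))
```

Unlike the arithmetic mean, the geometric mean is not dominated by a single benchmark with a very large ratio.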
Competitively Fast!
[Bar chart comparing Java, V8, C#, Dart, Python, Lua, PHP, Ruby; y-axis from 0 to 18]
What’s “Low Effort”?
Small and Manageable
KLOC: 1000 Lines of Code, without blank lines and comments
[Bar chart, log scale from 1 to 1000 KLOC. Bar values: 16, 260, 525, 562. Labeled implementations: V8 JavaScript, HotSpot Java Virtual Machine, Dart VM, Lua 5.3 interp.]
Language Implementation Approaches
6
Source
Program
Interpreter
Run Time
Development Time
Input
Output
Source
Program
Compiler Binary
Input
Output
Run Time
Development Time
Simple, but often slow More complex, but often faster
Not ideal for all languages.
Modern Virtual Machines
7
Source
Program
Interpreter
Run Time
Development Time
Input
Output
Binary
Runtime Info
Compiler
Virtual Machine
with
Just-In-Time
Compilation
VMs are Highly Complex
8
Interpreter
Input
Output
Compiler Optimizer
Garbage
Collector
CodeGen
Foreign
Function
Interface
Threads
and
Memory
Model
How to reuse most parts
for a new language?
Debugging
Profiling
…
Easily
500 KLOC
How to reuse most parts
for a new language?
9
Input
Output
Make Interpreters Replaceable Components!
Interpreter
Compiler Optimizer
Garbage
Collector
CodeGen
Foreign
Function
Interface
Threads
and
Memory
Model
Garbage
Collector
…
Interpreter
Interpreter
…
Interpreter-based Approaches
Truffle + Graal
with Partial Evaluation
Oracle Labs
RPython
with Meta-Tracing
[3] Würthinger et al., One VM to Rule Them All, Onward!
2013, ACM, pp. 187-204.
[2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT
Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
SELF-OPTIMIZING TREES
A Simple Technique for Language Implementation and Optimization
[1] Würthinger, T.; Wöß, A.; Stadler, L.; Duboscq, G.; Simon, D. & Wimmer, C. (2012), Self-
Optimizing AST Interpreters, in 'Proc. of the 8th Dynamic Languages Symposium' , pp. 73-82.
Code Convention
12
Python-ish
Interpreter Code
Java-ish
Application Code
A Simple
Abstract Syntax Tree Interpreter
13
root_node = parse(file)
root_node.execute(Frame())
if (condition) {
cnt := cnt + 1;
} else {
cnt := 0;
}
cnt
1
+
cnt:
=
if
cnt:
=
0
cond
root_node
Implementing AST Nodes
14
if (condition) {
cnt := cnt + 1;
} else {
cnt := 0;
}
class Literal(ASTNode):
    final value
    def execute(frame):
        return value

class VarWrite(ASTNode):
    child sub_expr
    final idx
    def execute(frame):
        val := sub_expr.execute(frame)
        frame.local_obj[idx] := val
        return val

class VarRead(ASTNode):
    final idx
    def execute(frame):
        return frame.local_obj[idx]
cnt
1
+
cnt:
=
if
cnt:
=
0
cond
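The Python-ish node classes above can be turned into a small runnable interpreter. This is a minimal sketch, not the SOM implementation: the `Add` and `If` nodes, the `Frame` layout, and the slot constants `CND` and `CNT` are assumptions added for illustration.

```python
class Frame:
    """Holds local variables in a flat list (hypothetical layout)."""
    def __init__(self, size):
        self.local_obj = [None] * size


class Literal:
    def __init__(self, value):
        self.value = value

    def execute(self, frame):
        return self.value


class VarRead:
    def __init__(self, idx):
        self.idx = idx

    def execute(self, frame):
        return frame.local_obj[self.idx]


class VarWrite:
    def __init__(self, idx, sub_expr):
        self.idx = idx
        self.sub_expr = sub_expr

    def execute(self, frame):
        val = self.sub_expr.execute(frame)
        frame.local_obj[self.idx] = val
        return val


class Add:  # assumed node, not shown on the slide
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def execute(self, frame):
        return self.left.execute(frame) + self.right.execute(frame)


class If:  # assumed node, not shown on the slide
    def __init__(self, cond, then_branch, else_branch):
        self.cond = cond
        self.then_branch = then_branch
        self.else_branch = else_branch

    def execute(self, frame):
        if self.cond.execute(frame):
            return self.then_branch.execute(frame)
        return self.else_branch.execute(frame)


CND, CNT = 0, 1  # hypothetical slot indices for `condition` and `cnt`

# if (condition) { cnt := cnt + 1; } else { cnt := 0; }
root_node = If(VarRead(CND),
               VarWrite(CNT, Add(VarRead(CNT), Literal(1))),
               VarWrite(CNT, Literal(0)))

frame = Frame(2)
frame.local_obj[CND] = True
frame.local_obj[CNT] = 41
result = root_node.execute(frame)
print(result)
```

Each node only knows how to execute itself and its children, which is what makes the later node-replacement optimizations local and cheap to write.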
Self-Optimization by Node Specialization
15
cnt := cnt + 1
def UninitVarWrite.execute(frame):
    val := sub_expr.execute(frame)
    return specialize(val).execute_evaluated(frame, val)
uninitialized
variable write
cnt
1
+
cnt:
=
cnt:
=
def UninitVarWrite.specialize(val):
    if val instanceof int:
        return replace(IntVarWrite(sub_expr))
    elif …:
        …
    else:
        return replace(GenericVarWrite(sub_expr))
specialized
Self-Optimization by Node Specialization
16
cnt := cnt + 1
def IntVarWrite.execute(frame):
    try:
        val := sub_expr.execute_int(frame)
        return execute_eval_int(frame, val)
    except ResultExp, e:
        return respecialize(e.result).execute_evaluated(frame, e.result)

def IntVarWrite.execute_eval_int(frame, anInt):
    frame.local_int[idx] := anInt
    return anInt
int
variable write
cnt
1
+
cnt:
=
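The node replacement described on the two slides above can be sketched in runnable Python. This is a simplified illustration under stated assumptions: the `Node.replace` helper, the `Root` holder, and `ObjVarRead` are hypothetical, and the int case uses a plain type check instead of the `execute_int` plus exception protocol shown on the slide.

```python
class Node:
    parent = None

    def adopt(self, child):
        child.parent = self
        return child

    def replace(self, new_node):
        """Swap this node for new_node in whichever parent field holds it."""
        parent = self.parent
        for name, value in list(vars(parent).items()):
            if value is self:
                setattr(parent, name, new_node)
        new_node.parent = parent
        return new_node


class Frame:
    def __init__(self, size):
        self.local_obj = [None] * size  # boxed values
        self.local_int = [0] * size     # unboxed ints


class Root(Node):
    def __init__(self, body):
        self.body = self.adopt(body)

    def execute(self, frame):
        return self.body.execute(frame)


class ObjVarRead(Node):
    def __init__(self, idx):
        self.idx = idx

    def execute(self, frame):
        return frame.local_obj[self.idx]


class VarWrite(Node):
    def __init__(self, idx, sub_expr):
        self.idx = idx
        self.sub_expr = self.adopt(sub_expr)


class UninitVarWrite(VarWrite):
    def execute(self, frame):
        val = self.sub_expr.execute(frame)
        return self.specialize(val).execute_evaluated(frame, val)

    def specialize(self, val):
        cls = IntVarWrite if type(val) is int else GenericVarWrite
        return self.replace(cls(self.idx, self.sub_expr))


class IntVarWrite(VarWrite):
    def execute(self, frame):
        val = self.sub_expr.execute(frame)
        if type(val) is int:
            return self.execute_evaluated(frame, val)
        # Speculation failed: rewrite to the generic node and continue there.
        generic = self.replace(GenericVarWrite(self.idx, self.sub_expr))
        return generic.execute_evaluated(frame, val)

    def execute_evaluated(self, frame, val):
        frame.local_int[self.idx] = val
        return val


class GenericVarWrite(VarWrite):
    def execute(self, frame):
        return self.execute_evaluated(frame, self.sub_expr.execute(frame))

    def execute_evaluated(self, frame, val):
        frame.local_obj[self.idx] = val
        return val


root = Root(UninitVarWrite(0, ObjVarRead(1)))
frame = Frame(2)

frame.local_obj[1] = 7
root.execute(frame)
kind_after_int = type(root.body).__name__

frame.local_obj[1] = "seven"
root.execute(frame)
kind_after_str = type(root.body).__name__
print(kind_after_int, kind_after_str)
```

The tree rewrites itself on first execution and again when the speculation on `int` turns out to be wrong, so the common case never pays for the generic path.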
Some Possible Self-Optimizations
• Type profiling and specialization
• Value caching
• Inline caching
• Operation inlining
• Library Lowering
17
Library Lowering for Array class
createSomeArray() { return Array.new(1000, ‘fast fast fast’); }
18
class Array {
  static new(size, lambda) {
    return new(size).setAll(lambda);
  }
  setAll(lambda) {
    forEach((i, v) -> { this[i] = lambda.eval(); });
  }
}
class Object {
  eval() { return this; }
}
Optimizing for Object Values
19
createSomeArray() { return Array.new(1000, ‘fast fast fast’); }
.new
Array
global lookup
method
invocation
1000
int literal
‘fast’
string literal
Object, but not a lambda
Optimization
potential
Specialized new(size, lambda)
def UninitArrNew.execute(frame):
    size := size_expr.execute(frame)
    val := val_expr.execute(frame)
    return specialize(size, val).execute_evaluated(frame, size, val)
20
createSomeArray() { return Array.new(1000, ‘fast fast fast’); }
def UninitArrNew.specialize(size, val):
    if val instanceof Lambda:
        return replace(StdMethodInvocation())
    else:
        return replace(ArrNewWithValue())
Specialized new(size, lambda)
def ArrNewWithValue.execute_evaluated(frame, size, val):
    return Array([val] * size)
21
createSomeArray() { return Array.new(1000, ‘fast fast fast’); }
1 specialized node vs. 1000x `this[i] = lambda.eval()`
1000x `eval() { return this; }`
.new
Array
global lookup
1000
int literal
‘fast’
string literal
specialized
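The library lowering for `Array.new` can be sketched as a call site that picks its implementation from the observed argument type. This is an illustrative sketch: `ArrNewCallSite`, the `Lambda` wrapper, and the use of plain Python callables for the argument expressions are assumptions, not the SOM implementation.

```python
class Lambda:
    """Stand-in for a language-level lambda object (hypothetical)."""
    def __init__(self, fn):
        self.fn = fn

    def eval(self):
        return self.fn()


class StdMethodInvocation:
    """Unspecialized path: one eval() call per element."""
    def execute_evaluated(self, size, val):
        return [val.eval() for _ in range(size)]


class ArrNewWithValue:
    """Lowered path: the value is not a lambda, so fill the array directly."""
    def execute_evaluated(self, size, val):
        return [val] * size


class ArrNewCallSite:
    """Call site for Array.new(size, value) that (re)specializes itself."""
    def __init__(self, size_expr, val_expr):
        self.size_expr = size_expr
        self.val_expr = val_expr
        self.impl = None  # uninitialized

    def execute(self):
        size = self.size_expr()
        val = self.val_expr()
        wanted = StdMethodInvocation if isinstance(val, Lambda) else ArrNewWithValue
        if not isinstance(self.impl, wanted):
            self.impl = wanted()  # specialize, or respecialize on a type change
        return self.impl.execute_evaluated(size, val)


site = ArrNewCallSite(lambda: 1000, lambda: "fast fast fast")
arr = site.execute()
impl_name = type(site.impl).__name__
print(impl_name, len(arr))
```

With a plain value the call site never enters the per-element loop of `setAll`, which is the saving the slide summarizes as one specialized node versus 1000 calls.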
JUST-IN-TIME COMPILATION FOR
INTERPRETERS
Generating Efficient Native Code
22
How to Get Fast Program Execution?
23
VarWrite.execute(frame)
IntVarWrite.execute(frame)
VarRead.execute(frame)
Literal.execute(frame)
ArrayNewWithValue.execute(frame)
..VW_execute() # bin
..IVW_execute() # bin
..VR_execute() # bin
..L_execute() # bin
..ANWV_execute() # bin
Standard Compilation: 1 node at a time
Minimal Optimization Potential
Problems with Node-by-Node Compilation
24
cnt
1
+
cnt:
=
Slow Polymorphic Dispatches
def IntVarWrite.execute(frame):
    try:
        val := sub_expr.execute_int(frame)
        return execute_eval_int(frame, val)
    except ResultExp, e:
        return respecialize(e.result).execute_evaluated(frame, e.result)
cnt:
=
Runtime checks in general
Compilation Unit based on User Program
Meta-Tracing Partial Evaluation
Guided By AST
25
cnt
1
+
cnt:
=
if
cnt:
=
0
cnt
1
+
cnt:
=if cnt:
=
0
[3] Würthinger et al., One VM to Rule Them
All, Onward! 2013, ACM, pp. 187-204.
[2] Bolz et al., Tracing the Meta-level: PyPy's
Tracing JIT Compiler, ICOOOLPS Workshop
2009, ACM, pp. 18-25.
RPython
Just-in-Time Compilation with
Meta Tracing
RPython
• Subset of Python
– Type-inferred
• Generates VMs
27
Interpreter
source
RPython
Toolchain
Meta-Tracing
JIT Compiler
Interpreter
http://rpython.readthedocs.org/
Garbage
Collector
…
Meta-Tracing of an Interpreter
28
cnt
1
+cnt:=
if
cnt:= 0
[2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop
2009, ACM, pp. 18-25.
Meta Tracers need to know the Loops
class WhileNode(ASTNode):
    child cond_expr
    child body_expr
    def execute(frame):
        while True:
            jit_merge_point(node=self)
            cond = cond_expr.execute_bool(frame)
            if not cond:
                break
            body_expr.execute(frame)
29
while (cnt < 100) {
cnt := cnt + 1;
}
Trace
guard(cond_expr == Const(IntLessThan))
Tracing Records one Concrete Execution
class IntLessThan(ASTNode):
    child left_expr
    child right_expr
    def execute_bool(frame):
        try:
            left = left_expr.execute_int()
        except UnexpectedResult r:
            ...
        try:
            right = right_expr.execute_int()
        except UnexpectedResult r:
            ...
        return left < right
30
while (cnt < 100) {
cnt := cnt + 1;
}
Trace
guard(cond_expr == Const(IntLessThan))
guard(left_expr == Const(IntVarRead))
Tracing Records one Concrete Execution
class IntVarRead(ASTNode):
    final idx
    def execute_int(frame):
        if frame.is_int(idx):
            return frame.local_int[idx]
        else:
            new_node = respecialize()
            raise UnexpectedResult(new_node.execute())
31
while (cnt < 100) {
cnt := cnt + 1;
}
Trace
guard(cond_expr == Const(IntLessThan))
guard(left_expr == Const(IntVarRead))
i1 := left_expr.idx # Const(1)
Tracing Records one Concrete Execution
class IntVarRead(ASTNode):
final idx
def execute_int(frame):
if frame.is_int(idx):
return frame.local_int[idx]
else:
new_node = respecialize()
raise UnexpectedResult(new_node.execute())
32
while (cnt < 100) {
cnt := cnt + 1;
}
Trace
guard(cond_expr == Const(IntLessThan))
guard(left_expr == Const(IntVarRead))
i1 := left_expr.idx # Const(1)
a1 := frame.layout
i2 := a1[i1]
guard(i2 == Const(F_INT))
Tracing Records one Concrete Execution
class IntVarRead(ASTNode):
final idx
def execute_int(frame):
if frame.is_int(idx):
return frame.local_int[idx]
else:
new_node = respecialize()
raise UnexpectedResult(new_node.execute())
33
while (cnt < 100) {
cnt := cnt + 1;
}
Trace
guard(cond_expr == Const(IntLessThan))
guard(left_expr == Const(IntVarRead))
i1 := left_expr.idx # Const(1)
a1 := frame.layout
i2 := a1[i1]
guard(i2 == Const(F_INT))
i3 := left_expr.idx # Const(1)
a2 := frame.local_int
i4 := a2[i3]
Tracing Records one Concrete Execution
class IntLessThan(ASTNode):
child left_expr
child right_expr
def execute_bool(frame):
try:
left = left_expr.execute_int()
except UnexpectedResult r:
...
try:
right = right_expr.execute_int()
except UnexpectedResult r:
...
return left < right
34
while (cnt < 100) {
cnt := cnt + 1;
}
Trace
guard(cond_expr == Const(IntLessThan))
guard(left_expr == Const(IntVarRead))
i1 := left_expr.idx # Const(1)
a1 := frame.layout
i2 := a1[i1]
guard(i2 == Const(F_INT))
i3 := left_expr.idx # Const(1)
a2 := frame.local_int
i4 := a2[i3]
guard_no_exception(Const(UnexpectedResult))
Tracing Records one Concrete Execution
class IntLessThan(ASTNode):
child left_expr
child right_expr
def execute_bool(frame):
try:
left = left_expr.execute_int()
except UnexpectedResult r:
...
try:
right = right_expr.execute_int()
except UnexpectedResult r:
...
return left < right
35
while (cnt < 100) {
cnt := cnt + 1;
}
Trace
guard(cond_expr == Const(IntLessThan))
guard(left_expr == Const(IntVarRead))
i1 := left_expr.idx # Const(1)
a1 := frame.layout
i2 := a1[i1]
guard(i2 == Const(F_INT))
i3 := left_expr.idx # Const(1)
a2 := frame.local_int
i4 := a2[i3]
guard_no_exception(Const(UnexpectedResult))
guard(right_expr == Const(IntLiteral))
Tracing Records one Concrete Execution
class IntLessThan(ASTNode):
child left_expr
child right_expr
def execute_bool(frame):
try:
left = left_expr.execute_int()
except UnexpectedResult r:
...
try:
right = right_expr.execute_int()
except UnexpectedResult r:
...
return left < right
36
while (cnt < 100) {
cnt := cnt + 1;
}
Trace
guard(cond_expr == Const(IntLessThan))
guard(left_expr == Const(IntVarRead))
i1 := left_expr.idx # Const(1)
a1 := frame.layout
i2 := a1[i1]
guard(i2 == Const(F_INT))
i3 := left_expr.idx # Const(1)
a2 := frame.local_int
i4 := a2[i3]
guard_no_exception(Const(UnexpectedResult))
guard(right_expr == Const(IntLiteral))
i5 := right_expr.value # Const(100)
Tracing Records one Concrete Execution
class IntLessThan(ASTNode):
child left_expr
child right_expr
def execute_bool(frame):
try:
left = left_expr.execute_int()
except UnexpectedResult r:
...
try:
right = right_expr.execute_int()
except UnexpectedResult r:
...
return left < right
37
while (cnt < 100) {
cnt := cnt + 1;
}
Trace
guard(cond_expr == Const(IntLessThan))
guard(left_expr == Const(IntVarRead))
i1 := left_expr.idx # Const(1)
a1 := frame.layout
i2 := a1[i1]
guard(i2 == Const(F_INT))
i3 := left_expr.idx # Const(1)
a2 := frame.local_int
i4 := a2[i3]
guard_no_exception(Const(UnexpectedResult))
guard(right_expr == Const(IntLiteral))
i5 := right_expr.value # Const(100)
guard_no_exception(Const(UnexpectedResult))
Tracing Records one Concrete Execution
class IntLessThan(ASTNode):
child left_expr
child right_expr
def execute_bool(frame):
try:
left = left_expr.execute_int()
except UnexpectedResult r:
...
try:
right = right_expr.execute_int()
except UnexpectedResult r:
...
return left < right
38
while (cnt < 100) {
cnt := cnt + 1;
}
Trace
guard(cond_expr == Const(IntLessThan))
guard(left_expr == Const(IntVarRead))
i1 := left_expr.idx # Const(1)
a1 := frame.layout
i2 := a1[i1]
guard(i2 == Const(F_INT))
i3 := left_expr.idx # Const(1)
a2 := frame.local_int
i4 := a2[i3]
guard_no_exception(Const(UnexpectedResult))
guard(right_expr == Const(IntLiteral))
i5 := right_expr.value # Const(100)
guard_no_exception(Const(UnexpectedResult))
b1 := i4 < i5
Tracing Records one Concrete Execution
class WhileNode(ASTNode):
child cond_expr
child body_expr
def execute(frame):
while True:
jit_merge_point(node=self)
cond = cond_expr.execute_bool(frame)
if not cond:
break
body_expr.execute(frame)
39
while (cnt < 100) {
cnt := cnt + 1;
}
Trace
guard(cond_expr == Const(IntLessThan))
guard(left_expr == Const(IntVarRead))
i1 := left_expr.idx # Const(1)
a1 := frame.layout
i2 := a1[i1]
guard(i2 == Const(F_INT))
i3 := left_expr.idx # Const(1)
a2 := frame.local_int
i4 := a2[i3]
guard_no_exception(Const(UnexpectedResult))
guard(right_expr == Const(IntLiteral))
i5 := right_expr.value # Const(100)
guard_no_exception(Const(UnexpectedResult))
b1 := i4 < i5
guard_true(b1)
Tracing Records one Concrete Execution
class WhileNode(ASTNode):
child cond_expr
child body_expr
def execute(frame):
while True:
jit_merge_point(node=self)
cond = cond_expr.execute_bool(frame)
if not cond:
break
body_expr.execute(frame)
40
while (cnt < 100) {
cnt := cnt + 1;
}
Trace
guard(cond_expr == Const(IntLessThan))
guard(left_expr == Const(IntVarRead))
i1 := left_expr.idx # Const(1)
a1 := frame.layout
i2 := a1[i1]
guard(i2 == Const(F_INT))
i3 := left_expr.idx # Const(1)
a2 := frame.local_int
i4 := a2[i3]
guard_no_exception(Const(UnexpectedResult))
guard(right_expr == Const(IntLiteral))
i5 := right_expr.value # Const(100)
guard_no_exception(Const(UnexpectedResult))
b1 := i4 < i5
guard_true(b1)
...
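How such a trace comes about can be mimicked with a toy recorder. This is only a sketch: a real meta-tracer such as RPython's records the interpreter's operations automatically, whereas here every node emits its own trace entries by hand. The `Trace` class, the `v1`, `v2`, ... naming, and `record_condition` are hypothetical, and the exception guards are left out.

```python
F_INT = "F_INT"  # hypothetical tag marking a frame slot that holds an int


class Frame:
    def __init__(self, layout, local_int):
        self.layout = layout        # per-slot type tags
        self.local_int = local_int  # storage for int slots


class Trace:
    """Collects the operations seen during one concrete execution."""
    def __init__(self):
        self.ops = []
        self.counter = 0

    def fresh(self):
        self.counter += 1
        return f"v{self.counter}"

    def emit(self, op):
        self.ops.append(op)


class IntVarRead:
    def __init__(self, idx):
        self.idx = idx

    def execute_int(self, frame, trace):
        tag = trace.fresh()
        trace.emit(f"{tag} := frame.layout[{self.idx}]")
        trace.emit(f"guard({tag} == Const(F_INT))")
        if frame.layout[self.idx] != F_INT:
            # A real interpreter would respecialize here.
            raise TypeError("slot does not hold an int")
        val = trace.fresh()
        trace.emit(f"{val} := frame.local_int[{self.idx}]")
        return frame.local_int[self.idx], val


class IntLiteral:
    def __init__(self, value):
        self.value = value

    def execute_int(self, frame, trace):
        name = trace.fresh()
        trace.emit(f"{name} := Const({self.value})")
        return self.value, name


class IntLessThan:
    def __init__(self, left_expr, right_expr):
        self.left_expr = left_expr
        self.right_expr = right_expr

    def execute_bool(self, frame, trace):
        trace.emit(f"guard(left_expr == Const({type(self.left_expr).__name__}))")
        left, left_name = self.left_expr.execute_int(frame, trace)
        trace.emit(f"guard(right_expr == Const({type(self.right_expr).__name__}))")
        right, right_name = self.right_expr.execute_int(frame, trace)
        result = trace.fresh()
        trace.emit(f"{result} := {left_name} < {right_name}")
        return left < right, result


def record_condition(cond_expr, frame):
    """Record one concrete evaluation of the loop condition."""
    trace = Trace()
    trace.emit(f"guard(cond_expr == Const({type(cond_expr).__name__}))")
    value, name = cond_expr.execute_bool(frame, trace)
    trace.emit(f"guard_true({name})" if value else f"guard_false({name})")
    return trace.ops


# cnt < 100, with cnt stored in slot 1 and currently holding 5
cond = IntLessThan(IntVarRead(1), IntLiteral(100))
ops = record_condition(cond, Frame([F_INT, F_INT], [0, 5]))
print("\n".join(ops))
```

The recorded list is linear: every branch the interpreter took becomes a guard, and every call into a child node is already inlined.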
Traces are Ideal for Optimization
guard(cond_expr == Const(IntLessThan))
guard(left_expr == Const(IntVarRead))
i1 := left_expr.idx # Const(1)
a1 := frame.layout
i2 := a1[i1]
guard(i2 == Const(F_INT))
i3 := left_expr.idx # Const(1)
a2 := frame.local_int
i4 := a2[i3]
guard_no_exception(Const(UnexpectedResult))
guard(right_expr == Const(IntLiteral))
i5 := right_expr.value # Const(100)
guard_no_exception(Const(UnexpectedResult))
b1 := i4 < i5
guard_true(b1)
...
i1 := left_expr.idx # Const(1)
a1 := frame.layout
i1 := a1[Const(1)]
guard(i1 == Const(F_INT))
i3 := left_expr.idx # Const(1)
a2 := frame.local_int
i4 := a2[i3]
i5 := right_expr.value # Const(100)
b1 := i2 < i5
guard_true(b1)
...
a1 := frame.layout
i1 := a1[1]
guard(i1 == F_INT)
a2 := frame.local_int
i2 := a2[1]
b1 := i2 < 100
guard_true(b1)
...
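The reduction from the recorded trace to the seven remaining operations can be sketched as a single pass over the trace. This is a toy optimizer under stated assumptions: the tuple encoding of operations and the `optimize` function are hypothetical, the AST nodes are assumed to be constants for the compiled loop, and the exception guards are left out.

```python
# Each operation is a tuple: (kind, destination or subject, operands...).
trace = [
    ("guard_const", "cond_expr", "IntLessThan"),
    ("guard_const", "left_expr", "IntVarRead"),
    ("load_const", "i1", 1),                 # left_expr.idx
    ("getfield", "a1", "frame.layout"),
    ("getitem", "i2", "a1", "i1"),
    ("guard_eq", "i2", "F_INT"),
    ("load_const", "i3", 1),                 # left_expr.idx
    ("getfield", "a2", "frame.local_int"),
    ("getitem", "i4", "a2", "i3"),
    ("guard_const", "right_expr", "IntLiteral"),
    ("load_const", "i5", 100),               # right_expr.value
    ("int_lt", "b1", "i4", "i5"),
    ("guard_true", "b1"),
]


def optimize(ops):
    """Drop guards on constant AST structure and fold constant loads."""
    consts = {}
    out = []
    for op in ops:
        kind = op[0]
        if kind == "guard_const":
            # The AST node is fixed for this trace, so the guard always holds.
            continue
        if kind == "load_const":
            _, dst, value = op
            consts[dst] = value
            continue
        # Replace operands that are known constants by their values.
        out.append(tuple(consts.get(arg, arg) for arg in op))
    return out


optimized = optimize(trace)
for op in optimized:
    print(op)
```

What remains touches only the frame: two field reads, two indexed loads, one type guard, the comparison, and the loop guard.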
Truffle + Graal
Just-in-Time Compilation with
Partial Evaluation
Oracle Labs
Truffle+Graal
• Java framework
– AST interpreters
• Based on HotSpot
JVM
43
Interpreter
Graal Compiler +
Truffle Partial Evaluator
http://www.ssw.uni-linz.ac.at/Research/Projects/JVM/Truffle.html
http://www.oracle.com/technetwork/oracle-labs/program-languages/overview/index-2301583.html
Garbage
Collector
…
+ Truffle
Framework
HotSpot JVM
Partial Evaluation Guided By AST
44
cnt
1
+cnt:=
if
cnt:= 0
[3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204.
Partial Evaluation inlines
based on Runtime Constants
class WhileNode(ASTNode):
child cond_expr
child body_expr
def execute(frame):
while True:
cond = cond_expr.execute_bool(frame)
if not cond:
break
body_expr.execute(frame)
45
while (cnt < 100) {
cnt := cnt + 1;
}
Partial Evaluation inlines
based on Runtime Constants
class WhileNode(ASTNode):
child cond_expr
child body_expr
def execute(frame):
while True:
cond = cond_expr.execute_bool(frame)
if not cond:
break
body_expr.execute(frame)
46
while (cnt < 100) {
cnt := cnt + 1;
}
class IntLessThan(ASTNode):
child left_expr
child right_expr
def execute_bool(frame):
try:
left = left_expr.execute_int()
except UnexpectedResult r:
...
try:
right = right_expr.execute_int()
except UnexpectedResult r:
...
return left < right
Partial Evaluation inlines
based on Runtime Constants
class WhileNode(ASTNode):
child cond_expr
child body_expr
def execute(frame):
while True:
try:
left = cond_expr.left_expr.execute_int()
except UnexpectedResult r:
...
try:
right = cond_expr.right_expr.execute_int()
except UnexpectedResult r:
...
cond = left < right
if not cond:
break
body_expr.execute(frame)
47
while (cnt < 100) {
cnt := cnt + 1;
}
Partial Evaluation inlines
based on Runtime Constants
class WhileNode(ASTNode):
child cond_expr
child body_expr
def execute(frame):
while True:
try:
left = cond_expr.left_expr.execute_int()
except UnexpectedResult r:
...
try:
right = cond_expr.right_expr.execute_int()
except UnexpectedResult r:
...
cond = left < right
if not cond:
break
body_expr.execute(frame)
while (cnt < 100) {
cnt := cnt + 1;
}
class IntVarRead(ASTNode):
final idx
def execute_int(frame):
if frame.is_int(idx):
return frame.local_int[idx]
else:
new_node = respecialize()
raise UnexpectedResult(new_node.ex
Partial Evaluation inlines
based on Runtime Constants
class WhileNode(ASTNode):
child cond_expr
child body_expr
def execute(frame):
while True:
try:
if frame.is_int(1):
left = frame.local_int[1]
else:
new_node = respecialize()
raise UnexpectedResult(new_node.execute())
except UnexpectedResult r:
...
try:
right = cond_expr.right_expr.execute_int()
except UnexpectedResult r:
...
cond = left < right
if not cond:
break
while (cnt < 100) {
cnt := cnt + 1;
}
Optimize Optimistically
class WhileNode(ASTNode):
child cond_expr
child body_expr
def execute(frame):
while True:
try:
if frame.is_int(1):
left = frame.local_int[1]
else:
new_node = respecialize()
raise UnexpectedResult(new_node.execute())
except UnexpectedResult r:
...
try:
right = cond_expr.right_expr.execute_int()
except UnexpectedResult r:
...
cond = left < right
if not cond:
break
while (cnt < 100) {
cnt := cnt + 1;
}
Optimize Optimistically
class WhileNode(ASTNode):
child cond_expr
child body_expr
def execute(frame):
while True:
if frame.is_int(1):
left = frame.local_int[1]
else:
__deopt_return_to_interp()
try:
right = cond_expr.right_expr.execute_int()
except UnexpectedResult r:
...
cond = left < right
if not cond:
break
body_expr.execute(frame)
while (cnt < 100) {
cnt := cnt + 1;
}
Partial Evaluation inlines
based on Runtime Constants
class WhileNode(ASTNode):
child cond_expr
child body_expr
def execute(frame):
while True:
if frame.is_int(1):
left = frame.local_int[1]
else:
__deopt_return_to_interp()
try:
right = cond_expr.right_expr.execute_int()
except UnexpectedResult r:
...
cond = left < right
if not cond:
break
body_expr.execute(frame)
while (cnt < 100) {
cnt := cnt + 1;
}
class IntLiteral(ASTNode):
final value
def execute_int(frame):
return value
Partial Evaluation inlines
based on Runtime Constants
class WhileNode(ASTNode):
child cond_expr
child body_expr
def execute(frame):
while True:
if frame.is_int(1):
left = frame.local_int[1]
else:
__deopt_return_to_interp()
try:
right = 100
except UnexpectedResult r:
...
cond = left < right
if not cond:
break
body_expr.execute(frame)
while (cnt < 100) {
cnt := cnt + 1;
}
class IntLiteral(ASTNode):
final value
def execute_int(frame):
return value
Classic Optimizations:
Dead Code Elimination
class WhileNode(ASTNode):
child cond_expr
child body_expr
def execute(frame):
while True:
if frame.is_int(1):
left = frame.local_int[1]
else:
__deopt_return_to_interp()
try:
right = 100
except UnexpectedResult r:
...
cond = left < right
if not cond:
break
body_expr.execute(frame)
while (cnt < 100) {
cnt := cnt + 1;
}
class IntLiteral(ASTNode):
final value
def execute_int(frame):
return value
Classic Optimizations:
Constant Propagation
class WhileNode(ASTNode):
child cond_expr
child body_expr
def execute(frame):
while True:
if frame.is_int(1):
left = frame.local_int[1]
else:
__deopt_return_to_interp()
right = 100
cond = left < right
if not cond:
break
body_expr.execute(frame)
while (cnt < 100) {
cnt := cnt + 1;
}
class IntLiteral(ASTNode):
final value
def execute_int(frame):
return value
Classic Optimizations:
Loop Invariant Code Motion
class WhileNode(ASTNode):
child cond_expr
child body_expr
def execute(frame):
while True:
if frame.is_int(1):
left = frame.local_int[1]
else:
__deopt_return_to_interp()
if not (left < 100):
break
body_expr.execute(frame)
while (cnt < 100) {
cnt := cnt + 1;
}
class WhileNode(ASTNode):
child cond_expr
child body_expr
def execute(frame):
if not frame.is_int(1):
__deopt_return_to_interp()
while True:
if not (frame.local_int[1] < 100):
break
body_expr.execute(frame)
while (cnt < 100) {
cnt := cnt + 1;
}
Classic Optimizations:
Loop Invariant Code Motion
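The effect of partial evaluation, an interpreter specialized to one fixed AST, can be imitated in plain Python by generating source code in which the node fields appear as constants. This is only a sketch of the idea: Truffle and Graal partially evaluate the interpreter's own code automatically, whereas here each node hand-writes its residual code. The `residual` methods, `partially_evaluate`, and the list-based frame are assumptions, and type checks and deoptimization are omitted.

```python
class IntLiteral:
    def __init__(self, value):
        self.value = value

    def execute_int(self, frame):
        return self.value

    def residual(self):
        return repr(self.value)


class IntVarRead:
    def __init__(self, idx):
        self.idx = idx

    def execute_int(self, frame):
        return frame[self.idx]

    def residual(self):
        return f"frame[{self.idx}]"


class IntAdd:
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def execute_int(self, frame):
        return self.left.execute_int(frame) + self.right.execute_int(frame)

    def residual(self):
        return f"({self.left.residual()} + {self.right.residual()})"


class IntLessThan:
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def execute_bool(self, frame):
        return self.left.execute_int(frame) < self.right.execute_int(frame)

    def residual(self):
        return f"({self.left.residual()} < {self.right.residual()})"


class IntVarWrite:
    def __init__(self, idx, expr):
        self.idx = idx
        self.expr = expr

    def execute(self, frame):
        frame[self.idx] = self.expr.execute_int(frame)

    def residual(self):
        return f"frame[{self.idx}] = {self.expr.residual()}"


class WhileNode:
    def __init__(self, cond_expr, body_expr):
        self.cond_expr = cond_expr
        self.body_expr = body_expr

    def execute(self, frame):
        while self.cond_expr.execute_bool(frame):
            self.body_expr.execute(frame)


def partially_evaluate(loop):
    """Build a function specialized to this AST; node fields become constants."""
    source = (
        "def compiled(frame):\n"
        f"    while {loop.cond_expr.residual()}:\n"
        f"        {loop.body_expr.residual()}\n"
    )
    namespace = {}
    exec(source, namespace)
    return namespace["compiled"], source


# while (cnt < 100) { cnt := cnt + 1; } with cnt in slot 1
loop = WhileNode(IntLessThan(IntVarRead(1), IntLiteral(100)),
                 IntVarWrite(1, IntAdd(IntVarRead(1), IntLiteral(1))))

compiled, source = partially_evaluate(loop)
print(source)

interp_frame = [0, 0]
loop.execute(interp_frame)

compiled_frame = [0, 0]
compiled(compiled_frame)
```

The generated function contains no node objects and no dispatch on `execute`, which is the same outcome the slides reach step by step through inlining, constant propagation, and dead code elimination.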
Compilation Unit based on User Program
Meta-Tracing Partial Evaluation
Guided by AST
58
cnt
1
+
cnt:
=
if
cnt:
=
0
cnt
1
+
cnt:
=if cnt:
=
0
[3] Würthinger et al., One VM to Rule Them
All, Onward! 2013, ACM, pp. 187-204.
[2] Bolz et al., Tracing the Meta-level: PyPy's
Tracing JIT Compiler, ICOOOLPS Workshop
2009, ACM, pp. 18-25.
WHAT’S POSSIBLE FOR A SIMPLE
INTERPRETER?
Results
59
Designed for Teaching:
• Simple
• Conceptual Clarity
• An Interpreter family
– in C, C++, Java, JavaScript,
RPython, Smalltalk
Used in the past by:
http://som-st.github.io
60
Self-Optimizing SOMs
61
SOMMT
RTruffleSOM
Meta-Tracing
RPython
SOMPE
TruffleSOM
Partial Evaluation +
Graal Compiler
on the HotSpot JVM
JIT Compiled JIT Compiled
github.com/SOM-st/TruffleSOM
github.com/SOM-st/RTruffleSOM
Java 8 -server vs. SOM+JIT
JIT-compiled Peak Performance
62
3.5x slower
(min. 1.6x, max. 6.3x)
RPython
2.8x slower
(min. 3%, max. 5x)
Truffle+Graal
[Box plots, two panels: Compiled SOMMT and Compiled SOMPE. Y-axis: runtime normalized to Java (compiled or interpreted), ticks at 1, 4, 8. Benchmarks in each panel: Bounce, BubbleSort, DeltaBlue, Fannkuch, Mandelbrot, NBody, Permute, Queens, QuickSort, Richards, Sieve, Storage, Towers]
Implementation: Smaller Than Lua
63
Meta-Tracing
SOMMT (RTruffleSOM)
Partial Evaluation
SOMPE (TruffleSOM)
KLOC: 1000 Lines of Code, without blank lines and comments
[Bar chart, log scale from 1 to 1000 KLOC. Bar values: 4.2, 9.8, 16, 260, 525, 562. Other labeled implementations: V8 JavaScript, HotSpot Java Virtual Machine, Dart VM, Lua 5.3 interp.]
CONCLUSION
64
Simple and Fast Interpreters are Possible!
• Self-optimizing AST interpreters
• RPython or Truffle for JIT Compilation
65
Literature on the ideas:
[1] Würthinger et al., Self-Optimizing AST Interpreters, Proc. of the 8th Dynamic Languages Symposium, 2012, pp. 73-82.
[2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
[3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204.
[4] Marr et al., Are We There Yet? Simple Language Implementation Techniques for the 21st Century. IEEE Software 31(5):60-67, 2014.
RPython
• #pypy on irc.freenode.net
• rpython.readthedocs.org
• Kermit Example interpreter
https://bitbucket.org/pypy/example-interpreter
• A Tutorial
http://morepypy.blogspot.be/2011/04/tutorial-writing-interpreter-with-pypy.html
• Language implementations
https://www.evernote.com/shard/s130/sh/4d42a591-c540-4516-9911-c5684334bd45/d391564875442656a514f7ece5602210
Truffle
• http://mail.openjdk.java.net/mailman/listinfo/graal-dev
• SimpleLanguage interpreter
https://github.com/OracleLabs/GraalVM/tree/master/graal/com.oracle.truffle.sl/src/com/oracle/truffle/sl
• A Tutorial
http://cesquivias.github.io/blog/2014/10/13/writing-a-language-in-truffle-part-1-a-simple-slow-interpreter/
• Project
– http://www.ssw.uni-linz.ac.at/Research/Projects/JVM/Truffle.html
– http://www.oracle.com/technetwork/oracle-labs/program-languages/overview/index-2301583.html
Big Thank You!
to both communities,
for help, answering questions, debugging support, etc…!!!
Languages: Small, Elegant, and Fast!
67
[AST diagrams of the cnt example]
[Box plots, two panels: Compiled SOMMT and Compiled SOMPE. Y-axis: runtime normalized to Java (compiled or interpreted), ticks at 1, 4, 8. Benchmarks in each panel: Bounce, BubbleSort, DeltaBlue, Fannkuch, Mandelbrot, NBody, Permute, Queens, QuickSort, Richards, Sieve, Storage, Towers]
3.5x slower
(min. 1.6x, max. 6.3x)
4.2 KLOC
RPython
2.8x slower
(min. 3%, max. 5x)
9.8 KLOC
Truffle+Graal
@smarr | http://stefan-marr.de

More Related Content

PPTX
Zero-Overhead Metaprogramming: Reflection and Metaobject Protocols Fast and w...
PPTX
Optimizing Communicating Event-Loop Languages with Truffle
PDF
C++ How I learned to stop worrying and love metaprogramming
PPTX
Async await in C++
PDF
Concurrency Concepts in Java
PPTX
Gor Nishanov, C++ Coroutines – a negative overhead abstraction
PDF
Address/Thread/Memory Sanitizer
PPTX
Accelerating Habanero-Java Program with OpenCL Generation
Zero-Overhead Metaprogramming: Reflection and Metaobject Protocols Fast and w...
Optimizing Communicating Event-Loop Languages with Truffle
C++ How I learned to stop worrying and love metaprogramming
Async await in C++
Concurrency Concepts in Java
Gor Nishanov, C++ Coroutines – a negative overhead abstraction
Address/Thread/Memory Sanitizer
Accelerating Habanero-Java Program with OpenCL Generation

What's hot (20)

PDF
Blocks & GCD
PDF
Exploiting Concurrency with Dynamic Languages
PDF
[JavaOne 2011] Models for Concurrent Programming
PDF
To Swift 2...and Beyond!
PPT
bluespec talk
PDF
Arduino C maXbox web of things slide show
PDF
Конверсия управляемых языков в неуправляемые
PPT
Deuce STM - CMP'09
PPTX
Introduction to Rust language programming
PPT
iOS Development with Blocks
PPTX
同態加密
PDF
Engineering fast indexes (Deepdive)
PPT
NS2: Binding C++ and OTcl variables
PPTX
Seeing with Python presented at PyCon AU 2014
PPTX
Blazing Fast Windows 8 Apps using Visual C++
PPTX
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
PPTX
Functional Reactive Programming with RxJS
PPT
NS2 Classifiers
PPTX
Return of c++
PPTX
C++ & Java JIT Optimizations: Finding Prime Numbers
Blocks & GCD
Exploiting Concurrency with Dynamic Languages
[JavaOne 2011] Models for Concurrent Programming
To Swift 2...and Beyond!
bluespec talk
Arduino C maXbox web of things slide show
Конверсия управляемых языков в неуправляемые
Deuce STM - CMP'09
Introduction to Rust language programming
iOS Development with Blocks
同態加密
Engineering fast indexes (Deepdive)
NS2: Binding C++ and OTcl variables
Seeing with Python presented at PyCon AU 2014
Blazing Fast Windows 8 Apps using Visual C++
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
Functional Reactive Programming with RxJS
NS2 Classifiers
Return of c++
C++ & Java JIT Optimizations: Finding Prime Numbers
Ad

Viewers also liked (7)

PPTX
Tracing versus Partial Evaluation: Which Meta-Compilation Approach is Better ...
PPT
low effort judgement
PPT
high effort judgement
PDF
Swatch Creative Strategy
PDF
Swatch case
PPTX
Buying Decision Making Process
PPTX
Consumer behaviour internal factors
Tracing versus Partial Evaluation: Which Meta-Compilation Approach is Better ...
low effort judgement
high effort judgement
Swatch Creative Strategy
Swatch case
Buying Decision Making Process
Consumer behaviour internal factors
Ad

Similar to Building High-Performance Language Implementations With Low Effort (20)

PDF
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
PDF
How to write clean & testable code without losing your mind
PPTX
Building Efficient and Highly Run-Time Adaptable Virtual Machines
PPTX
PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...
PDF
Arvindsujeeth scaladays12
PDF
Design Patterns - Compiler Case Study - Hands-on Examples
PDF
Andes open cl for RISC-V
PDF
Tools and Techniques for Understanding Threading Behavior in Android
PPTX
PVS-Studio, a solution for resource intensive applications development
PDF
Pydiomatic
PDF
Python idiomatico
PPTX
NvFX GTC 2013
PPTX
Lambdas puzzler - Peter Lawrey
PDF
MultiThreading-in-system-and-android-logcat-42-.pdf
PDF
PVS-Studio for Linux Went on a Tour Around Disney
PPTX
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Iterative Spark Developmen...
PPTX
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptx
PDF
Kotlin Backend Development 6 Yrs Recap. The Good, the Bad and the Ugly
PPTX
Track c-High speed transaction-based hw-sw coverification -eve
PDF
JVM Mechanics: When Does the JVM JIT & Deoptimize?
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
How to write clean & testable code without losing your mind
Building Efficient and Highly Run-Time Adaptable Virtual Machines
PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...
Arvindsujeeth scaladays12
Design Patterns - Compiler Case Study - Hands-on Examples
Andes open cl for RISC-V
Tools and Techniques for Understanding Threading Behavior in Android
PVS-Studio, a solution for resource intensive applications development
Pydiomatic
Python idiomatico
NvFX GTC 2013
Lambdas puzzler - Peter Lawrey
MultiThreading-in-system-and-android-logcat-42-.pdf
PVS-Studio for Linux Went on a Tour Around Disney
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Iterative Spark Developmen...
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptx
Kotlin Backend Development 6 Yrs Recap. The Good, the Bad and the Ugly
Track c-High speed transaction-based hw-sw coverification -eve
JVM Mechanics: When Does the JVM JIT & Deoptimize?

More from Stefan Marr (19)

PPTX


Building High-Performance Language Implementations With Low Effort

  • 1. Building High-Performance Language Implementations With Low Effort Stefan Marr FOSDEM 2015, Brussels, Belgium January 31st, 2015 @smarr http://stefan-marr.de
  • 2. Why should you care about how Programming Languages work? 2 SMBC: http://www.smbc-comics.com/?id=2088
  • 3. 3 SMBC: http://www.smbc-comics.com/?id=2088 Why should you care about how Programming Languages work? • Performance isn’t magic • Domain-specific languages • More concise • More productive • It’s easier than it looks • Often open source • Contributions welcome
  • 4. What’s “High-Performance”? Competitively Fast! [Bar chart, axis ticks 0 to 18, comparing Java, V8, C#, Dart, Python, Lua, PHP, Ruby.] Based on latest data from http://benchmarksgame.alioth.debian.org/ Geometric mean over available benchmarks. Disclaimer: Not indicative of application performance!
  • 5. What’s “Low Effort”? Small and Manageable. [Bar chart, axis ticks 1, 10, 100, 1000, for V8 JavaScript, HotSpot Java Virtual Machine, Dart VM, Lua 5.3 interp.; values shown: 16, 260, 525, 562.] KLOC: 1000 Lines of Code, without blank lines and comments
  • 6. Language Implementation Approaches. Interpreter: the Source Program is handed to the Interpreter at development time; at run time the Interpreter maps Input to Output. Simple, but often slow. Compiler: the Source Program is compiled to a Binary at development time; at run time the Binary maps Input to Output. More complex, but often faster. Not ideal for all languages.
  • 7. Modern Virtual Machines: Virtual Machine with Just-In-Time Compilation. [Diagram: Source Program, Interpreter, Runtime Info, Compiler, Binary, Input, Output; split into Development Time and Run Time.]
  • 8. VMs are Highly Complex. Components: Interpreter, Compiler, Optimizer, Garbage Collector, CodeGen, Foreign Function Interface, Threads and Memory Model, Debugging, Profiling, … Easily 500 KLOC. How to reuse most parts for a new language?
  • 9. How to reuse most parts for a new language? Make Interpreters Replaceable Components! [Diagram: several Interpreter components next to the shared Compiler, Optimizer, Garbage Collector, CodeGen, Foreign Function Interface, Threads and Memory Model, …]
  • 10. Interpreter-based Approaches Truffle + Graal with Partial Evaluation Oracle Labs RPython with Meta-Tracing [3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204. [2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
  • 11. SELF-OPTIMIZING TREES A Simple Technique for Language Implementation and Optimization [1] Würthinger, T.; Wöß, A.; Stadler, L.; Duboscq, G.; Simon, D. & Wimmer, C. (2012), Self- Optimizing AST Interpreters, in 'Proc. of the 8th Dynamic Languages Symposium' , pp. 73-82.
  • 13. A Simple Abstract Syntax Tree Interpreter 13 root_node = parse(file) root_node.execute(Frame()) if (condition) { cnt := cnt + 1; } else { cnt := 0; } cnt 1 + cnt: = if cnt: = 0 cond root_node
  • 14. Implementing AST Nodes 14 if (condition) { cnt := cnt + 1; } else { cnt := 0; } class Literal(ASTNode): final value def execute(frame): return value class VarWrite(ASTNode): child sub_expr final idx def execute(frame): val := sub_expr.execute(frame) frame.local_obj[idx]:= val return val class VarRead(ASTNode): final idx def execute(frame): return frame.local_obj[idx] cnt 1 + cnt: = if cnt: = 0 cond
  • 15. Self-Optimization by Node Specialization 15 cnt := cnt + 1 def UninitVarWrite.execute(frame): val := sub_expr.execute(frame) return specialize(val). execute_evaluated(frame, val) uninitialized variable write cnt 1 + cnt: = cnt: = def UninitVarWrite.specialize(val): if val instanceof int: return replace(IntVarWrite(sub_expr)) elif …: … else: return replace(GenericVarWrite(sub_expr)) specialized
  • 16. Self-Optimization by Node Specialization 16 cnt := cnt + 1 def IntVarWrite.execute(frame): try: val := sub_expr.execute_int(frame) return execute_eval_int(frame, val) except ResultExp, e: return respecialize(e.result). execute_evaluated(frame, e.result) def IntVarWrite.execute_eval_int(frame, anInt): frame.local_int[idx] := anInt return anInt int variable write cnt 1 + cnt: =
  • 17. Some Possible Self-Optimizations • Type profiling and specialization • Value caching • Inline caching • Operation inlining • Library Lowering 17
  • 18. Library Lowering for Array class createSomeArray() { return Array.new(1000, ‘fast fast fast’); } 18 class Array { static new(size, lambda) { return new(size).setAll(lambda); } setAll(lambda) { forEach((i, v) -> { this[i] = lambda.eval(); }); } } class Object { eval() { return this; } }
  • 19. Optimizing for Object Values 19 createSomeArray() { return Array.new(1000, ‘fast fast fast’); } .new Array global lookup method invocation 1000 int literal ‘fast’ string literal Object, but not a lambda Optimization potential
  • 20. Specialized new(size, lambda) def UninitArrNew.execute(frame): size := size_expr.execute(frame) val := val_expr.execute(frame) return specialize(size, val). execute_evaluated(frame, size, val) 20 createSomeArray() { return Array.new(1000, ‘fast fast fast’); } def UninitArrNew.specialize(size, val): if val instanceof Lambda: return replace(StdMethodInvocation()) else: return replace(ArrNewWithValue())
  • 21. Specialized new(size, lambda) def ArrNewWithValue.execute_evaluated(frame, size, val): return Array([val] * 1000) 21 createSomeArray() { return Array.new(1000, ‘fast fast fast’); } 1 specialized node vs. 1000x `this[i] = lambda.eval()` 1000x `eval() { return this; }` .new Array global lookup 1000 int literal ‘fast’ string literal specialized
  • 23. How to Get Fast Program Execution? 23 VarWrite.execute(frame) IntVarWrite.execute(frame) VarRead.execute(frame) Literal.execute(frame) ArrayNewWithValue.execute(frame) ..VW_execute() # bin ..IVW_execute() # bin ..VR_execute() # bin ..L_execute() # bin ..ANWV_execute() # bin Standard Compilation: 1 node at a time Minimal Optimization Potential
  • 24. Problems with Node-by-Node Compilation 24 cnt 1 + cnt: = Slow Polymorphic Dispatches def IntVarWrite.execute(frame): try: val := sub_expr.execute_int(frame) return execute_eval_int(frame, val) except ResultExp, e: return respecialize(e.result). execute_evaluated(frame, e.result) cnt: = Runtime checks in general
  • 25. Compilation Unit based on User Program Meta-Tracing Partial Evaluation Guided By AST 25 cnt 1 + cnt: = if cnt: = 0 cnt 1 + cnt: =if cnt: = 0 [3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204. [2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
  • 27. RPython • Subset of Python – Type-inferenced • Generates VMs 27 Interpreter source RPython Toolchain Meta-Tracing JIT Compiler Interpreter http://rpython.readthedocs.org/ Garbage Collector …
  • 28. Meta-Tracing of an Interpreter 28 cnt 1 +cnt:= if cnt:= 0 [2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
  • 29. Meta Tracers need to know the Loops class WhileNode(ASTNode): child cond_expr child body_expr def execute(frame): while True: jit_merge_point(node=self) cond = cond_expr.execute_bool(frame) if not cond: break body_expr.execute(frame) 29 while (cnt < 100) { cnt := cnt + 1; } Trace guard(cond_expr == Const(IntLessThan))
  • 30. Tracing Records one Concrete Execution class IntLessThan(ASTNode): child left_expr child right_expr def execute_bool(frame): try: left = left_expr.execute_int() except UnexpectedResult r: ... try: right = right_expr.execute_int() except UnexpectedResult r: ... return left < right while (cnt < 100) { cnt := cnt + 1; } Trace guard(cond_expr == Const(IntLessThan)) guard(left_expr == Const(IntVarRead))
  • 31. Tracing Records one Concrete Execution class IntVarRead(ASTNode): final idx def execute_int(frame): if frame.is_int(idx): return frame.local_int[idx] else: new_node = respecialize() raise UnexpectedResult(new_node.execute()) 31 while (cnt < 100) { cnt := cnt + 1; } Trace guard(cond_expr == Const(IntLessThan)) guard(left_expr == Const(IntVarRead)) i1 := left_expr.idx # Const(1)
  • 32. Tracing Records one Concrete Execution class IntVarRead(ASTNode): final idx def execute_int(frame): if frame.is_int(idx): return frame.local_int[idx] else: new_node = respecialize() raise UnexpectedResult(new_node.execute()) 32 while (cnt < 100) { cnt := cnt + 1; } Trace guard(cond_expr == Const(IntLessThan)) guard(left_expr == Const(IntVarRead)) i1 := left_expr.idx # Const(1) a1 := frame.layout i2 := a1[i1] guard(i2 == Const(F_INT))
  • 33. Tracing Records one Concrete Execution class IntVarRead(ASTNode): final idx def execute_int(frame): if frame.is_int(idx): return frame.local_int[idx] else: new_node = respecialize() raise UnexpectedResult(new_node.execute()) 33 while (cnt < 100) { cnt := cnt + 1; } Trace guard(cond_expr == Const(IntLessThan)) guard(left_expr == Const(IntVarRead)) i1 := left_expr.idx # Const(1) a1 := frame.layout i2 := a1[i1] guard(i2 == Const(F_INT)) i3 := left_expr.idx # Const(1) a2 := frame.local_int i4 := a2[i3]
  • 34. Tracing Records one Concrete Execution class IntLessThan(ASTNode): child left_expr child right_expr def execute_bool(frame): try: left = left_expr.execute_int() except UnexpectedResult r: ... try: right = right_expr.execute_int() except UnexpectedResult r: ... return left < right while (cnt < 100) { cnt := cnt + 1; } Trace guard(cond_expr == Const(IntLessThan)) guard(left_expr == Const(IntVarRead)) i1 := left_expr.idx # Const(1) a1 := frame.layout i2 := a1[i1] guard(i2 == Const(F_INT)) i3 := left_expr.idx # Const(1) a2 := frame.local_int i4 := a2[i3] guard_no_exception(Const(UnexpectedResult))
  • 35. Tracing Records one Concrete Execution class IntLessThan(ASTNode): child left_expr child right_expr def execute_bool(frame): try: left = left_expr.execute_int() except UnexpectedResult r: ... try: right = right_expr.execute_int() except UnexpectedResult r: ... return left < right while (cnt < 100) { cnt := cnt + 1; } Trace guard(cond_expr == Const(IntLessThan)) guard(left_expr == Const(IntVarRead)) i1 := left_expr.idx # Const(1) a1 := frame.layout i2 := a1[i1] guard(i2 == Const(F_INT)) i3 := left_expr.idx # Const(1) a2 := frame.local_int i4 := a2[i3] guard_no_exception(Const(UnexpectedResult)) guard(right_expr == Const(IntLiteral))
  • 36. Tracing Records one Concrete Execution class IntLessThan(ASTNode): child left_expr child right_expr def execute_bool(frame): try: left = left_expr.execute_int() except UnexpectedResult r: ... try: right = right_expr.execute_int() except UnexpectedResult r: ... return left < right while (cnt < 100) { cnt := cnt + 1; } Trace guard(cond_expr == Const(IntLessThan)) guard(left_expr == Const(IntVarRead)) i1 := left_expr.idx # Const(1) a1 := frame.layout i2 := a1[i1] guard(i2 == Const(F_INT)) i3 := left_expr.idx # Const(1) a2 := frame.local_int i4 := a2[i3] guard_no_exception(Const(UnexpectedResult)) guard(right_expr == Const(IntLiteral)) i5 := right_expr.value # Const(100)
  • 37. Tracing Records one Concrete Execution class IntLessThan(ASTNode): child left_expr child right_expr def execute_bool(frame): try: left = left_expr.execute_int() except UnexpectedResult r: ... try: right = right_expr.execute_int() except UnexpectedResult r: ... return left < right while (cnt < 100) { cnt := cnt + 1; } Trace guard(cond_expr == Const(IntLessThan)) guard(left_expr == Const(IntVarRead)) i1 := left_expr.idx # Const(1) a1 := frame.layout i2 := a1[i1] guard(i2 == Const(F_INT)) i3 := left_expr.idx # Const(1) a2 := frame.local_int i4 := a2[i3] guard_no_exception(Const(UnexpectedResult)) guard(right_expr == Const(IntLiteral)) i5 := right_expr.value # Const(100) guard_no_exception(Const(UnexpectedResult))
  • 38. Tracing Records one Concrete Execution class IntLessThan(ASTNode): child left_expr child right_expr def execute_bool(frame): try: left = left_expr.execute_int() except UnexpectedResult r: ... try: right = right_expr.execute_int() except UnexpectedResult r: ... return left < right while (cnt < 100) { cnt := cnt + 1; } Trace guard(cond_expr == Const(IntLessThan)) guard(left_expr == Const(IntVarRead)) i1 := left_expr.idx # Const(1) a1 := frame.layout i2 := a1[i1] guard(i2 == Const(F_INT)) i3 := left_expr.idx # Const(1) a2 := frame.local_int i4 := a2[i3] guard_no_exception(Const(UnexpectedResult)) guard(right_expr == Const(IntLiteral)) i5 := right_expr.value # Const(100) guard_no_exception(Const(UnexpectedResult)) b1 := i4 < i5
  • 39. Tracing Records one Concrete Execution class WhileNode(ASTNode): child cond_expr child body_expr def execute(frame): while True: jit_merge_point(node=self) cond = cond_expr.execute_bool(frame) if not cond: break body_expr.execute(frame) while (cnt < 100) { cnt := cnt + 1; } Trace guard(cond_expr == Const(IntLessThan)) guard(left_expr == Const(IntVarRead)) i1 := left_expr.idx # Const(1) a1 := frame.layout i2 := a1[i1] guard(i2 == Const(F_INT)) i3 := left_expr.idx # Const(1) a2 := frame.local_int i4 := a2[i3] guard_no_exception(Const(UnexpectedResult)) guard(right_expr == Const(IntLiteral)) i5 := right_expr.value # Const(100) guard_no_exception(Const(UnexpectedResult)) b1 := i4 < i5 guard_true(b1)
  • 40. Tracing Records one Concrete Execution class WhileNode(ASTNode): child cond_expr child body_expr def execute(frame): while True: jit_merge_point(node=self) cond = cond_expr.execute_bool(frame) if not cond: break body_expr.execute(frame) while (cnt < 100) { cnt := cnt + 1; } Trace guard(cond_expr == Const(IntLessThan)) guard(left_expr == Const(IntVarRead)) i1 := left_expr.idx # Const(1) a1 := frame.layout i2 := a1[i1] guard(i2 == Const(F_INT)) i3 := left_expr.idx # Const(1) a2 := frame.local_int i4 := a2[i3] guard_no_exception(Const(UnexpectedResult)) guard(right_expr == Const(IntLiteral)) i5 := right_expr.value # Const(100) guard_no_exception(Const(UnexpectedResult)) b1 := i4 < i5 guard_true(b1) ...
  • 41. Traces are Ideal for Optimization. Step 1 (recorded trace): guard(cond_expr == Const(IntLessThan)) guard(left_expr == Const(IntVarRead)) i1 := left_expr.idx # Const(1) a1 := frame.layout i2 := a1[i1] guard(i2 == Const(F_INT)) i3 := left_expr.idx # Const(1) a2 := frame.local_int i4 := a2[i3] guard_no_exception(Const(UnexpectedResult)) guard(right_expr == Const(IntLiteral)) i5 := right_expr.value # Const(100) guard_no_exception(Const(UnexpectedResult)) b1 := i4 < i5 guard_true(b1) ... Step 2: i1 := left_expr.idx # Const(1) a1 := frame.layout i1 := a1[Const(1)] guard(i1 == Const(F_INT)) i3 := left_expr.idx # Const(1) a2 := frame.local_int i4 := a2[i3] i5 := right_expr.value # Const(100) b1 := i2 < i5 guard_true(b1) ... Step 3: a1 := frame.layout i1 := a1[1] guard(i1 == F_INT) a2 := frame.local_int i2 := a2[1] b1 := i2 < 100 guard_true(b1) ...
  • 42. Truffle + Graal Just-in-Time Compilation with Partial Evaluation Oracle Labs
  • 43. Truffle+Graal • Java framework – AST interpreters • Based on HotSpot JVM 43 Interpreter Graal Compiler + Truffle Partial Evaluator http://www.ssw.uni-linz.ac.at/Research/Projects/JVM/Truffle.html http://www.oracle.com/technetwork/oracle-labs/program-languages/overview/index-2301583.html Garbage Collector … + Truffle Framework HotSpot JVM
  • 44. Partial Evaluation Guided By AST 44 cnt 1 +cnt:= if cnt:= 0 [3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204.
  • 45. Partial Evaluation inlines based on Runtime Constants class WhileNode(ASTNode): child cond_expr child body_expr def execute(frame): while True: cond = cond_expr.execute_bool(frame) if not cond: break body_expr.execute(frame) 45 while (cnt < 100) { cnt := cnt + 1; }
  • 46. Partial Evaluation inlines based on Runtime Constants class WhileNode(ASTNode): child cond_expr child body_expr def execute(frame): while True: cond = cond_expr.execute_bool(frame) if not cond: break body_expr.execute(frame) while (cnt < 100) { cnt := cnt + 1; } class IntLessThan(ASTNode): child left_expr child right_expr def execute_bool(frame): try: left = left_expr.execute_int() except UnexpectedResult r: ... try: right = right_expr.execute_int() except UnexpectedResult r: ... return left < right
  • 47. Partial Evaluation inlines based on Runtime Constants class WhileNode(ASTNode): child cond_expr child body_expr def execute(frame): while True: try: left = cond_expr.left_expr.execute_int() except UnexpectedResult r: ... try: right = cond_expr.right_expr.execute_int() except UnexpectedResult r: ... cond = left < right if not cond: break body_expr.execute(frame) while (cnt < 100) { cnt := cnt + 1; }
  • 48. Partial Evaluation inlines based on Runtime Constants class WhileNode(ASTNode): child cond_expr child body_expr def execute(frame): while True: try: left = cond_expr.left_expr.execute_int() except UnexpectedResult r: ... try: right = cond_expr.right_expr.execute_int() except UnexpectedResult r: ... cond = left < right if not cond: break body_expr.execute(frame) while (cnt < 100) { cnt := cnt + 1; } class IntVarRead(ASTNode): final idx def execute_int(frame): if frame.is_int(idx): return frame.local_int[idx] else: new_node = respecialize() raise UnexpectedResult(new_node.ex
  • 49. Partial Evaluation inlines based on Runtime Constants class WhileNode(ASTNode): child cond_expr child body_expr def execute(frame): while True: try: if frame.is_int(1): left = frame.local_int[1] else: new_node = respecialize() raise UnexpectedResult(new_node.execute()) except UnexpectedResult r: ... try: right = cond_expr.right_expr.execute_int() except UnexpectedResult r: ... cond = left < right if not cond: break while (cnt < 100) { cnt := cnt + 1; }
  • 50. Optimize Optimistically class WhileNode(ASTNode): child cond_expr child body_expr def execute(frame): while True: try: if frame.is_int(1): left = frame.local_int[1] else: new_node = respecialize() raise UnexpectedResult(new_node.execute()) except UnexpectedResult r: ... try: right = cond_expr.right_expr.execute_int() except UnexpectedResult r: ... cond = left < right if not cond: break while (cnt < 100) { cnt := cnt + 1; }
  • 51. Optimize Optimistically class WhileNode(ASTNode): child cond_expr child body_expr def execute(frame): while True: if frame.is_int(1): left = frame.local_int[1] else: __deopt_return_to_interp() try: right = cond_expr.right_expr.execute_int() except UnexpectedResult r: ... cond = left < right if not cond: break body_expr.execute(frame) while (cnt < 100) { cnt := cnt + 1; }
  • 52. Partial Evaluation inlines based on Runtime Constants class WhileNode(ASTNode): child cond_expr child body_expr def execute(frame): while True: if frame.is_int(1): left = frame.local_int[1] else: __deopt_return_to_interp() try: right = cond_expr.right_expr.execute_int() except UnexpectedResult r: ... cond = left < right if not cond: break body_expr.execute(frame) while (cnt < 100) { cnt := cnt + 1; } class IntLiteral(ASTNode): final value def execute_int(frame): return value
  • 53. Partial Evaluation inlines based on Runtime Constants class WhileNode(ASTNode): child cond_expr child body_expr def execute(frame): while True: if frame.is_int(1): left = frame.local_int[1] else: __deopt_return_to_interp() try: right = 100 except UnexpectedResult r: ... cond = left < right if not cond: break body_expr.execute(frame) while (cnt < 100) { cnt := cnt + 1; } class IntLiteral(ASTNode): final value def execute_int(frame): return value
  • 54. Classic Optimizations: Dead Code Elimination class WhileNode(ASTNode): child cond_expr child body_expr def execute(frame): while True: if frame.is_int(1): left = frame.local_int[1] else: __deopt_return_to_interp() try: right = 100 except UnexpectedResult r: ... cond = left < right if not cond: break body_expr.execute(frame) while (cnt < 100) { cnt := cnt + 1; } class IntLiteral(ASTNode): final value def execute_int(frame): return value
  • 55. Classic Optimizations: Constant Propagation class WhileNode(ASTNode): child cond_expr child body_expr def execute(frame): while True: if frame.is_int(1): left = frame.local_int[1] else: __deopt_return_to_interp() right = 100 cond = left < right if not cond: break body_expr.execute(frame) while (cnt < 100) { cnt := cnt + 1; } class IntLiteral(ASTNode): final value def execute_int(frame): return value
  • 56. Classic Optimizations: Loop Invariant Code Motion class WhileNode(ASTNode): child cond_expr child body_expr def execute(frame): while True: if frame.is_int(1): left = frame.local_int[1] else: __deopt_return_to_interp() if not (left < 100): break body_expr.execute(frame) while (cnt < 100) { cnt := cnt + 1; }
  • 57. Classic Optimizations: Loop Invariant Code Motion class WhileNode(ASTNode): child cond_expr child body_expr def execute(frame): if not frame.is_int(1): __deopt_return_to_interp() while True: if not (frame.local_int[1] < 100): break body_expr.execute(frame) while (cnt < 100) { cnt := cnt + 1; }
  • 58. Compilation Unit based on User Program Meta-Tracing Partial Evaluation Guided by AST 58 cnt 1 + cnt: = if cnt: = 0 cnt 1 + cnt: =if cnt: = 0 [3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204. [2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
  • 59. WHAT’S POSSIBLE FOR A SIMPLE INTERPRETER? Results 59
  • 60. Designed for Teaching: • Simple • Conceptual Clarity • An Interpreter family – in C, C++, Java, JavaScript, RPython, Smalltalk Used in the past by: http://som-st.github.io 60
  • 61. Self-Optimizing SOMs. SOMMT (RTruffleSOM): Meta-Tracing with RPython, JIT compiled, github.com/SOM-st/RTruffleSOM. SOMPE (TruffleSOM): Partial Evaluation + Graal Compiler on the HotSpot JVM, JIT compiled, github.com/SOM-st/TruffleSOM.
  • 62. Java 8 -server vs. SOM+JIT: JIT-compiled Peak Performance. [Plot: runtime normalized to Java (compiled or interpreted), axis ticks 1, 4, 8; panels Compiled SOMMT and Compiled SOMPE; benchmarks Bounce, BubbleSort, DeltaBlue, Fannkuch, Mandelbrot, NBody, Permute, Queens, QuickSort, Richards, Sieve, Storage, Towers.] RPython: 3.5x slower (min. 1.6x, max. 6.3x). Truffle+Graal: 2.8x slower (min. 3%, max. 5x).
  • 63. Implementation: Smaller Than Lua. Meta-Tracing SOMMT (RTruffleSOM): 4.2 KLOC. Partial Evaluation SOMPE (TruffleSOM): 9.8 KLOC. [Bar chart, axis ticks 1, 10, 100, 1000, also showing V8 JavaScript, HotSpot Java Virtual Machine, Dart VM, Lua 5.3 interp. with values 16, 260, 525, 562.] KLOC: 1000 Lines of Code, without blank lines and comments
  • 65. Simple and Fast Interpreters are Possible! • Self-optimizing AST interpreters • RPython or Truffle for JIT Compilation. Literature on the ideas: [1] Würthinger et al., Self-Optimizing AST Interpreters, Proc. of the 8th Dynamic Languages Symposium, 2012, pp. 73-82. [2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25. [3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204. [4] Marr et al., Are We There Yet? Simple Language Implementation Techniques for the 21st Century. IEEE Software 31(5):60-67, 2014.
  • 66. RPython • #pypy on irc.freenode.net • rpython.readthedocs.org • Kermit Example interpreter https://bitbucket.org/pypy/example-interpreter • A Tutorial http://morepypy.blogspot.be/2011/04/tutorial-writing-interpreter-with-pypy.html • Language implementations https://www.evernote.com/shard/s130/sh/4d42a591-c540-4516-9911-c5684334bd45/d391564875442656a514f7ece5602210 Truffle • http://mail.openjdk.java.net/mailman/listinfo/graal-dev • SimpleLanguage interpreter https://github.com/OracleLabs/GraalVM/tree/master/graal/com.oracle.truffle.sl/src/com/oracle/truffle/sl • A Tutorial http://cesquivias.github.io/blog/2014/10/13/writing-a-language-in-truffle-part-1-a-simple-slow-interpreter/ • Project – http://www.ssw.uni-linz.ac.at/Research/Projects/JVM/Truffle.html – http://www.oracle.com/technetwork/oracle-labs/program-languages/overview/index-2301583.html Big Thank You! to both communities, for help, answering questions, debugging support, etc…!!!
  • 67. Languages: Small, Elegant, and Fast! [Plot: runtime normalized to Java (compiled or interpreted), axis ticks 1, 4, 8; panels Compiled SOMMT and Compiled SOMPE; benchmarks Bounce, BubbleSort, DeltaBlue, Fannkuch, Mandelbrot, NBody, Permute, Queens, QuickSort, Richards, Sieve, Storage, Towers.] RPython: 3.5x slower (min. 1.6x, max. 6.3x), 4.2 KLOC. Truffle+Graal: 2.8x slower (min. 3%, max. 5x), 9.8 KLOC. @smarr | http://stefan-marr.de
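
The node specialization shown on the self-optimizing slides can be sketched in plain Python. This is a minimal illustration, not code from SOM, RPython, or Truffle: the class names `Node`, `Add`, `UninitAdd`, `IntAdd`, and `GenericAdd`, the `adopt`/`replace` helpers, and the list-based frame are all assumptions made for the example. It also assumes the root node is never replaced.

```python
class Node:
    """Base class: a node knows its parent so it can rewrite itself in the tree."""
    parent = None

    def adopt(self, child):
        child.parent = self
        return child

    def replace(self, new_node):
        # Swap this node for new_node in whichever parent field refers to it.
        for name, value in list(vars(self.parent).items()):
            if value is self:
                setattr(self.parent, name, new_node)
        new_node.parent = self.parent
        return new_node


class Literal(Node):
    def __init__(self, value):
        self.value = value

    def execute(self, frame):
        return self.value


class VarRead(Node):
    def __init__(self, idx):
        self.idx = idx

    def execute(self, frame):
        return frame[self.idx]


class VarWrite(Node):
    def __init__(self, idx, sub_expr):
        self.idx = idx
        self.sub_expr = self.adopt(sub_expr)

    def execute(self, frame):
        value = self.sub_expr.execute(frame)
        frame[self.idx] = value
        return value


class Add(Node):
    def __init__(self, left, right):
        self.left = self.adopt(left)
        self.right = self.adopt(right)

    def execute_evaluated(self, lhs, rhs):
        return lhs + rhs


class UninitAdd(Add):
    """Uninitialized state: the first execution decides which variant to use."""

    def execute(self, frame):
        lhs = self.left.execute(frame)
        rhs = self.right.execute(frame)
        if type(lhs) is int and type(rhs) is int:
            node = self.replace(IntAdd(self.left, self.right))
        else:
            node = self.replace(GenericAdd(self.left, self.right))
        return node.execute_evaluated(lhs, rhs)


class IntAdd(Add):
    """Speculates on ints; falls back to the generic node if the guess fails."""

    def execute(self, frame):
        lhs = self.left.execute(frame)
        rhs = self.right.execute(frame)
        if type(lhs) is int and type(rhs) is int:
            return lhs + rhs
        generic = self.replace(GenericAdd(self.left, self.right))
        return generic.execute_evaluated(lhs, rhs)


class GenericAdd(Add):
    def execute(self, frame):
        return self.execute_evaluated(self.left.execute(frame),
                                      self.right.execute(frame))
```

Building `VarWrite(0, UninitAdd(VarRead(0), Literal(1)))` and executing it on the frame `[41]` rewrites the addition to `IntAdd`; executing it later with a non-int value in the slot rewrites it again to `GenericAdd`. In Python the int and generic paths compute the same `+`, so only the tree rewriting is illustrated here; the speed-up comes from the JIT compiler seeing the specialized tree.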

Editor's Notes

  • #15: Self-optimizing interpreters are a good way to communicate with the compiler. AST nodes specialize themselves at runtime, based on observed types or values, using speculation and fallback handling. This communicates essential information to the optimizer.
  • #16: def IntVarWrite.execute_eval_int(frame, anInt): frame.local_int[idx] := anInt
  • #17: def IntVarWrite.execute_eval_int(frame, anInt): frame.local_int[idx] := anInt
  • #26: It is about how to determine the compilation unit. Remember, the interpreter is implemented in one language, and the compilation works on the meta-level. The main idea is that we want to take the implementation, add information from the execution context, and use that to do very aggressive and speculative optimizations on the interpreter implementation. This avoids the need to write custom JIT compilers.
  • #29: It is about how to determine the compilation unit. Remember, the interpreter is implemented in one language, and the compilation works on the meta-level. The main idea is that we want to take the implementation, add information from the execution context, and use that to do very aggressive and speculative optimizations on the interpreter implementation. This avoids the need to write custom JIT compilers.
  • #42: No control flow, just all instructions laid out directly. Ideal to identify data dependencies, remove redundant operations, and flatten abstraction levels for frameworks, etc.
  • #45: It is about how to determine the compilation unit. Remember, the interpreter is implemented in one language, and the compilation works on the meta-level. The main idea is that we want to take the implementation, add information from the execution context, and use that to do very aggressive and speculative optimizations on the interpreter implementation. This avoids the need to write custom JIT compilers.
  • #59: It is about how to determine the compilation unit. Remember, the interpreter is implemented in one language, and the compilation works on the meta-level. The main idea is that we want to take the implementation, add information from the execution context, and use that to do very aggressive and speculative optimizations on the interpreter implementation. This avoids the need to write custom JIT compilers.
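
The idea of treating the program's AST as a constant and specializing the interpreter against it can be illustrated with a toy. This is a hand-written sketch under loud assumptions: the tuple-based AST, the `OPS` table, and the functions `interpret`, `is_static`, `residual`, and `specialize` are invented for this example. A real partial evaluator such as Truffle's derives the residual code from the interpreter automatically; here the residual generator is written by hand, and speculation with deoptimization is left out.

```python
import operator

# Hypothetical mini-language: ("lit", 3), ("var", "cnt"), ("add", a, b), ("lt", a, b)
OPS = {"add": ("+", operator.add), "lt": ("<", operator.lt)}


def interpret(node, env):
    """Plain AST interpreter: walks the tree on every evaluation."""
    kind = node[0]
    if kind == "lit":
        return node[1]
    if kind == "var":
        return env[node[1]]
    if kind in OPS:
        return OPS[kind][1](interpret(node[1], env), interpret(node[2], env))
    raise ValueError("unknown node kind: %r" % (kind,))


def is_static(node):
    """True if the subtree contains no variable reads."""
    if node[0] == "lit":
        return True
    return node[0] in OPS and is_static(node[1]) and is_static(node[2])


def residual(node):
    """Python source for what interpret(node, env) computes, with the AST fixed."""
    if is_static(node):
        # Everything is known ahead of time: evaluate now (constant folding).
        return repr(interpret(node, {}))
    kind = node[0]
    if kind == "var":
        return "env[%r]" % (node[1],)
    if kind in OPS:
        return "(%s %s %s)" % (residual(node[1]), OPS[kind][0], residual(node[2]))
    raise ValueError("unknown node kind: %r" % (kind,))


def specialize(node):
    """Compile the residual program; the tree walk and dispatch are gone."""
    return eval("lambda env: " + residual(node))
```

For the condition of `while (cnt < 100)`, written as `("lt", ("var", "cnt"), ("lit", 100))`, `residual` produces the source `(env['cnt'] < 100)`: the dispatch on node kinds has disappeared and only the work that depends on runtime data remains, which mirrors what the slides do by hand with `WhileNode`.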