Programming Language Memory Models:
What do Shared Variables Mean?

            Hans-J. Boehm
              HP Labs


7/28/2011                          1
The problem
• Shared memory parallel programs are built on shared
  variables visible to multiple threads of control.
• But there is a lot of confusion about what those variables
  mean:
     –   Are concurrent accesses allowed?
     –   What is a concurrent access?
     –   When do updates become visible to other threads?
     –   Can an update be partially visible?
• Many recent efforts with serious technical issues:
     – Java, OpenMP 3.0, UPC, Go …



Credits and Disclaimers:
• Much of this work was done by others or jointly. I'm
  relying particularly on:
     – Basic approach: Sarita Adve, Mark Hill, Ada 83 …
     – JSR 133: Also Jeremy Manson, Bill Pugh, Doug Lea
     – C++11: Lawrence Crowl, Clark Nelson, Paul McKenney, Herb
       Sutter, …
     – Improved hardware models: Peter Sewell's group, many Intel,
       AMD, ARM, IBM participants …
     – Conflict exception work: Ceze, Lucia, Qadeer, Strauss
     – Recent Java Memory Model work: Sevcik, Aspinall, Cenciarelli
     – Race-free optimization: Pramod Joisha, Laura Effinger-Dean,
       Dhruva Chakrabarti
• But some of it is still controversial.
     – This reflects my personal views.
Outline
• Emerging consensus:
     – Interleaving semantics (Sequential Consistency)
     – But only for data-race-free programs
• Brief discussion of consequences
     – Software requirements
     – Hardware requirements
• Major remaining problem:
     – Java can't outlaw races.
     – We don't know how to give meaning to data races.
     – Some speculative solutions.

Naive threads programming model
     (Sequential Consistency)
• Threads behave as though their memory
  accesses were simply interleaved. (Sequential
  consistency)

        Thread 1              Thread 2
        x = 1;                y = 2;
        z = 3;

     – might be executed as
       x = 1; y = 2; z = 3;
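
A minimal sketch of what "simply interleaved" means for the two threads above: it enumerates every order of the three statements that keeps each thread's own program order. The helper names `interleave` and `all_interleavings` are hypothetical, chosen only for this illustration.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Collect every interleaving of two statement sequences that preserves
// the program order inside each thread.
void interleave(const std::vector<std::string>& t1, std::size_t i,
                const std::vector<std::string>& t2, std::size_t j,
                const std::string& prefix, std::set<std::string>& out) {
    if (i == t1.size() && j == t2.size()) {
        out.insert(prefix);  // both threads finished: one complete execution
        return;
    }
    if (i < t1.size())       // next step taken by thread 1
        interleave(t1, i + 1, t2, j, prefix + t1[i] + " ", out);
    if (j < t2.size())       // next step taken by thread 2
        interleave(t1, i, t2, j + 1, prefix + t2[j] + " ", out);
}

// The example from the slide: Thread 1 runs x = 1; z = 3; Thread 2 runs y = 2;
std::set<std::string> all_interleavings() {
    const std::vector<std::string> thread1 = {"x = 1;", "z = 3;"};
    const std::vector<std::string> thread2 = {"y = 2;"};
    std::set<std::string> out;
    interleave(thread1, 0, thread2, 0, "", out);
    return out;
}
```

Calling `all_interleavings()` yields three executions, one of which is the `x = 1; y = 2; z = 3;` order shown on the slide.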
Locks restrict interleavings
          Thread 1                 Thread 2
          lock(l);                 lock(l);
          r1 = x;                  r2 = x;
          x = r1+1;                x = r2+1;
          unlock(l);               unlock(l);

     – can only be executed as
          lock(l); r1 = x; x = r1+1; unlock(l); lock(l);
          r2 = x; x = r2+1; unlock(l);
     or
        lock(l); r2 = x; x = r2+1; unlock(l); lock(l);
        r1 = x; x = r1+1; unlock(l);
     since second lock(l) must follow first unlock(l)
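
The same example as a C++11 sketch, assuming `std::mutex` plays the role of `l`. The function name `run_locked_increments` is hypothetical. Because the two critical sections cannot overlap, both increments take effect.

```cpp
#include <mutex>
#include <thread>

// Two threads each perform r = x; x = r + 1; under the same lock.
int run_locked_increments() {
    int x = 0;
    std::mutex l;
    auto increment = [&x, &l] {
        std::lock_guard<std::mutex> guard(l);  // lock(l) ... unlock(l)
        int r = x;
        x = r + 1;
    };
    std::thread t1(increment);
    std::thread t2(increment);
    t1.join();
    t2.join();
    return x;  // read after both joins, so no race with the workers
}
```

Whichever thread acquires the lock first, the result is 2; without the lock, both threads could read 0 and the result could be 1.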

Atomic sections / transactional memory are
       just like a single global lock.




But this doesn't quite work …
• Limits reordering and other hardware/compiler
  transformations
     – “Dekker's” example (everything initially zero) should
       allow r1 = r2 = 0:
            Thread 1                Thread 2
            x = 1;                  y = 1;
            r1 = y;                 r2 = x;
     – Compilers like to move up loads.
     – Hardware likes to buffer stores.


Sensitive to memory access
                     granularity
            Thread 1             Thread 2
            x = 300;             x = 100;

• If memory is accessed a byte at a time, this may be
  executed as:
      x_high = 0;
      x_high = 1; // x = 256
      x_low = 44; // x = 300;
      x_low = 100; // x = 356;
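
The byte-at-a-time execution above can be replayed deterministically. This is a single-threaded simulation, not a real race; the type `SplitWord` and the function `bad_interleaving` are hypothetical names used only here.

```cpp
#include <cstdint>

// A 16-bit value stored as two separately written bytes.
struct SplitWord {
    std::uint8_t high = 0;
    std::uint8_t low = 0;
    unsigned value() const { return high * 256u + low; }
};

// Replays the four byte stores in the order shown on the slide.
unsigned bad_interleaving() {
    SplitWord x;
    x.high = 100 >> 8;    // Thread 2: x_high = 0
    x.high = 300 >> 8;    // Thread 1: x_high = 1    (x == 256)
    x.low  = 300 & 0xFF;  // Thread 1: x_low  = 44   (x == 300)
    x.low  = 100 & 0xFF;  // Thread 2: x_low  = 100  (x == 356)
    return x.value();
}
```

The final value is 356, which neither thread ever stored.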


And this is at too low a level …
• And taking advantage of sequential consistency
  involves reasoning about memory access
  interleaving:
     – Much too hard.
     – Want to reason about larger “atomic” code regions
             • which can't be visibly interleaved.




Real threads programming model
                 (1)
 • Two memory accesses conflict if they
      – access the same scalar object*, e.g. variable.
      – at least one access is a store.
      – E.g. x = 1; and r2 = x; conflict
 • Two ordinary memory accesses participate in a data
   race if they
      – conflict, and
      – can occur simultaneously
             • i.e. appear as adjacent operations (diff. threads) in interleaving.
 • A program is data-race-free (on a particular input) if no
   sequentially consistent execution results in a data race.

              * or contiguous sequence of bit-fields
Real threads programming model
                 (2)
• Sequential consistency only for data-race-
  free programs!
     – Avoid anything else.
• Data races are prevented by
     – locks (or atomic sections) to restrict
       interleaving
     – declaring synchronization variables
             • (in two slides …)


Alternate data-race-free definition:
          happens-before
• Memory access a happens-before b if
   – a precedes b in program order.
   – a and b are synchronization operations, and b observes the
     results of a, thus enforcing ordering between them.
      • e.g. a unlocks m, b subsequently locks m.
   – Or there is a c such that a happens-before c and c happens-before b.
• Two ordinary conflicting memory operations a and b participate in a
  data race if neither
   – a happens-before b,
   – nor b happens-before a.
• Set of data-race-free programs usually the same.

   Thread 1:            Thread 2:
   lock(m);             lock(m);
   t = x + 1;           t = x + 1;
   x = t;               x = t;
   unlock(m)            unlock(m)
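
One common way to check the happens-before definition on a recorded trace is with vector clocks. The sketch below is an illustration under that assumption, not something the slides prescribe; `Event`, `Op`, `Access`, and `has_data_race` are hypothetical names, and only lock/unlock are modeled as synchronization.

```cpp
#include <algorithm>
#include <map>
#include <vector>

enum class Op { Lock, Unlock, Read, Write };
struct Event { int thread; Op op; int target; };  // target: lock id or variable id
struct Access { int thread; int epoch; bool is_write; };
using Clock = std::vector<int>;

// Returns true if two conflicting accesses in the trace are unordered
// by happens-before (program order plus unlock -> later lock edges).
bool has_data_race(const std::vector<Event>& trace, int num_threads) {
    std::vector<Clock> clock(num_threads, Clock(num_threads, 0));
    for (int t = 0; t < num_threads; ++t) clock[t][t] = 1;
    std::map<int, Clock> released;            // lock id -> clock at last unlock
    std::map<int, std::vector<Access>> seen;  // variable id -> earlier accesses

    for (const Event& e : trace) {
        Clock& c = clock[e.thread];
        switch (e.op) {
        case Op::Lock: {                      // observe the previous unlock
            auto it = released.find(e.target);
            if (it != released.end())
                for (int t = 0; t < num_threads; ++t)
                    c[t] = std::max(c[t], it->second[t]);
            break;
        }
        case Op::Unlock:                      // publish this thread's history
            released[e.target] = c;
            ++c[e.thread];
            break;
        case Op::Read:
        case Op::Write: {
            const bool is_write = (e.op == Op::Write);
            for (const Access& a : seen[e.target]) {
                const bool conflict = a.is_write || is_write;
                const bool ordered =
                    a.thread == e.thread || a.epoch <= c[a.thread];
                if (conflict && !ordered) return true;
            }
            seen[e.target].push_back({e.thread, c[e.thread], is_write});
            break;
        }
        }
    }
    return false;
}
```

Fed the two locked threads from the slide, the function reports no race; the same accesses with the lock and unlock events removed are reported as racing.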
Synchronization variables
•   Java: volatile, java.util.concurrent.atomic.
•   C++11: atomic<int>, atomic_int
•   C1x: _Atomic(int), _Atomic int, atomic_int
•   Guarantee indivisibility of operations.
•   “Don't count” in determining whether there is a data race:
     – Programs with “races” on synchronization variables are still
       sequentially consistent.
     – Though there may be “escapes” (Java, C++11).
• Dekker's algorithm “just works” with synchronization
  variables.
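
A sketch of the core of the Dekker example with C++11 synchronization variables, assuming the default sequentially consistent `std::atomic` operations. The function name `dekker_sees_both_zero` is hypothetical.

```cpp
#include <atomic>
#include <thread>

// Runs the two-thread example `runs` times and reports whether the
// outcome r1 == r2 == 0 was ever observed.
bool dekker_sees_both_zero(int runs) {
    for (int i = 0; i < runs; ++i) {
        std::atomic<int> x(0), y(0);
        int r1 = -1, r2 = -1;
        std::thread t1([&] { x = 1; r1 = y; });  // store x, then load y
        std::thread t2([&] { y = 1; r2 = x; });  // store y, then load x
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) return true;
    }
    return false;
}
```

With `x` and `y` declared `std::atomic<int>`, the language guarantees an interleaving-based result, so the function returns false. If `x` and `y` were plain `int`, the program would have a data race and no such guarantee would apply.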



As expressive as races
Double-checked locking:      Double-checked locking:
Wrong!                       Correct (C++11):
bool x_init;                 atomic<bool> x_init;

if (!x_init) {               if (!x_init) {
   l.lock();                    l.lock();
   if (!x_init) {               if (!x_init) {
            initialize x;           initialize x;
            x_init = true;          x_init = true;
     }                           }
     l.unlock();                 l.unlock();
}                            }
use x;                       use x;
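
A compilable sketch of the correct version, assuming `x` is an `int` and "initialize x" means storing 42. The names `Lazy`, `init_count`, and `lazy_init_runs_once` are hypothetical and exist only to make the behavior observable.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

struct Lazy {
    std::atomic<bool> x_init{false};  // synchronization variable
    std::mutex l;
    int x = 0;
    int init_count = 0;               // only touched while holding l

    int get() {
        if (!x_init) {                              // first, unlocked check
            std::lock_guard<std::mutex> guard(l);
            if (!x_init) {                          // second check under the lock
                x = 42;                             // "initialize x"
                ++init_count;
                x_init = true;
            }
        }
        return x;                                   // "use x"
    }
};

// Several threads call get(); each must see the initialized value,
// and initialization must have run exactly once.
bool lazy_init_runs_once(int num_threads) {
    Lazy lazy;
    std::vector<int> seen(num_threads, 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i)
        threads.emplace_back([&lazy, &seen, i] { seen[i] = lazy.get(); });
    for (auto& t : threads) t.join();
    for (int v : seen)
        if (v != 42) return false;
    return lazy.init_count == 1;
}
```

In everyday C++11 code, `std::call_once` or a function-local static usually covers the same need with less room for error.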


Data Races Revisited
• Are defined in terms of sequentially
  consistent executions.
• If x and y are initially zero, this does not
  have a data race:
            Thread 1       Thread 2
            if (x)         if (y)
              y = 1;         x = 1;




SC for DRF programming model
         advantages over SC
• Supports important hardware & compiler optimizations.
• DRF restriction ⇒ Synchronization-free code sections
  appear to execute atomically, i.e. without visible
  interleaving.
     – If one didn't, the observer noticing it would have to race with it:

 Thread 1 (not atomic):   Thread 2(observer):
      a = 1;
                           if (a == 1 && b == 0) {

      b = 1;
                               …
                           }


SC for DRF implementation model
               (1)
• Very restricted reordering of memory operations around
  synchronization operations:
  – Compiler either understands these, or treats them as opaque,
    potentially updating any location.
  – Synchronization operations include instructions to limit or
    prevent hardware reordering (“memory fences”).
     • e.g. lock acquisition, release, atomic store, might contain
       memory fences.

  [Diagram: lock(); synch-free code region; unlock(); synch-free code region]
SC for DRF implementation model
              (2)
• Code may be reordered between synchronization operations.
  – Another thread can only tell if it accesses the same data between
    reordered operations.
  – Such an access would be a data race.
• If data races are disallowed (e.g. Posix, Ada, C++11, not Java),
  compiler may assume that variables don't change asynchronously.

  [Diagram: lock(); synch-free code region; unlock(); synch-free code region]
Some variants

Language        Memory model                             Data races are errors
C++11 draft,    SC for DRF, with explicit, easily        X
C1x draft       avoidable exceptions
Java            SC for DRF, with some exceptions.
                More details later.
Ada83           SC for DRF (?)                           X
Pthreads        SC for DRF (??)                          X
OpenMP          SC for DRF (??, except atomics)          X
Fortran 2008    SC for DRF (??, except atomics)          X
C#, .Net        SC for DRF when restricted to
                locks + Interlocked (???)
The exceptions in C++11:
atomic<bool> x_init;

if (!x_init.load(memory_order_acquire)) {
   l.lock();
   if (!x_init.load(memory_order_relaxed)) {
         initialize x;
         x_init.store(true, memory_order_release);
    }
    l.unlock();
}
use x;

We'll ignore this for the rest of the talk …
Data Races as Errors
• In many languages, data races are errors
     – In the sense of out-of-bounds array stores in C.
     – Anything can happen:
         • E.g. program can crash, local variables can appear
           to change, etc.
     – There are no benign races in e.g. C++11 or Ada 83.
     – Compilers may, and do, assume there are no races.
       If that assumption is false, all bets are off.
     – See e.g. Boehm, How to miscompile programs with
       “benign” data races, HotPar '11.
How a data race may cause a
                    crash
unsigned x;

if (x < 3) {
     … // async x change
     switch(x) {
         case 0: …
         case 1: …
         case 2: …
     }
}

• Assume switch statement compiled as branch table.
• May assume x is in range.
• Asynchronous change to x causes wild branch.
   – Not just wrong value.
Note: We can usually ignore
              wait/notify
• In most languages wait can return without
  notify.
     – “Spurious wakeup”
• Except for termination issues:
     – wait() ≡ unlock(); lock()
     – notify() ≡ nop
• Applies to existence of data races, partial
  correctness, flow analysis.
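
Because wait can return without a notify, portable code re-checks its condition in a loop. A minimal C++11 sketch, assuming `std::condition_variable` stands in for wait/notify; the function name `wait_for_value` and the variables `ready` and `data` are hypothetical.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// A consumer waits until a producer has published a value.
int wait_for_value() {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;  // protected by m
    int data = 0;        // protected by m

    std::thread producer([&] {
        {
            std::lock_guard<std::mutex> guard(m);
            data = 42;
            ready = true;
        }
        cv.notify_one();
    });

    int result = 0;
    {
        std::unique_lock<std::mutex> lock(m);
        while (!ready)       // re-check: wait() may wake up spuriously
            cv.wait(lock);   // releases m while blocked, reacquires on return
        result = data;
    }
    producer.join();
    return result;
}
```

The loop stays correct even if `cv.wait(lock)` behaved like a bare unlock followed by lock, which is why wait and notify can be ignored when reasoning about races and partial correctness.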
Another note on data race
                     definition
• We define it in terms of scalar accesses, but
• Container libraries should ensure that
            Container accesses don't race ⇒
            No races on memory locations
• This means
     – Accesses to hidden shared state (caches,
       allocation) must be locked by implementation.
     – User must lock for container-level races.
• This is often the correct library thread-safety
  condition.
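
A sketch of the user's half of that contract, assuming `std::vector` as the container: concurrent `push_back` calls are container-level conflicts, so the caller supplies the lock. The function name `concurrent_push_back` is hypothetical.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Two threads append to one vector; the caller's mutex prevents
// the container-level race.
std::size_t concurrent_push_back(int per_thread) {
    std::vector<int> v;
    std::mutex m;
    auto worker = [&v, &m, per_thread](int base) {
        for (int i = 0; i < per_thread; ++i) {
            std::lock_guard<std::mutex> guard(m);  // user-level locking
            v.push_back(base + i);
        }
    };
    std::thread t1(worker, 0);
    std::thread t2(worker, per_thread);
    t1.join();
    t2.join();
    return v.size();
}
```

Any hidden shared state, such as the allocator behind the vector, is the implementation's responsibility and needs no extra locking here.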
Compilers must not introduce data
             races
• Single thread compilers currently may add
  data races: (PLDI 05)
     struct {char a; char b;} x;

     Source:            May be compiled as:
     x.a = 'z';         tmp = x;
                        tmp.a = 'z';
                        x = tmp;

     – x.a = 1 in parallel with x.b = 1 may fail to
       update x.b.
• … and much more interesting examples.
• Still broken in gcc with bit-fields.
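
Under the C++11 rules, distinct non-bit-field members are separate memory locations, so the two updates below do not race and both must survive. A small sketch; the names `Pair` and `adjacent_field_updates_survive` are hypothetical.

```cpp
#include <thread>

struct Pair { char a; char b; };

// Two threads update adjacent fields of the same struct in parallel.
bool adjacent_field_updates_survive(int runs) {
    for (int i = 0; i < runs; ++i) {
        Pair x = {0, 0};
        std::thread t1([&x] { x.a = 1; });
        std::thread t2([&x] { x.b = 1; });
        t1.join();
        t2.join();
        // A compiler that rewrote x.a = 1 as a whole-struct
        // read-modify-write could lose the update to x.b here.
        if (x.a != 1 || x.b != 1) return false;
    }
    return true;
}
```

A conforming C++11 implementation returns true; adjacent bit-fields are the exception, since a contiguous sequence of them counts as one memory location.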
Some restrictions are a bit more
               annoying:
• Compiler may not introduce “speculative” stores:
     int count;   // global, possibly shared
     …
     for (p = q; p != 0; p = p -> next)
        if (p -> data > 0) ++count;

     may not be transformed into:

     int count;   // global, possibly shared
     …
     reg = count;
     for (p = q; p != 0; p = p -> next)
        if (p -> data > 0) ++reg;
     count = reg; // may spuriously assign to count
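
The difference between the two loops shows up even in a single thread if stores to `count` are instrumented. In this sketch the wrapper `Counted`, which records how many stores happen, and the function names are hypothetical; the list contains no positive elements.

```cpp
struct Node { int data; Node* next; };

// A counter that records how many times it is stored to.
struct Counted {
    int value = 0;
    int stores = 0;
    void store(int v) { value = v; ++stores; }
};

// Original loop: stores to count only when a positive element is found.
void count_positive_original(Node* q, Counted& count) {
    for (Node* p = q; p != nullptr; p = p->next)
        if (p->data > 0) count.store(count.value + 1);
}

// Register-promoted loop: always stores to count once at the end.
void count_positive_promoted(Node* q, Counted& count) {
    int reg = count.value;
    for (Node* p = q; p != nullptr; p = p->next)
        if (p->data > 0) ++reg;
    count.store(reg);  // the "speculative" store
}

int stores_with_no_positive_data(bool promoted) {
    Node second = {-2, nullptr};
    Node first = {-1, &second};
    Counted count;
    if (promoted) count_positive_promoted(&first, count);
    else          count_positive_original(&first, count);
    return count.stores;
}
```

The original performs no store, while the promoted version performs one. If another thread updates `count` under its own synchronization at the same time, that extra store introduces a data race the source program did not have.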

Introducing data races:
            Potentially infinite loops
     Original:                   Transformed (?):
     while (…) { *p = 1; }       x = 1;
     x = 1;                      while (…) { *p = 1; }


• x not otherwise accessed by loop.
• Prohibited in Java.
• Allowed by giving undefined behavior to such
  infinite loops in C++0x. (Controversial.)
• Common textbook analyses treat as equivalent?

Note on program analysis for
             optimization
• Lots of research papers on analysis of
  sequentially consistent parallel programs.
• AFAIK: Real compilers don’t do that!
     – For good reason.
     – Use sequential optimization in sync-free regions.
• SC analysis is
     – Generally sound, but pessimistic for C++11 (minus
       SC escapes)
     – Very much so with incomplete information
     – Unsound for Java (but …)
Some positive optimization
            consequences
• Rules are finally clear!
     – Optimization does not need to preserve full sequential
       consistency!
• In languages that outlaw data races
     – Compilers already use that.
     – It can be leveraged more aggressively:
     – Assuming no other synchronization, x not modified by this thread,
       x is loop invariant in
         while (…) {
              … x …
              l.lock(); … l.unlock();
         }

Language spec challenge 1
• Some common language features almost
  unavoidably introduce data races.
• Most significant example:
     – Detached/daemon threads, combined with
     – Destructors / finalizers for static variables
     – Detached threads call into libraries that may
       access static variables
             • Even while they're being cleaned up.


Language spec challenge 2:
• Some really awful code (don’t try this at home!!):

       Thread 1:            Thread 2:
       x = 42;              while (m.trylock()==SUCCESS)
     ? m.lock();              m.unlock();
                            assert (x == 42);

• Disclaimer: Example requires tweaking to be pthreads-compliant.
• Can the assertion fail?
• Many implementations: Yes
• Traditional specs: No. C++11: Yes
• Trylock() can effectively fail spuriously!
Byte store instructions
• x.c = 'a'; may not visibly read and
  rewrite adjacent fields.
• Byte stores must be implemented with
     – Byte store instruction, or
     – Atomic read-modify-write.
            • Typically expensive on multiprocessors.
            • Often cheaply implementable on uniprocessors.



Sequential consistency must be
              enforceable
• Programs using only synchronization variables
  must be sequentially consistent.
• Compiler literature contains many papers on
  enforcing sequential consistency by adding
  fences. But:
     – Not really possible on Itanium.
     – Wasn't possible on X86 until the re-revision of the
       spec last year.
     – Took months of discussions with PowerPC architects
       to conclude it's (barely, sort of) possible there.
• The core issue is “write atomicity”:
Can fences enforce SC?
Unclear that hardware fences can ensure sequential
consistency. “IRIW” example:
x, y initially zero. Fences between every instruction pair!
 Thread 1:     Thread 2:              Thread 3:     Thread 4:
 x = 1;        r1 = x; (1)            y = 1;        r3 = y; (1)
               fence;                               fence;
               r2 = y; (0)                          r4 = x; (0)

                x set first!                       y set first!

 Fully fenced, not sequentially consistent. Does hardware allow it?
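
At the language level, C++11 sequentially consistent atomics rule this outcome out, whatever the hardware needs to do to enforce it. A sketch of IRIW with `std::atomic` in place of explicit fences; the function name `iriw_violates_sc` is hypothetical.

```cpp
#include <atomic>
#include <thread>

// Runs IRIW `runs` times and reports whether the two readers ever
// disagreed about the order of the two independent writes.
bool iriw_violates_sc(int runs) {
    for (int i = 0; i < runs; ++i) {
        std::atomic<int> x(0), y(0);
        int r1 = 0, r2 = 0, r3 = 0, r4 = 0;
        std::thread t1([&] { x = 1; });
        std::thread t2([&] { r1 = x; r2 = y; });  // sees x first?
        std::thread t3([&] { y = 1; });
        std::thread t4([&] { r3 = y; r4 = x; });  // sees y first?
        t1.join();
        t2.join();
        t3.join();
        t4.join();
        if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0) return true;
    }
    return false;
}
```

The function returns false on a conforming implementation. The cost of that guarantee on hardware without write atomicity is exactly the concern raised here.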

Why does it matter?
• Nobody cares about IRIW!?
• It's a pain to enforce on at least PowerPC.
• Many people (Sarita Adve, Doug Lea,
  Vijay Saraswat) spent about a year trying
  to relax SC requirement here.
• (Personal opinion) The results were
  incomprehensible, and broke more
  important code.
• No viable alternatives!
Acceptable hardware memory
             models
• More challenging requirements:
     1.     Precise memory model specification
     2.     Byte stores
     3.     Cheap mechanism to enforce write atomicity
     4.     Dirt cheap mechanism to enforce data
            dependency ordering(?) (Java final fields)
• Other than that, all standard approaches
  appear workable, but …

Replace fences completely?
        Synchronization variables on X86
• atomic store:     ~1 cycle   dozens of cycles
     – store (mov); mfence;
• atomic load:      ~1 cycle
     – load (mov)
• Store implicitly ensures that prior memory
  operations become visible before store.
• Load implicitly ensures that subsequent memory
  operations become visible later.
• Sole reason for mfence: Order atomic store
  followed by atomic load.
Fence enforces all kinds of
 additional, unobservable orderings
• s is a synchronization variable:
       x = 1;
       s = 2; // includes fence
       r1 = y;
• Prevents reordering of x = 1 and r1 = y;
   – final load delayed until assignment to x is visible.
• But this ordering is invisible to non-racing threads
     – …and expensive to enforce?
• We need a tiny fraction of mfence functionality.


Data Races in Java
• C++11 leaves data race semantics
  undefined.
     – “catch fire” semantics
• Java supports sand-boxed code.
• Don't know how to prevent data-races in
  sand-boxed, malicious code.
• Java must provide some guarantees in the
  presence of data races.
Interesting data race outcome?
            x, y initially null,
            Loads may or may not see racing stores?

Thread 1:                            Thread 2:
  r1 = x;                              r2 = y;
  y = r1;                              x = r2;



            Outcome: x = y = r1 = r2 =
            “<your bank password here>”


The Java Solution
Quotation from 17.4.8, Java Language Specification, 3rd
edition, omitted, to avoid possible copyright questions. It
addresses the issue by explicitly outlawing the causality
cycles required to get the dubious result from the last slide.

The important point is that this is a rather complex
mathematical specification.




                                               …
Complicated, but nice properties?
• Manson, Pugh, Adve: The Java Memory
  Model, POPL 05


 Quotation from section 9.1.2 of above paper omitted, to avoid
 possible copyright questions. This asserts (Theorem 1) that
 non-conflicting operations may be reordered by a compiler.




Much nicer than prior attempts, but:
• Aspinall, Sevcik, “Java Memory Model Examples: Good,
  Bad, and Ugly”, VAMP 2007 (also ECOOP 2008 paper)



Quotation from above paper omitted, to avoid possible
copyright questions. This ends in the statement:
“This falsifies Theorem 1 of [paper from previous slide].”



Note: The underlying observation is due to Pietro Cenciarelli.



Why is this hard?
• Want
     – Constrained race semantics for essential
       security properties.
     – Unconstrained race semantics to support
       compiler and hardware optimizations.
     – Simplicity.
• No known good resolution.


A Different Approach
• Outlaw data races.
• Require violations to be detectable!
     – Even in malicious sand-boxed code.


• Possible approaches:
     – Statically prevent data races.
            • Tried repeatedly, ongoing work …
     – Dynamically detect the relevant data races.
Dynamic Race Detection
• Need to guarantee one of:
     – Program is data-race free and provides SC execution (done),
     – Program contains a data race and raises an exception, or
     – Program exhibits simple semantics anyway, e.g.
            • Sequentially consistent
            • Synchronization-free regions are atomic
• This is significantly cheaper than fully accurate data-race
  detection.
     – Track byte-level R/W information
     – Mostly in cache
     – As opposed to epoch number + thread id per byte



For more information:
• Boehm, “Threads Basics”, HPL TR 2009-259.
• Boehm, Adve, “Foundations of the C++ Concurrency Memory
  Model”, PLDI 08.
• Ševčík and Aspinall, “On Validity of Program Transformations in the
  Java Memory Model”, ECOOP 08.
• Sewell et al, “x86-TSO: A Rigorous and Usable Programmer's
  Model for x86”, CACM, July 2010.
• S. V. Adve, Boehm, “Memory Models: A Case for Rethinking Parallel
  Languages and Hardware”, CACM, August 2010.
• Lucia, Strauss, Ceze, Qadeer, Boehm, “Conflict Exceptions:
  Providing Simple Parallel Language Semantics with Precise
  Hardware Exceptions”, ISCA 2010. Also Marino et al, PLDI 10.
• Effinger-Dean, Boehm, Chakrabarti, Joisha, “Extended Sequential
  Reasoning for Data-Race-Free Programs”, MSPC 11.



Questions?




Backup slides




Introducing Races:
            Register Promotion 2
[g is global]

Original:                   Transformed:
for(...) {                  r = g;
    if(mt) lock();          for(...) {
    use/update g;             if(mt) {
    if(mt) unlock();            g = r; lock(); r = g;
}                             }
                              use/update r instead of g;
                              if(mt) {
                                g = r; unlock(); r = g;
                              }
                            }
                            g = r;


Trylock:
            Critical section reordering?
• Reordering of memory operations with respect to critical
  sections:

    [Figure: lock() … unlock() critical sections for three cases:
     Expected (& Java), Naïve pthreads, Optimized pthreads]
Some open source pthread lock
   implementations (2006):
Implementation   Architecture        Lock type        Status
NPTL             {Alpha, PowerPC}    {mutex, spin}    technically incorrect
NPTL             Itanium (&X86)      mutex            Correct, slow
NPTL             {Itanium, X86}      spin             Correct
FreeBSD          Itanium             spin             Incorrect

[Figure: one lock() … unlock() diagram per implementation]
 
Lect04
Lect04
Vin Voro
 
Computer architecture related concepts, process
Computer architecture related concepts, process
ssusera979f41
 
AOS Lab 4: If you liked it, then you should have put a “lock” on it
AOS Lab 4: If you liked it, then you should have put a “lock” on it
Zubair Nabi
 
C++0x
C++0x
Vaibhav Bajaj
 
ch13 here is the ppt of this chapter included pictures
ch13 here is the ppt of this chapter included pictures
PranjalRana4
 
Google: Cluster computing and MapReduce: Introduction to Distributed System D...
Google: Cluster computing and MapReduce: Introduction to Distributed System D...
tugrulh
 
Multithreading 101
Multithreading 101
Tim Penhey
 
cs2110Concurrency1.ppt
cs2110Concurrency1.ppt
narendra551069
 
Semantics
Semantics
Jahanzeb Jahan
 
Os3
Os3
gopal10scs185
 
Concurrency
Concurrency
Isaac Liao
 
Peyton jones-2011-parallel haskell-the_future
Peyton jones-2011-parallel haskell-the_future
Takayuki Muranushi
 
Simon Peyton Jones: Managing parallelism
Simon Peyton Jones: Managing parallelism
Skills Matter
 
Как мы охотимся на гонки (data races) или «найди багу до того, как она нашла ...
Как мы охотимся на гонки (data races) или «найди багу до того, как она нашла ...
yaevents
 
Nondeterminism is unavoidable, but data races are pure evil
Nondeterminism is unavoidable, but data races are pure evil
racesworkshop
 
Parallel Programming: Beyond the Critical Section
Parallel Programming: Beyond the Critical Section
Tony Albrecht
 
Interactions complicate debugging
Interactions complicate debugging
Syed Zaid Irshad
 
Computer architecture related concepts, process
Computer architecture related concepts, process
ssusera979f41
 
AOS Lab 4: If you liked it, then you should have put a “lock” on it
AOS Lab 4: If you liked it, then you should have put a “lock” on it
Zubair Nabi
 
ch13 here is the ppt of this chapter included pictures
ch13 here is the ppt of this chapter included pictures
PranjalRana4
 
Google: Cluster computing and MapReduce: Introduction to Distributed System D...
Google: Cluster computing and MapReduce: Introduction to Distributed System D...
tugrulh
 
Multithreading 101
Multithreading 101
Tim Penhey
 
cs2110Concurrency1.ppt
cs2110Concurrency1.ppt
narendra551069
 
Ad

More from greenwop (9)

Performance Analysis of Idle Programs
Performance Analysis of Idle Programs
greenwop
 
Unifying Remote Data, Remote Procedure, and Service Clients
Unifying Remote Data, Remote Procedure, and Service Clients
greenwop
 
Expressiveness, Simplicity and Users
Expressiveness, Simplicity and Users
greenwop
 
Category theory, Monads, and Duality in the world of (BIG) Data
Category theory, Monads, and Duality in the world of (BIG) Data
greenwop
 
A Featherweight Approach to FOOL
A Featherweight Approach to FOOL
greenwop
 
The Rise of Dynamic Languages
The Rise of Dynamic Languages
greenwop
 
Turning a Tower of Babel into a Beautiful Racket
Turning a Tower of Babel into a Beautiful Racket
greenwop
 
Normal Considered Harmful
Normal Considered Harmful
greenwop
 
High Performance JavaScript
High Performance JavaScript
greenwop
 
Performance Analysis of Idle Programs
Performance Analysis of Idle Programs
greenwop
 
Unifying Remote Data, Remote Procedure, and Service Clients
Unifying Remote Data, Remote Procedure, and Service Clients
greenwop
 
Expressiveness, Simplicity and Users
Expressiveness, Simplicity and Users
greenwop
 
Category theory, Monads, and Duality in the world of (BIG) Data
Category theory, Monads, and Duality in the world of (BIG) Data
greenwop
 
A Featherweight Approach to FOOL
A Featherweight Approach to FOOL
greenwop
 
The Rise of Dynamic Languages
The Rise of Dynamic Languages
greenwop
 
Turning a Tower of Babel into a Beautiful Racket
Turning a Tower of Babel into a Beautiful Racket
greenwop
 
Normal Considered Harmful
Normal Considered Harmful
greenwop
 
High Performance JavaScript
High Performance JavaScript
greenwop
 
Ad

Recently uploaded (20)

Quantum AI Discoveries: Fractal Patterns Consciousness and Cyclical Universes
Quantum AI Discoveries: Fractal Patterns Consciousness and Cyclical Universes
Saikat Basu
 
Techniques for Automatic Device Identification and Network Assignment.pdf
Techniques for Automatic Device Identification and Network Assignment.pdf
Priyanka Aash
 
Cyber Defense Matrix Workshop - RSA Conference
Cyber Defense Matrix Workshop - RSA Conference
Priyanka Aash
 
Lessons Learned from Developing Secure AI Workflows.pdf
Lessons Learned from Developing Secure AI Workflows.pdf
Priyanka Aash
 
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
Earley Information Science
 
Securing Account Lifecycles in the Age of Deepfakes.pptx
Securing Account Lifecycles in the Age of Deepfakes.pptx
FIDO Alliance
 
Daily Lesson Log MATATAG ICT TEchnology 8
Daily Lesson Log MATATAG ICT TEchnology 8
LOIDAALMAZAN3
 
Mastering AI Workflows with FME by Mark Döring
Mastering AI Workflows with FME by Mark Döring
Safe Software
 
OpenACC and Open Hackathons Monthly Highlights June 2025
OpenACC and Open Hackathons Monthly Highlights June 2025
OpenACC
 
Oh, the Possibilities - Balancing Innovation and Risk with Generative AI.pdf
Oh, the Possibilities - Balancing Innovation and Risk with Generative AI.pdf
Priyanka Aash
 
2025_06_18 - OpenMetadata Community Meeting.pdf
2025_06_18 - OpenMetadata Community Meeting.pdf
OpenMetadata
 
9-1-1 Addressing: End-to-End Automation Using FME
9-1-1 Addressing: End-to-End Automation Using FME
Safe Software
 
Security Tips for Enterprise Azure Solutions
Security Tips for Enterprise Azure Solutions
Michele Leroux Bustamante
 
UserCon Belgium: Honey, VMware increased my bill
UserCon Belgium: Honey, VMware increased my bill
stijn40
 
CapCut Pro Crack For PC Latest Version {Fully Unlocked} 2025
CapCut Pro Crack For PC Latest Version {Fully Unlocked} 2025
pcprocore
 
cnc-processing-centers-centateq-p-110-en.pdf
cnc-processing-centers-centateq-p-110-en.pdf
AmirStern2
 
You are not excused! How to avoid security blind spots on the way to production
You are not excused! How to avoid security blind spots on the way to production
Michele Leroux Bustamante
 
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
Edge AI and Vision Alliance
 
"Database isolation: how we deal with hundreds of direct connections to the d...
"Database isolation: how we deal with hundreds of direct connections to the d...
Fwdays
 
Curietech AI in action - Accelerate MuleSoft development
Curietech AI in action - Accelerate MuleSoft development
shyamraj55
 
Quantum AI Discoveries: Fractal Patterns Consciousness and Cyclical Universes
Quantum AI Discoveries: Fractal Patterns Consciousness and Cyclical Universes
Saikat Basu
 
Techniques for Automatic Device Identification and Network Assignment.pdf
Techniques for Automatic Device Identification and Network Assignment.pdf
Priyanka Aash
 
Cyber Defense Matrix Workshop - RSA Conference
Cyber Defense Matrix Workshop - RSA Conference
Priyanka Aash
 
Lessons Learned from Developing Secure AI Workflows.pdf
Lessons Learned from Developing Secure AI Workflows.pdf
Priyanka Aash
 
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
Earley Information Science
 
Securing Account Lifecycles in the Age of Deepfakes.pptx
Securing Account Lifecycles in the Age of Deepfakes.pptx
FIDO Alliance
 
Daily Lesson Log MATATAG ICT TEchnology 8
Daily Lesson Log MATATAG ICT TEchnology 8
LOIDAALMAZAN3
 
Mastering AI Workflows with FME by Mark Döring
Mastering AI Workflows with FME by Mark Döring
Safe Software
 
OpenACC and Open Hackathons Monthly Highlights June 2025
OpenACC and Open Hackathons Monthly Highlights June 2025
OpenACC
 
Oh, the Possibilities - Balancing Innovation and Risk with Generative AI.pdf
Oh, the Possibilities - Balancing Innovation and Risk with Generative AI.pdf
Priyanka Aash
 
2025_06_18 - OpenMetadata Community Meeting.pdf
2025_06_18 - OpenMetadata Community Meeting.pdf
OpenMetadata
 
9-1-1 Addressing: End-to-End Automation Using FME
9-1-1 Addressing: End-to-End Automation Using FME
Safe Software
 
Security Tips for Enterprise Azure Solutions
Security Tips for Enterprise Azure Solutions
Michele Leroux Bustamante
 
UserCon Belgium: Honey, VMware increased my bill
UserCon Belgium: Honey, VMware increased my bill
stijn40
 
CapCut Pro Crack For PC Latest Version {Fully Unlocked} 2025
CapCut Pro Crack For PC Latest Version {Fully Unlocked} 2025
pcprocore
 
cnc-processing-centers-centateq-p-110-en.pdf
cnc-processing-centers-centateq-p-110-en.pdf
AmirStern2
 
You are not excused! How to avoid security blind spots on the way to production
You are not excused! How to avoid security blind spots on the way to production
Michele Leroux Bustamante
 
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
Edge AI and Vision Alliance
 
"Database isolation: how we deal with hundreds of direct connections to the d...
"Database isolation: how we deal with hundreds of direct connections to the d...
Fwdays
 
Curietech AI in action - Accelerate MuleSoft development
Curietech AI in action - Accelerate MuleSoft development
shyamraj55
 

Programming Language Memory Models: What do Shared Variables Mean?

  • 1. Programming Language Memory Models: What do Shared Variables Mean? Hans-J. Boehm HP Labs 7/28/2011 1
  • 2. The problem • Shared memory parallel programs are built on shared variables visible to multiple threads of control. • But there is a lot of confusion about what those variables mean: – Are concurrent accesses allowed? – What is a concurrent access? – When do updates become visible to other threads? – Can an update be partially visible? • Many recent efforts with serious technical issues: – Java, OpenMP 3.0, UPC, Go …
  • 3. Credits and Disclaimers: • Much of this work was done by others or jointly. I'm relying particularly on: – Basic approach: Sarita Adve, Mark Hill, Ada 83 … – JSR 133: Also Jeremy Manson, Bill Pugh, Doug Lea – C++11: Lawrence Crowl, Clark Nelson, Paul McKenney, Herb Sutter, … – Improved hardware models: Peter Sewell's group, many Intel, AMD, ARM, IBM participants … – Conflict exception work: Ceze, Lucia, Qadeer, Strauss – Recent Java Memory Model work: Sevcik, Aspinall, Cenciarelli – Race-free optimization: Pramod Joisha, Laura Effinger-Dean, Dhruva Chakrabarti • But some of it is still controversial. – This reflects my personal views.
  • 4. Outline • Emerging consensus: – Interleaving semantics (Sequential Consistency) – But only for data-race-free programs • Brief discussion of consequences – Software requirements – Hardware requirements • Major remaining problem: – Java can't outlaw races. – We don't know how to give meaning to data races. – Some speculative solutions.
  • 5. Naive threads programming model (Sequential Consistency) • Threads behave as though their memory accesses were simply interleaved. (Sequential consistency) Thread 1: x = 1; z = 3; Thread 2: y = 2; – might be executed as x = 1; y = 2; z = 3;
  • 6. Locks restrict interleavings Thread 1: lock(l); r1 = x; x = r1+1; unlock(l); Thread 2: lock(l); r2 = x; x = r2+1; unlock(l); – can only be executed as lock(l); r1 = x; x = r1+1; unlock(l); lock(l); r2 = x; x = r2+1; unlock(l); or lock(l); r2 = x; x = r2+1; unlock(l); lock(l); r1 = x; x = r1+1; unlock(l); since second lock(l) must follow first unlock(l)
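The locked increment above can be sketched in C++11 as follows. This is a minimal illustration rather than code from the talk: the helper name `run_locked_increments` is an assumption, and `std::mutex` with `std::lock_guard` stands in for the slide's `lock(l)` / `unlock(l)`.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical helper (not from the slides): runs the two-thread
// increment and returns the final value of x.
int run_locked_increments() {
    int x = 0;      // shared variable, only accessed while holding l
    std::mutex l;

    auto increment = [&] {
        std::lock_guard<std::mutex> guard(l);  // lock(l); unlocks on scope exit
        int r = x;                             // r1 = x;
        x = r + 1;                             // x = r1 + 1;
    };

    std::thread t1(increment);
    std::thread t2(increment);
    t1.join();
    t2.join();
    return x;  // either permitted interleaving leaves x == 2
}
```

Because each critical section runs entirely before or entirely after the other, both orders give the same final value.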
  • 7. Atomic sections / transactional memory are just like a single global lock.
  • 8. But this doesn't quite work … • Limits reordering and other hardware/compiler transformations – "Dekker's" example (everything initially zero) should allow r1 = r2 = 0: Thread 1: x = 1; r1 = y; Thread 2: y = 1; r2 = x; – Compilers like to move up loads. – Hardware likes to buffer stores.
  • 9. Sensitive to memory access granularity Thread 1: x = 300; Thread 2: x = 100; • If memory is accessed a byte at a time, this may be executed as: x_high = 0; (Thread 2) x_high = 1; // x = 256 (Thread 1) x_low = 44; // x = 300 (Thread 1) x_low = 100; // x = 356 (Thread 2)
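The arithmetic behind the torn result can be checked with a small single-threaded simulation. This is only a sketch: it models `x` as two separate bytes and replays the interleaving from the slide by hand, and the function name `torn_store_result` is made up for illustration. Note that 300 = 1·256 + 44 and 100 = 0·256 + 100.

```cpp
#include <cassert>
#include <cstdint>

// Replays, in one thread, the byte-at-a-time interleaving of
// "x = 300" (Thread 1) and "x = 100" (Thread 2).
int torn_store_result() {
    std::uint8_t x_high = 0, x_low = 0;

    x_high = 0;    // Thread 2: high byte of 100
    x_high = 1;    // Thread 1: high byte of 300  (x == 256)
    x_low = 44;    // Thread 1: low byte of 300   (x == 300)
    x_low = 100;   // Thread 2: low byte of 100   (x == 356)

    return x_high * 256 + x_low;  // neither 300 nor 100
}
```

The final value mixes the high byte of one store with the low byte of the other, which is exactly the partially visible update the slide warns about.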
  • 10. And this is at too low a level … • And taking advantage of sequential consistency involves reasoning about memory access interleaving: – Much too hard. – Want to reason about larger "atomic" code regions • which can't be visibly interleaved.
  • 11. Real threads programming model (1) • Two memory accesses conflict if they – access the same scalar object*, e.g. variable. – at least one access is a store. – E.g. x = 1; and r2 = x; conflict • Two ordinary memory accesses participate in a data race if they – conflict, and – can occur simultaneously • i.e. appear as adjacent operations (diff. threads) in interleaving. • A program is data-race-free (on a particular input) if no sequentially consistent execution results in a data race. * or contiguous sequence of bit-fields
  • 12. Real threads programming model (2) • Sequential consistency only for data-race-free programs! – Avoid anything else. • Data races are prevented by – locks (or atomic sections) to restrict interleaving – declaring synchronization variables • (in two slides …)
  • 13. Alternate data-race-free definition: happens-before • Memory access a happens-before b if – a precedes b in program order. – a and b are synchronization operations, and b observes the results of a, thus enforcing ordering between them. • e.g. a unlocks m, b subsequently locks m. – Or there is a c such that a happens-before c and c happens-before b. • Two ordinary conflicting memory operations a and b participate in a data race if neither – a happens-before b, – nor b happens-before a. • Set of data-race-free programs usually the same. Thread 1: lock(m); t = x + 1; x = t; unlock(m) Thread 2: lock(m); t = x + 1; x = t; unlock(m)
  • 14. Synchronization variables • Java: volatile, java.util.concurrent.atomic. • C++11: atomic<int>, atomic_int • C1x: _Atomic(int), _Atomic int, atomic_int • Guarantee indivisibility of operations. • "Don't count" in determining whether there is a data race: – Programs with "races" on synchronization variables are still sequentially consistent. – Though there may be "escapes" (Java, C++11). • Dekker's algorithm "just works" with synchronization variables.
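As a hedged illustration of the last point, the "Dekker's" example from slide 8 can be written with C++11 synchronization variables. The helper name `dekker_trial` is hypothetical. With `std::atomic<int>` and the default (sequentially consistent) ordering, the outcome r1 = r2 = 0 cannot occur, because in any interleaving one of the two stores precedes both loads.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical helper: one run of the Dekker-style test with x and y
// declared as synchronization variables. Returns (r1, r2).
std::pair<int, int> dekker_trial() {
    std::atomic<int> x(0), y(0);
    int r1 = -1, r2 = -1;   // each written by exactly one thread

    std::thread t1([&] { x = 1; r1 = y; });  // seq_cst store, then seq_cst load
    std::thread t2([&] { y = 1; r2 = x; });
    t1.join();
    t2.join();
    return std::make_pair(r1, r2);
}
```

If `x` and `y` were plain `int` variables instead, the program would contain a data race and C++11 would give it no defined meaning at all.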
  • 15. As expressive as races Double-checked locking: Wrong! bool x_init; if (!x_init) { l.lock(); if (!x_init) { initialize x; x_init = true; } l.unlock(); } use x; Double-checked locking: Correct (C++11): atomic<bool> x_init; if (!x_init) { l.lock(); if (!x_init) { initialize x; x_init = true; } l.unlock(); } use x;
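A compilable version of the correct variant might look like the sketch below. The names `get_x`, `init_count` and `run_double_checked`, and the placeholder value 42, are assumptions added for illustration; only `x_init`, `l` and `x` come from the slide.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

std::atomic<bool> x_init(false);  // synchronization variable
std::mutex l;
int x = 0;                        // ordinary data, published via x_init
int init_count = 0;               // only touched while holding l

// Returns x, initializing it at most once.
int get_x() {
    if (!x_init) {                              // first check, no lock held
        std::lock_guard<std::mutex> guard(l);
        if (!x_init) {                          // second check, under the lock
            x = 42;                             // "initialize x" (placeholder)
            ++init_count;
            x_init = true;                      // publish after initialization
        }
    }
    return x;                                   // "use x"
}

// Hypothetical driver: calls get_x() from several threads at once and
// reports how many times the initialization ran.
int run_double_checked(int num_threads) {
    std::vector<std::thread> threads;
    std::vector<int> seen(num_threads, 0);
    for (int i = 0; i < num_threads; ++i)
        threads.emplace_back([&seen, i] { seen[i] = get_x(); });
    for (auto& t : threads) t.join();
    for (int v : seen) assert(v == 42);
    return init_count;
}
```

The only difference from the broken version is the type of `x_init`: because it is an atomic, the unlocked first check no longer races with the store inside the critical section.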
  • 16. Data Races Revisited • Are defined in terms of sequentially consistent executions. • If x and y are initially zero, this does not have a data race: Thread 1: if (x) y = 1; Thread 2: if (y) x = 1;
  • 17. SC for DRF programming model advantages over SC • Supports important hardware & compiler optimizations. • DRF restriction ⇒ Synchronization-free code sections appear to execute atomically, i.e. without visible interleaving. – If one didn't: Thread 1 (not atomic): a = 1; b = 1; Thread 2 (observer): if (a == 1 && b == 0) { … }
  • 18. SC for DRF implementation model (1) • Very restricted reordering of memory operations around synchronization operations: – Compiler either understands these, or treats them as opaque, potentially updating any location. – Synchronization operations include instructions to limit or prevent hardware reordering ("memory fences"). • e.g. lock acquisition, release, atomic store, might contain memory fences. [Diagram: lock(); synch-free code region; unlock(); synch-free code region]
  • 19. SC for DRF implementation model (2) • Code may be reordered between synchronization operations. – Another thread can only tell if it accesses the same data between reordered operations. – Such an access would be a data race. • If data races are disallowed (e.g. Posix, Ada, C++11, not Java), compiler may assume that variables don't change asynchronously. [Diagram: lock(); synch-free code region; unlock(); synch-free code region]
  • 20. Some variants (X = data races are errors) • C++11 draft, C1x draft: SC for DRF, with explicit easily avoidable exceptions (X) • Java: SC for DRF, with some exceptions. More details later. • Ada83: SC for DRF (?) (X) • Pthreads: SC for DRF (??) (X) • OpenMP: SC for DRF (??, except atomics) (X) • Fortran 2008: SC for DRF (??, except atomics) (X) • C#, .Net: SC for DRF when restricted to locks + Interlocked (???)
  • 21. The exceptions in C++11: atomic<bool> x_init; if (!x_init.load(memory_order_acquire)) { l.lock(); if (!x_init.load(memory_order_relaxed)) { initialize x; x_init.store(true, memory_order_release); } l.unlock(); } use x; We'll ignore this for the rest of the talk …
  • 22. Data Races as Errors • In many languages, data races are errors – In the sense of out-of-bounds array stores in C. – Anything can happen: • E.g. program can crash, local variables can appear to change, etc. – There are no benign races in e.g. C++11 or Ada 83. – Compilers may, and do, assume there are no races. If that assumption is false, all bets are off. – See e.g. Boehm, How to miscompile programs with "benign" data races, HotPar '11.
  • 23. How a data race may cause a crash unsigned x; if (x < 3) { … // async x change switch(x) { case 0: … case 1: … case 2: … } } • Assume switch statement compiled as branch table. • May assume x is in range. • Asynchronous change to x causes wild branch. – Not just wrong value.
  • 24. Note: We can usually ignore wait/notify • In most languages wait can return without notify. – "Spurious wakeup" • Except for termination issues: – wait() ≡ unlock(); lock() – notify() ≡ nop • Applies to existence of data races, partial correctness, flow analysis.
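Since wait can return without a matching notify, callers conventionally re-check their condition in a loop. A minimal C++11 sketch of that idiom follows; the helper name `wait_for_value` and the value 7 are illustrative assumptions. `std::condition_variable::wait` is documented as being allowed to wake spuriously, which matches the slide's reading of wait() as unlock(); lock().

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper: a consumer waits for a producer to publish a value.
int wait_for_value() {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;   // the condition itself, protected by m
    int value = 0;

    std::thread producer([&] {
        std::lock_guard<std::mutex> guard(m);
        value = 7;
        ready = true;
        cv.notify_one();
    });

    int result;
    {
        std::unique_lock<std::mutex> lock(m);
        while (!ready)        // re-check: wait() may return without a notify
            cv.wait(lock);    // releases m while blocked, reacquires on return
        result = value;
    }
    producer.join();
    return result;
}
```

All accesses to `ready` and `value` happen under `m`, so the program is data-race-free whether or not any notify is ever delivered; the notify only affects when the waiter makes progress.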
  • 25. Another note on data race definition • We define it in terms of scalar accesses, but • Container libraries should ensure that Container accesses don't race ⇒ No races on memory locations • This means – Accesses to hidden shared state (caches, allocation) must be locked by implementation. – User must lock for container-level races. • This is often the correct library thread-safety condition.
  • 26. Outline • Emerging consensus: – Interleaving semantics (Sequential Consistency) – But only for data-race-free programs • Brief discussion of consequences – Software requirements – Hardware requirements • Major remaining problem: – Java can't outlaw races. – We don't know how to give meaning to data races. – Some speculative solutions.
  • 27. Compilers must not introduce data races • Single thread compilers currently may add data races: (PLDI 05) struct {char a; char b;} x; x.a = 'z'; may be compiled as tmp = x; tmp.a = 'z'; x = tmp; – x.a = 1 in parallel with x.b = 1 may fail to update x.b. • … and much more interesting examples. • Still broken in gcc with bit-fields.
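From the programmer's side, the rule means the following C++11 program is data-race-free and must see both updates: the two `char` members are distinct memory locations, so a conforming compiler may not implement either store by rewriting the whole struct. The type name `TwoChars` and the helper name are assumptions made for this sketch.

```cpp
#include <cassert>
#include <thread>

struct TwoChars { char a; char b; };

// Hypothetical helper: two threads update adjacent fields with no locking.
// Returns true if both updates survived.
bool adjacent_field_updates_survive() {
    TwoChars x = {0, 0};
    std::thread t1([&] { x.a = 1; });  // touches only x.a
    std::thread t2([&] { x.b = 1; });  // touches only x.b
    t1.join();
    t2.join();
    return x.a == 1 && x.b == 1;
}
```

The same guarantee does not extend to adjacent bit-fields, which C++11 treats as a single memory location; that is the case the slide flags as still broken.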
  • 28. Some restrictions are a bit more annoying: • Compiler may not introduce "speculative" stores: int count; // global, possibly shared … for (p = q; p != 0; p = p -> next) if (p -> data > 0) ++count; must not be transformed into: int count; // global, possibly shared … reg = count; for (p = q; p != 0; p = p -> next) if (p -> data > 0) ++reg; count = reg; // may spuriously assign to count
  • 29. Introducing data races: Potentially infinite loops while (…) { *p = 1; } x = 1; transformed (?) into x = 1; while (…) { *p = 1; } • x not otherwise accessed by loop. • Prohibited in Java. • Allowed by giving undefined behavior to such infinite loops in C++0x. (Controversial.) • Common textbook analyses treat as equivalent?
  • 30. Note on program analysis for optimization • Lots of research papers on analysis of sequentially consistent parallel programs. • AFAIK: Real compilers don’t do that! – For good reason. – Use sequential optimization in sync-free regions. • SC analysis is – Generally sound, but pessimistic for C++11 (minus SC escapes) – Very much so with incomplete information – Unsound for Java (but …)
  • 31. Some positive optimization consequences • Rules are finally clear! – Optimization does not need to preserve full sequential consistency! • In languages that outlaw data races – Compilers already use that. – It can be leveraged more aggressively: – Assuming no other synchronization, x not modified by this thread, x is loop invariant in while (…) { … x … l.lock(); … l.unlock(); }
  • 32. Language spec challenge 1 • Some common language features almost unavoidably introduce data races. • Most significant example: – Detached/daemon threads, combined with – Destructors / finalizers for static variables – Detached threads call into libraries that may access static variables • Even while they're being cleaned up.
  • 33. Language spec challenge 2: • Some really awful code: Don't try this at home!! Thread 1: x = 42; m.lock(); Thread 2: while (m.trylock()==SUCCESS) m.unlock(); assert (x == 42); • Disclaimer: Example requires tweaking to be pthreads-compliant. • Can the assertion fail? • Many implementations: Yes • Traditional specs: No. C++11: Yes • Trylock() can effectively fail spuriously!
  • 34. Outline • Emerging consensus: – Interleaving semantics (Sequential Consistency) – But only for data-race-free programs • Brief discussion of consequences – Software requirements – Hardware requirements • Major remaining problem: – Java can't outlaw races. – We don't know how to give meaning to data races. – Some speculative solutions.
  • 35. Byte store instructions • x.c = 'a'; may not visibly read and rewrite adjacent fields. • Byte stores must be implemented with – Byte store instruction, or – Atomic read-modify-write. • Typically expensive on multiprocessors. • Often cheaply implementable on uniprocessors.
  • 36. Sequential consistency must be enforceable • Programs using only synchronization variables must be sequentially consistent. • Compiler literature contains many papers on enforcing sequential consistency by adding fences. But: – Not really possible on Itanium. – Wasn't possible on X86 until the re-revision of the spec last year. – Took months of discussions with PowerPC architects to conclude it's (barely, sort of) possible there. • The core issue is "write atomicity":
  • 37. Can fences enforce SC? Unclear that hardware fences can ensure sequential consistency. "IRIW" example: x, y initially zero. Fences between every instruction pair! Thread 1: x = 1; Thread 2: r1 = x; (1) fence; r2 = y; (0) [x set first!] Thread 3: y = 1; Thread 4: r3 = y; (1) fence; r4 = x; (0) [y set first!] Fully fenced, not sequentially consistent. Does hardware allow it?
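At the language level the requirement is easier to state than to implement. The sketch below writes IRIW with C++11 atomics using the default sequentially consistent ordering, under which the outcome r1 = 1, r2 = 0, r3 = 1, r4 = 0 is forbidden; it is the compiler and hardware that must find a way to guarantee this. The struct and function names are hypothetical, and the hardware `fence` instructions from the slide are replaced here by seq_cst atomic operations.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct IriwResult { int r1, r2, r3, r4; };

// Hypothetical helper: one run of the four-thread IRIW test.
IriwResult iriw_trial() {
    std::atomic<int> x(0), y(0);
    IriwResult r = {-1, -1, -1, -1};   // each field written by one thread only

    std::thread t1([&] { x = 1; });
    std::thread t2([&] { r.r1 = x; r.r2 = y; });   // reads x, then y
    std::thread t3([&] { y = 1; });
    std::thread t4([&] { r.r3 = y; r.r4 = x; });   // reads y, then x
    t1.join(); t2.join(); t3.join(); t4.join();
    return r;
}

// True for the outcome in which the two readers disagree on store order.
bool is_forbidden(const IriwResult& r) {
    return r.r1 == 1 && r.r2 == 0 && r.r3 == 1 && r.r4 == 0;
}
```

A test run that never observes the forbidden outcome does not prove an implementation correct, but observing it even once would demonstrate a violation of sequential consistency for synchronization variables.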
  • 38. Why does it matter? • Nobody cares about IRIW!? • It's a pain to enforce on at least PowerPC. • Many people (Sarita Adve, Doug Lea, Vijay Saraswat) spent about a year trying to relax SC requirement here. • (Personal opinion) The results were incomprehensible, and broke more important code. • No viable alternatives!
  • 39. Acceptable hardware memory models • More challenging requirements: 1. Precise memory model specification 2. Byte stores 3. Cheap mechanism to enforce write atomicity 4. Dirt cheap mechanism to enforce data dependency ordering(?) (Java final fields) • Other than that, all standard approaches appear workable, but …
  • 40. Replace fences completely? Synchronization variables on X86 • atomic store: – store (mov) [~1 cycle]; mfence [dozens of cycles]; • atomic load: – load (mov) [~1 cycle] • Store implicitly ensures that prior memory operations become visible before store. • Load implicitly ensures that subsequent memory operations become visible later. • Sole reason for mfence: Order atomic store followed by atomic load.
  • 41. Fence enforces all kinds of additional, unobservable orderings • s is a synchronization variable: x = 1; s = 2; // includes fence r1 = y; • Prevents reordering of x = 1 and r1 = y; – final load delayed until assignment to x is visible. • But this ordering is invisible to non-racing threads – …and expensive to enforce? • We need a tiny fraction of mfence functionality.
  • 42. Outline • Emerging consensus: – Interleaving semantics (Sequential Consistency) – But only for data-race-free programs • Brief discussion of consequences – Software requirements – Hardware requirements • Major remaining problem: – Java can't outlaw races. – We don't know how to give meaning to data races. – Some speculative solutions.
  • 43. Data Races in Java • C++11 leaves data race semantics undefined. – "catch fire" semantics • Java supports sand-boxed code. • Don't know how to prevent data-races in sand-boxed, malicious code. • Java must provide some guarantees in the presence of data races.
  • 44. Interesting data race outcome? x, y initially null, Loads may or may not see racing stores? Thread 1: r1 = x; y = r1; Thread 2: r2 = y; x = r2; Outcome: x = y = r1 = r2 = “<your bank password here>”
  • 45. The Java Solution Quotation from 17.4.8, Java Language Specification, 3rd edition, omitted, to avoid possible copyright questions. It addresses the issue by explicitly outlawing the causality cycles required to get the dubious result from the last slide. The important point is that this is a rather complex mathematical specification. …
  • 46. Complicated, but nice properties? • Manson, Pugh, Adve: The Java Memory Model, POPL 05 Quotation from section 9.1.2 of above paper omitted, to avoid possible copyright questions. This asserts (Theorem 1) that non-conflicting operations may be reordered by a compiler.
  • 47. Much nicer than prior attempts, but: • Aspinall, Sevcik, “Java Memory Model Examples: Good, Bad, and Ugly”, VAMP 2007 (also ECOOP 2008 paper) Quotation from above paper omitted, to avoid possible copyright questions. This ends in the statement: “This falsifies Theorem 1 of [paper from previous slide].” Note: The underlying observation is due to Pietro Cenciarelli.
  • 48. Why is this hard? • Want – Constrained race semantics for essential security properties. – Unconstrained race semantics to support compiler and hardware optimizations. – Simplicity. • No known good resolution.
  • 49. Outline • Emerging consensus: – Interleaving semantics (Sequential Consistency) – But only for data-race-free programs • Brief discussion of consequences – Software requirements – Hardware requirements • Major remaining problem: – Java can't outlaw races. – We don't know how to give meaning to data races. – Some speculative solutions.
  • 50. A Different Approach • Outlaw data races. • Require violations to be detectable! – Even in malicious sand-boxed code. • Possible approaches: – Statically prevent data races. • Tried repeatedly, ongoing work … – Dynamically detect the relevant data races.
  • 51. Dynamic Race Detection • Need to guarantee one of: – Program is data-race free and provides SC execution (done), – Program contains a data race and raises an exception, or – Program exhibits simple semantics anyway, e.g. • Sequentially consistent • Synchronization-free regions are atomic • This is significantly cheaper than fully accurate data-race detection. – Track byte-level R/W information – Mostly in cache – As opposed to epoch number + thread id per byte
  • 52. For more information: • Boehm, “Threads Basics”, HPL TR 2009-259. • Boehm, Adve, “Foundations of the C++ Concurrency Memory Model”, PLDI 08. • Sevcik and Aspinall, “On Validity of Program Transformations in the Java Memory Model”, ECOOP 08. • Sewell et al, “x86-TSO: A Rigorous and Usable Programmer's Model for x86”, CACM, July 2010. • S. V. Adve, Boehm, “Memory Models: A Case for Rethinking Parallel Languages and Hardware”, CACM, August 2010. • Lucia, Strauss, Ceze, Qadeer, Boehm, “Conflict Exceptions: Providing Simple Parallel Language Semantics with Precise Hardware Exceptions”, ISCA 2010. Also Marino et al, PLDI 10. • Effinger-Dean, Boehm, Chakrabarti, Joisha, “Extended Sequential Reasoning for Data-Race-Free Programs”, MSPC 11.
  • 55. Introducing Races: Register Promotion 2 Original: [g is global] for(...) { if(mt) lock(); use/update g; if(mt) unlock(); } Transformed: r = g; for(...) { if(mt) { g = r; lock(); r = g; } use/update r instead of g; if(mt) { g = r; unlock(); r = g; } } g = r;
  • 56. Trylock: Critical section reordering? • Reordering of memory operations with respect to critical sections: Expected (& Java): Naïve pthreads: Optimized pthreads lock() lock() lock() unlock() unlock() unlock()
  • 57. Some open source pthread lock implementations (2006), each shown as a lock() … unlock() pair: • NPTL {Alpha, PowerPC} {mutex, spin}: [technically incorrect] • NPTL Itanium (&X86) mutex: [Correct, slow] • NPTL {Itanium, X86} spin: [Correct] • FreeBSD Itanium spin: [Incorrect]