Discussion:
Functional programming considered bad for parallelism
Jon Harrop
2009-11-15 16:39:23 UTC
Avoiding mutation is widely hailed as the design of choice for parallel
programming and this is used to advocate the adoption of functional
programming. However, functional programming seems to have some serious
disadvantages in the context of shared-memory parallelism for multicores.

Firstly, functional data structures are more allocation intensive than
imperative data structures. Allocation is access to a global mutable
resource (the shared heap). Even with a mostly-concurrent collector, any
thread allocating heavily degrades the performance of all other threads and
destroys scalability across the entire system.
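
To make the contrast concrete, here is a rough Haskell sketch (sizes and
names are arbitrary; only where the allocation happens matters). The
functional version allocates fresh tree nodes on every insert, while the
imperative version allocates one flat array and then writes in place:

  import Control.Monad (forM_)
  import Data.Array.ST (newArray, runSTUArray, writeArray)
  import Data.Array.Unboxed (UArray, (!))
  import qualified Data.Map as Map

  -- Functional: every insert allocates new nodes along the path to the root.
  squaresMap :: Int -> Map.Map Int Int
  squaresMap n = foldr (\i m -> Map.insert i (i * i) m) Map.empty [0 .. n]

  -- Imperative (inside ST): one array allocation, then in-place writes.
  squaresArr :: Int -> UArray Int Int
  squaresArr n = runSTUArray (do
    a <- newArray (0, n) 0
    forM_ [0 .. n] (\i -> writeArray a i (i * i))
    return a)

  main :: IO ()
  main = print (Map.lookup 7 (squaresMap 100), squaresArr 100 ! 7)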

Secondly, functional data structures are more memory intensive than
imperative data structures. This places more stress on memory bandwidth
which, again, is a shared resource and stressing it degrades scalability
across the entire system.

So it seems that eschewing mutation may be a fundamentally bad idea in the
context of parallelism.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Nobody
2009-11-15 18:37:45 UTC
Post by Jon Harrop
Avoiding mutation is widely hailed as the design of choice for parallel
programming and this is used to advocate the adoption of functional
programming. However, functional programming seems to have some serious
disadvantages in the context of shared-memory parallelism for multicores.
Firstly, functional data structures are more allocation intensive than
imperative data structures.
What's a "functional data structure"?

Or are you conflating existing implementations of functional languages
with functional languages and/or functional programming in general?
Post by Jon Harrop
Allocation is access to a global mutable
resource (the shared heap). Even with a mostly-concurrent collector, any
thread allocating heavily degrades the performance of all other threads and
destroys scalability across the entire system.
There's no reason why you can't give each thread a separate arena.
Post by Jon Harrop
Secondly, functional data structures are more memory intensive than
imperative data structures. This places more stress on memory bandwidth
which, again, is a shared resource and stressing it degrades scalability
across the entire system.
This claim cannot be addressed without explaining what you consider to be
a functional data structure.
Post by Jon Harrop
So it seems that eschewing mutation may be a fundamentally bad idea in the
context of parallelism.
Even if it's true in some cases, which cases those are depends upon the
type of parallelism you're talking about. With NUMA, a functional approach
with higher total memory bandwidth distributed evenly between banks may
beat an imperative approach with a lower total bandwidth concentrated on a
single bank.
Jon Harrop
2009-11-15 23:24:19 UTC
Post by Nobody
Post by Jon Harrop
Avoiding mutation is widely hailed as the design of choice for parallel
programming and this is used to advocate the adoption of functional
programming. However, functional programming seems to have some serious
disadvantages in the context of shared-memory parallelism for multicores.
Firstly, functional data structures are more allocation intensive than
imperative data structures.
What's a "functional data structure"?
Immutable:

http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf
Post by Nobody
Or are you conflating existing implementations of functional languages
with functional languages and/or functional programming in general?
I am basing this on the purely functional data structures that are currently
available.
Post by Nobody
Post by Jon Harrop
Allocation is access to a global mutable
resource (the shared heap). Even with a mostly-concurrent collector, any
thread allocating heavily degrades the performance of all other threads
and destroys scalability across the entire system.
There's no reason why you can't give each thread a separate arena.
Threads do have their own nursery generations in production quality
collectors but heavy allocation still kills scalability.
Post by Nobody
Post by Jon Harrop
So it seems that eschewing mutation may be a fundamentally bad idea in the
context of parallelism.
Even if it's true in some cases, which cases those are depends upon the
type of parallelism you're talking about. With NUMA, a functional approach
with higher total memory bandwidth distributed evenly between banks may
beat an imperative approach with a lower total bandwidth concentrated on a
single bank.
Can you give a concrete example?
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Vend
2009-11-16 01:00:38 UTC
Post by Jon Harrop
Threads do have their own nursery generations in production quality
collectors but heavy allocation still kills scalability.
If you have n cores with one thread per core, and you divide the heap into n
areas, one per thread, how does heavy allocation affect scalability?
Jon Harrop
2009-11-16 01:48:34 UTC
Post by Vend
Post by Jon Harrop
Allocation is access to a global mutable resource (the shared heap).
...
Threads do have their own nursery generations in production quality
collectors but heavy allocation still kills scalability.
If you have n cores with one thread per core, and you divide the heap into n
areas, one per thread, how does heavy allocation affect scalability?
You are describing distributed parallelism without a shared heap which is a
different scenario. In that case, allocation does scale but communication
does not. Specifically, the advantage of a shared heap is that one thread
can pass data to another thread by reference. Without a shared heap you
must resort to copying the data (usually into a message), which stresses
memory bandwidth and introduces contention for the thread that owns the
data, both of which kill scalability.
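
For instance, with a shared heap (a rough Haskell sketch; the names are
illustrative), handing a large immutable structure to another thread is
just passing a pointer through a synchronisation variable:

  import Control.Concurrent (forkIO)
  import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
  import qualified Data.Map as Map

  main :: IO ()
  main = do
    box <- newEmptyMVar
    let big = Map.fromList [(i, i * i) | i <- [1 .. 100000 :: Int]]
    _ <- forkIO (putMVar box big)  -- producer sends only a reference
    m <- takeMVar box              -- consumer sees the same structure
    print (Map.lookup 12345 m)

Without a shared heap the same hand-off means serialising and copying the
whole map.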

In essence, that Erlang-style design is great for distributed parallelism
but fails to take advantage of the shared-memory nature of a multicore.
Consequently, it is very slow in absolute terms.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Vend
2009-11-16 09:01:13 UTC
Post by Jon Harrop
Post by Vend
Post by Jon Harrop
Allocation is access to a global mutable resource (the shared heap).
...
Threads do have their own nursery generations in production quality
collectors but heavy allocation still kills scalability.
If you have n cores with one thread per core, and you divide the heap into n
areas, one per thread, how does heavy allocation affect scalability?
You are describing distributed parallelism without a shared heap which is a
different scenario.
No.
Maybe I didn't express myself clearly, but I was thinking of threads
sharing the same heap but each allocating into different areas.
Post by Jon Harrop
introduces contention for the thread that owns the
data
What do you mean?
Post by Jon Harrop
In essence, that Erlang-style design is great for distributed parallelism
but fails to take advantage of the shared-memory nature of a multicore.
Consequently, it is very slow in absolute terms.
I suppose that in Erlang when you send messages between threads
running on the same vm, it actually passes references, doesn't it?
Jon Harrop
2009-11-17 02:22:39 UTC
Post by Vend
Post by Jon Harrop
Post by Vend
Post by Jon Harrop
Allocation is access to a global mutable resource (the shared heap).
...
Threads do have their own nursery generations in production quality
collectors but heavy allocation still kills scalability.
If you have n cores with one thread per core, and you divide the heap into n
areas, one per thread, how does heavy allocation affect scalability?
You are describing distributed parallelism without a shared heap which is
a different scenario.
No.
Maybe I didn't express myself clearly, but I was thinking of threads
sharing the same heap but each allocating into different areas.
In what sense is that a shared heap?
Post by Vend
Post by Jon Harrop
introduces contention for the thread that owns the data
What do you mean?
If many threads want to read that data then they all have to ask permission
from the owner (who sends the message containing the data), which is
contention on a single global shared resource (the owner), and that doesn't
scale.
Post by Vend
Post by Jon Harrop
In essence, that Erlang-style design is great for distributed parallelism
but fails to take advantage of the shared-memory nature of a multicore.
Consequently, it is very slow in absolute terms.
I suppose that in Erlang when you send messages between threads
running on the same vm, it actually passes references, doesn't it?
No, it copies everything. Indeed, that is precisely why Erlang is so slow
and unsuitable for parallel programming.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Paul Rubin
2009-11-17 02:35:56 UTC
Post by Jon Harrop
Post by Vend
I suppose that in Erlang when you send messages between threads
running on the same vm, it actually passes references, doesn't it?
No, it copies everything. Indeed, that is precisely why Erlang is so slow
and unsuitable for parallel programming.
I thought that HiPE optimizes this.
Vend
2009-11-17 23:56:08 UTC
Post by Jon Harrop
Post by Vend
Post by Jon Harrop
Post by Vend
Post by Jon Harrop
Allocation is access to a global mutable resource (the shared heap).
...
Threads do have their own nursery generations in production quality
collectors but heavy allocation still kills scalability.
If you have n cores with one thread per core, and you divide the heap into n
areas, one per thread, how does heavy allocation affect scalability?
You are describing distributed parallelism without a shared heap which is
a different scenario.
No.
Maybe I didn't express myself clearly, but I was thinking of threads
sharing the same heap but each allocating into different areas.
In what sense is that a shared heap?
Any thread can read from objects allocated on the heap by any other
thread.
Post by Jon Harrop
Post by Vend
Post by Jon Harrop
introduces contention for the thread that owns the data
What do you mean?
If many threads want to read that data then they all have to ask permission
from the owner (who sends the message containing the data) which is
contention on a single global shared resource (the owner) and that doesn't
scale.
I don't get it. If shared data is immutable, there is no concept of
'ownership', at least not intuitively defined.

I think that message-based programs usually are designed so that
threads wake up when new data is available (some sort of dataflow
design), rather than actively asking for data and waiting for the
answer.
Post by Jon Harrop
Post by Vend
Post by Jon Harrop
In essence, that Erlang-style design is great for distributed parallelism
but fails to take advantage of the shared-memory nature of a multicore.
Consequently, it is very slow in absolute terms.
I suppose that in Erlang when you send messages between threads
running on the same vm, it actually passes references, doesn't it?
No, it copies everything. Indeed, that is precisely why Erlang is so slow
and unsuitable for parallel programming.
Then it is a flaw of the implementation rather than the language,
since it would be quite easy and obvious to pass only references when
sending immutable data between threads running on the same node.
Jon Harrop
2009-11-18 18:50:38 UTC
Post by Vend
Post by Jon Harrop
If many threads want to read that data then they all have to ask
permission from the owner (who sends the message containing the data)
which is contention on a single global shared resource (the owner) and
that doesn't scale.
I don't get it. If shared data is immutable, there is no concept of
'ownership', at least not intuitively defined.
Who is responsible for freeing the data? That is the owner. If you have
separate "arenas" for each thread then surely that thread must be
responsible for freeing data in its own arena, otherwise you have a shared
heap.
Post by Vend
Post by Jon Harrop
Post by Vend
I suppose that in Erlang when you send messages between threads
running on the same vm, it actually passes references, doesn't it?
No, it copies everything. Indeed, that is precisely why Erlang is so slow
and unsuitable for parallel programming.
Then it is a flaw of the implementation rather than the language,
since it would be quite easy and obvious to pass only references when
sending immutable data between threads running on the same node.
That is only easy and obvious from the programmer's point of view. From the
language implementor's point of view, it is probably the single most
difficult challenge they face: writing a concurrent garbage collector that
can collect values that are not reachable from any threads in the face of
programs that pass references between threads.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Vend
2009-11-19 11:14:54 UTC
Post by Jon Harrop
Post by Vend
Post by Jon Harrop
If many threads want to read that data then they all have to ask
permission from the owner (who sends the message containing the data)
which is contention on a single global shared resource (the owner) and
that doesn't scale.
I don't get it. If shared data is immutable, there is no concept of
'ownership', at least not intuitively defined.
Who is responsible for freeing the data? That is the owner. If you have
separate "arenas" for each thread then surely that thread must be
responsible for freeing data in its own arena, otherwise you have a shared
heap.
I'm not an expert on concurrent and parallel garbage collection, but I
suppose that algorithms that scale with the number of cores, at least
in average case scenarios, exist.

Anyway, if you use mutable shared memory, you still have the same
problem, unless you strictly separate local heap from shared heap (using
processes instead of threads, for instance) and manage the shared heap
manually.
Jon Harrop
2009-11-19 16:14:14 UTC
Post by Vend
Post by Jon Harrop
Who is responsible for freeing the data? That is the owner. If you have
separate "arenas" for each thread then surely that thread must be
responsible for freeing data in its own arena, otherwise you have a
shared heap.
I'm not an expert on concurrent and parallel garbage collection, but I
suppose that algorithms that scale with the number of cores, at least
in average case scenarios, exist.
Allocating heavily on any thread stalls allocation on all other threads
with .NET. I haven't tested other production-quality GCs but I assume they
are the same.
Post by Vend
Anyway, if you use mutable shared memory, you still have the same
problem, unless you strictly separate local heap from shared heap (using
processes instead of threads, for instance) and manage the shared heap
manually.
Purely functional data structures fill the heap with huge numbers of
pointers compared to imperative code, and that places far more stress on
the GC.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Vend
2009-11-19 16:31:17 UTC
Post by Jon Harrop
Post by Vend
Post by Jon Harrop
Who is responsible for freeing the data? That is the owner. If you have
separate "arenas" for each thread then surely that thread must be
responsible for freeing data in its own arena, otherwise you have a
shared heap.
I'm not an expert on concurrent and parallel garbage collection, but I
suppose that algorithms that scale with the number of cores, at least
in average case scenarios, exist.
Allocating heavily on any thread stalls allocation on all other threads
with .NET. I haven't tested other production-quality GCs but I assume they
are the same.
What GC algorithm does .NET use?
The Sun JVM offers several, with different theoretical performance
characteristics.

I can't test them since I don't have access to a multicore machine.
But, if I understand correctly, parallel scavenge should scale
properly in the scenario you describe.
Post by Jon Harrop
Post by Vend
Anyway, if you use mutable shared memory, you still have the same
problem, unless you strictly separate local heap from shared heap (using
processes instead of threads, for instance) and manage the shared heap
manually.
Purely functional data structures fill the heap with huge numbers of
pointers compared to imperative code, and that places far more stress on
the GC.
I'm not sure this is correct:

Using immutable data structures causes more allocation, but unless
threads keep references to old versions of the same structure, the
overhead in number of reachable objects, and thus pointers, should be
linear in the number of threads.

Moreover immutable data structures perform better under generational
collection since they can't have old-to-young pointers.
Casey Hawthorne
2009-11-19 22:20:21 UTC
Post by Jon Harrop
Purely functional data structures fill the heap with huge numbers of
pointers compared to imperative code, and that places far more stress on
the GC.
I think this is a problem with linked data structures in general;
however, in Haskell one can use immutable (unboxed) arrays, where the
ordering between the elements is implicit and therefore carries no
pointer overhead.

I would include an immutable array as a purely functional data
structure.
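
For instance, a small sketch with GHC's standard Data.Array.Unboxed (the
numbers are arbitrary, just to show the layout):

  import Data.Array.Unboxed (UArray, bounds, listArray, (!))

  -- A million Doubles stored contiguously: the ordering is implicit in
  -- the indices, so there are no per-element pointers for the GC to trace.
  table :: UArray Int Double
  table = listArray (0, 999999) [sqrt (fromIntegral i) | i <- [0 .. 999999]]

  main :: IO ()
  main = print (table ! 42, bounds table)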

--
Regards,
Casey

Nobody
2009-11-16 17:31:08 UTC
Post by Jon Harrop
Post by Vend
Post by Jon Harrop
Threads do have their own nursery generations in production quality
collectors but heavy allocation still kills scalability.
If you have n cores with one thread per core, and you divide the heap into n
areas, one per thread, how does heavy allocation affect scalability?
You are describing distributed parallelism without a shared heap which is a
different scenario. In that case, allocation does scale but communication
does not. Specifically, the advantage of a shared heap is that one thread
can pass data to another thread by reference. Without a shared heap you
must resort to copying the data (usually into a message), which stresses
memory bandwidth and introduces contention for the thread that owns the
data, both of which kill scalability.
Having multiple memory banks tied to distinct processors doesn't mean that
you can't access a bank from another processor, just that it may be less
efficient (but it's unlikely to be less efficient than unconditionally
copying the data).

Similarly, if you have a single CPU and a single memory bank but give each
thread its own arena, you can pass data between threads while avoiding
contention for allocation.
Post by Jon Harrop
In essence, that Erlang-style design is great for distributed parallelism
but fails to take advantage of the shared-memory nature of a multicore.
Consequently, it is very slow in absolute terms.
Avoiding in-place modification may hurt cache locality, but nothing
prohibits a functional language implementation from using in-place
modification when it can detect that this is safe. The "Clean" language
explicitly supports this through uniqueness typing, but it's still
considered to be a functional language.

OTOH, even in imperative languages, it's common to use "functional" data
structures to simplify atomic updates. I.e. rather than locking a complex
data structure while it is updated in-place, create a modified version
then update a pointer (which can be done atomically).
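
For example, a rough sketch of that idiom in Haskell, with an IORef as the
single mutable pointer and an ordinary immutable Map as the structure (the
names are only illustrative):

  import Data.IORef (IORef, atomicModifyIORef, newIORef, readIORef)
  import qualified Data.Map as Map

  -- The writer builds a new version of the immutable map and atomically
  -- swings the one mutable pointer to it; readers just dereference the
  -- pointer and see either the old or the new version, never a partial one.
  insertAtomic :: Ord k => IORef (Map.Map k v) -> k -> v -> IO ()
  insertAtomic ref k v = atomicModifyIORef ref (\m -> (Map.insert k v m, ()))

  main :: IO ()
  main = do
    ref <- newIORef (Map.empty :: Map.Map String Int)
    insertAtomic ref "answer" 42
    readIORef ref >>= print . Map.toList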

Ultimately, it's entirely common for parallelisation to incur a
performance penalty (i.e. N cores don't provide an N-fold speed increase),
whether due to locking, communication, duplication of effort or whatever.
Jon Harrop
2009-11-17 04:16:37 UTC
Post by Nobody
Post by Jon Harrop
Post by Vend
Post by Jon Harrop
Threads do have their own nursery generations in production quality
collectors but heavy allocation still kills scalability.
If you have n cores with one thread per core, and you divide the heap into n
areas, one per thread, how does heavy allocation affect scalability?
You are describing distributed parallelism without a shared heap which is
a different scenario. In that case, allocation does scale but
communication does not. Specifically, the advantage of a shared heap is
that one thread can pass data to another thread by reference. Without a
shared heap you must resort to copying the data (usually into a message),
which stresses memory bandwidth and introduces contention for the thread
that owns the data, both of which kill scalability.
Having multiple memory banks tied to distinct processors doesn't mean that
you can't access a bank from another processor, just that it may be less
efficient (but it's unlikely to be less efficient than unconditionally
copying the data).
Similarly, if you have a single CPU and a single memory bank but give each
thread its own arena, you can pass data between threads while avoiding
contention for allocation.
Sure but communicating updates to those data structures through the owning
thread is now the source of contention.
Post by Nobody
Post by Jon Harrop
In essence, that Erlang-style design is great for distributed parallelism
but fails to take advantage of the shared-memory nature of a multicore.
Consequently, it is very slow in absolute terms.
Avoiding in-place modification may hurt cache locality, but nothing
prohibits a functional language implementation from using in-place
modification when it can detect that this is safe.
I appreciate the dream but even state-of-the-art FPL implementations are
nowhere near reaching it. Just the two line idiomatic Haskell "quicksort"
is 6,000x slower than it should be precisely because GHC is too dumb to
work out how to use in-place modification.
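
For reference, the list-based version usually meant here is the one that
allocates fresh lists at every partition instead of partitioning in place:

  qsort :: Ord a => [a] -> [a]
  qsort []     = []
  qsort (x:xs) = qsort [a | a <- xs, a < x] ++ [x] ++ qsort [a | a <- xs, a >= x]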
Post by Nobody
The "Clean" language explicitly supports this through uniqueness typing,
but it's still considered to be a functional language.
Aren't uniqueness types very limited in what they can express? For example,
could you reimplement one of the performant scalable concurrent Java hash
tables touted by Azul in Clean using uniqueness types?
Post by Nobody
OTOH, even in imperative languages, it's common to use "functional" data
structures to simplify atomic updates. I.e. rather than locking a complex
data structure while it is updated in-place, create a modified version
then update a pointer (which can be done atomically).
Yes.
Post by Nobody
Ultimately, it's entirely common for parallelisation to incur a
performance penalty (i.e. N cores don't provide an N-fold speed increase),
whether due to locking, communication, duplication of effort or whatever.
Sure.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Nobody
2009-11-17 18:55:48 UTC
Post by Jon Harrop
Post by Nobody
Similarly, if you have a single CPU and a single memory bank but give each
thread its own arena, you can pass data between threads while avoiding
contention for allocation.
Sure but communicating updates to those data structures through the owning
thread is now the source of contention.
To the extent that's true, it's an argument against mutable data
structures and in favour of immutable ones.

With a "functional" structure, the writer creates a new version of the
structure then communicates the updated "root" to the reader(s). With a
mutable structure, the readers have to be locked out while modification is
in progress to prevent them from seeing an intermediate state.
Jon Harrop
2009-11-18 01:07:49 UTC
Post by Nobody
Post by Jon Harrop
Post by Nobody
Similarly, if you have a single CPU and a single memory bank but give
each thread its own arena, you can pass data between threads while
avoiding contention for allocation.
Sure but communicating updates to those data structures through the
owning thread is now the source of contention.
To the extent that's true, it's an argument against mutable data
structures and in favour of immutable ones.
With a "functional" structure, the writer creates a new version of the
structure then communicates the updated "root" to the reader(s).
How do you do that without having pointers between your heaps and without
having contention on the thread that owns the heap containing the data
structure?
Post by Nobody
With a mutable structure, the readers have to be locked out while
modification is in progress to prevent them from seeing an intermediate
state.
Not true in general. Concurrent hash tables are a counterexample.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Vend
2009-11-19 17:49:10 UTC
Post by Jon Harrop
Post by Nobody
With a mutable structure, the readers have to be locked out while
modification is in progress to prevent them from seeing an intermediate
state.
Not true in general. Concurrent hash tables are a counter example.
The problem with shared mutable data structures is that atomicity of
operations doesn't compose:

Say you have two concurrent shared hash tables.
You want to remove a (key, value) pair from one table and insert it into
the other one atomically (no other thread should see a state with the
pair in both tables or in neither).
How do you do that without using locks or transactional memory?
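
With transactional memory, by contrast, the two operations do compose; a
rough sketch in GHC Haskell's STM, using TVars holding immutable maps as
stand-ins for the two tables (the names are illustrative):

  import Control.Concurrent.STM
    (STM, TVar, atomically, newTVarIO, readTVar, writeTVar)
  import qualified Data.Map as Map

  -- Move a key/value pair from one table to the other as one transaction:
  -- no other thread can observe the pair in both tables or in neither.
  move :: Ord k => TVar (Map.Map k v) -> TVar (Map.Map k v) -> k -> STM ()
  move from to k = do
    m <- readTVar from
    case Map.lookup k m of
      Nothing -> return ()
      Just v  -> do
        tm <- readTVar to
        writeTVar from (Map.delete k m)
        writeTVar to (Map.insert k v tm)

  main :: IO ()
  main = do
    a <- newTVarIO (Map.fromList [("x", 1 :: Int)])
    b <- newTVarIO Map.empty
    atomically (move a b "x")
    ma <- atomically (readTVar a)
    mb <- atomically (readTVar b)
    print (Map.toList ma, Map.toList mb)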
Vend
2009-11-18 00:04:23 UTC
Post by Nobody
Avoiding in-place modification may reduce cache coherence, but nothing
prohibits a functional language implementation from using in-place
modification when it can detect that this is safe. The "Clean" language
explicitly supports this through uniqueness typing, but it's still
considered to be a functional language.
Yes, but obviously shared structures can't be unique (single-
referenced).
Torben Ægidius Mogensen
2009-11-16 13:09:42 UTC
Post by Jon Harrop
Avoiding mutation is widely hailed as the design of choice for parallel
programming and this is used to advocate the adoption of functional
programming. However, functional programming seems to have some serious
disadvantages in the context of shared-memory parallelism for multicores.
Firstly, functional data structures are more allocation intensive than
imperative data structures. Allocation is access to a global mutable
resource (the shared heap). Even with a mostly-concurrent collector, any
thread allocating heavily degrades the performance of all other threads and
destroys scalability across the entire system.
You can solve this in various ways. One is to use both local and shared
heaps and allocate shared data structures on the shared heap. You need
to know which data structures are shared, which you can do in several
ways:

1. Copy data to the shared heap when it becomes shared (i.e., when you
pass a pointer to another thread). This costs, but only once.
Since data structures are immutable, the local threads can use their
own local copies even after copying to the shared heap. You can do
the copying lazily by having nodes in the shared heap which are
"pointers" into local heaps and then request copying of more nodes
when accessed. This is most easily done in languages that already
support lazy evaluation.

2. Use a program analysis to predict sharing. This requires
whole-program analysis and can be fragile.

3. Let types distinguish shared and local data. This is simple and
efficient, and you can let type inference handle most of the
analysis, except in module interfaces.
Post by Jon Harrop
Secondly, functional data structures are more memory intensive than
imperative data structures.
How so? A functional array does not take more space than an imperative
ditto, nor does a functional struct take up more space than an
imperative ditto.
Post by Jon Harrop
So it seems that eschewing mutation may be a fundamentally bad idea in the
context of parallelism.
Shared memory doesn't scale, so you are essentially looking at a model
that will go away in the future anyway.

In any case, you have "forgotten" the benefits of immutability to
parallelism:

- You can have multiple local copies of a data structure without
worrying about inconsistency.

- Evaluation order is much more flexible.

Torben
Jon Harrop
2009-11-17 03:55:37 UTC
Post by Torben Ægidius Mogensen
You can solve this in various ways. One is to use both local and shared
heaps and allocate shared data structures on the shared heap. You need
to know which data structures are shared, which you can do in several ways:
1. Copy data to the shared heap when it becomes shared (i.e., when you
pass a pointer to another thread). This costs, but only once.
Since data structures are immutable, the local threads can use their
own local copies even after copying to the shared heap. You can do
the copying lazily by having nodes in the shared heap which are
"pointers" into local heaps and then request copying of more nodes
when accessed. This is most easily done in languages that already
support lazy evaluation.
2. Use a program analysis to predict sharing. This requires
whole-program analysis and can be fragile.
3. Let types distinguish shared and local data. This is simple and
efficient, and you can let type inference handle most of the
analysis, except in module interfaces.
Very interesting, thank you.
Post by Torben Ægidius Mogensen
Post by Jon Harrop
Secondly, functional data structures are more memory intensive than
imperative data structures.
How so? A functional array does not take more space than an imperative
ditto, nor does a functional struct take up more space than an
imperative ditto.
Assuming you mean an immutable array, it doesn't offer the same
functionality. Specifically, fast update. A purely functional data
structure with fast update (e.g. a balanced binary tree) would not be an
array and would take up more space than an array.
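
A rough Haskell illustration of the trade-off (sizes are arbitrary):

  import Data.Array (Array, listArray, (!), (//))
  import qualified Data.Map as Map

  -- Tree version: the update copies O(log n) nodes and shares the rest,
  -- but every element sits behind pointers.
  treeV1, treeV2 :: Map.Map Int Int
  treeV1 = Map.fromList [(i, i) | i <- [0 .. 9999]]
  treeV2 = Map.insert 42 0 treeV1

  -- Array version: flat and compact, but "updating" one slot copies all
  -- 10,000 elements into a fresh array.
  arrV1, arrV2 :: Array Int Int
  arrV1 = listArray (0, 9999) [0 .. 9999]
  arrV2 = arrV1 // [(42, 0)]

  main :: IO ()
  main = print (Map.lookup 42 treeV2, arrV1 ! 42, arrV2 ! 42)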
Post by Torben Ægidius Mogensen
Post by Jon Harrop
So it seems that eschewing mutation may be a fundamentally bad idea in the
context of parallelism.
Shared memory doesn't scale, so you are essentially looking at a model
that will go away in the future anyway.
I don't think so. Imagine a future where we have millions of cores and a
dozen levels of caches. They will still be hierarchical and there will
still be clusters of local cores wherein shared memory will be the most
efficient solution for many problems within each cluster. This is not
unlike a cluster of multicore machines today.
Post by Torben Ægidius Mogensen
In any case, you have "forgotten" the benefits of immutability to parallelism:
- You can have multiple local copies of a data structure without
worrying about inconsistency.
- Evaluation order is much more flexible.
Why are those beneficial?
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Michael Ekstrand
2009-11-17 04:11:58 UTC
Post by Jon Harrop
Post by Torben Ægidius Mogensen
You can solve this in various ways. One is to use both local and shared
heaps and allocate shared data structures on the shared heap. You need
to know which data structures are shared, which you can do in several ways:
1. Copy data to the shared heap when it becomes shared (i.e., when you
pass a pointer to another thread). This costs, but only once.
Since data structures are immutable, the local threads can use their
own local copies even after copying to the shared heap. You can do
the copying lazily by having nodes in the shared heap which are
"pointers" into local heaps and then request copying of more nodes
when accessed. This is most easily done in languages that already
support lazy evaluation.
2. Use a program analysis to predict sharing. This requires
whole-program analysis and can be fragile.
3. Let types distinguish shared and local data. This is simple and
efficient, and you can let type inference handle most of the
analysis, except in module interfaces.
Very interesting, thank you.
Post by Torben Ægidius Mogensen
Post by Jon Harrop
Secondly, functional data structures are more memory intensive than
imperative data structures.
How so? A functional array does not take more space than an imperative
ditto, nor does a functional struct take up more space than an
imperative ditto.
Assuming you mean an immutable array, it doesn't offer the same
functionality. Specifically, fast update. A purely functional data
structure with fast update (e.g. a balanced binary tree) would not be an
array and would take up more space than an array.
Post by Torben Ægidius Mogensen
Post by Jon Harrop
So it seems that eschewing mutation may be a fundamentally bad idea in the
context of parallelism.
Shared memory doesn't scale, so you are essentially looking at a model
that will go away in the future anyway.
I don't think so. Imagine a future where we have millions of cores and a
dozen levels of caches. They will still be hierarchical and there will
still be clusters of local cores wherein shared memory will be the most
efficient solution for many problems within each cluster. This is not
unlike a cluster of multicore machines today.
Post by Torben Ægidius Mogensen
In any case, you have "forgotten" the benefits of immutability to parallelism:
- You can have multiple local copies of a data structure without
worrying about inconsistency.
- Evaluation order is much more flexible.
Why are those beneficial?
The first is beneficial because it lets you cache without worrying about
cache coherency. Forbidding mutations means the cache will never be
invalidated by the actions of another thread/task/process.

One benefit of flexible evaluation order is the potential for
micro-optimization and perhaps parallelization. If the language doesn't
guarantee evaluation order, then code also shouldn't notice that
expressions happen to be evaluated in parallel. This would happen if
you combine flexible evaluation order with the way lazy values are
implemented in Mozart/Oz - forcing a lazy value spawns a thread to
compute the value and blocks on the resulting computation. If you force
all parameters in parallel, a fully concurrent runtime could then
compute them in parallel and get away with it because there are no
evaluation-order problems.
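
A rough analogue in Haskell, assuming the GHC "parallel" package (and
compiling with -threaded): because both arguments are pure, no code can
observe which is evaluated first, so the runtime is free to evaluate them
side by side:

  import Control.Parallel (par, pseq)

  fib :: Int -> Integer
  fib n | n < 2     = fromIntegral n
        | otherwise = fib (n - 1) + fib (n - 2)

  main :: IO ()
  main =
    let a = fib 30
        b = fib 31
    in  a `par` (b `pseq` print (a + b))  -- spark a, force b, then combine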

With the caveat of array mutations, I think Torben makes a good case for
a parallel/concurrent programming method built around immutable state.
In fact, it aligns very closely with some of my own thoughts on how to
build such a system.

- Michael