Discussion:
multi-core software
(too old to reply)
Xah Lee
2009-06-04 16:46:44 UTC
Permalink
Of interest:

• Why Must Software Be Rewritten For Multi-Core Processors?
http://xahlee.org/UnixResource_dir/writ/multi-core_software.html

plain text version follows.

--------------------------------------------------
Why Must Software Be Rewritten For Multi-Core Processors?

Xah Lee, 2009-06-04

I had a revelation today, namely, that it is necessary to rewrite
software to use multi-processor in order to benefit from it.

This may sound stupid, but is a revelation to me. For the past decade,
the question has been on my mind, about why should software needs to
be rewritten to take advantage of multi-processors. Because, in my
mind, i thought that software are at some fundamental level just
algorithms, and algorithms, have nothing to do with hardware
implementation aspects such as number of processors. I always felt,
that those talks about the need or difficulty of rewriting software
for multi-processor (or multi-core these days) must be the product of
idiocy of industrial imperative coding monkies. In particular, some
languages such as java, the way they deal with it, seems to me
extremely stupid. e.g. the concept of threads. In my mind, there
should be a layer between the software and the hardware, such as the
operating system, or the compiler, that should be able to
automatically handle or compile the software so that it FULLY use the
multi-processors when present. In short, i thought that a algorithm
really has nothing to do with hardware details.

I never really thought hard about this issue, but today, since i got a
quad-core PC, so i looked into the issue, and thought about it, and i
realized the answer. The gist is that, algorithm, fundamentally means
manipulating some hardware, in fact, algorithm is a step by step
instruction about some form of hardware, even the hardware may be
abstract or virtual. For example, let's say the simplest case of 1+1.
It is a algorithm, but where is the hardware? You may say it's
abstract concept, or it being a mathematical model. What you call 1+1
depends on the context, but in our context, those numbers are the
hardware. To see this, lets say more complex example of listing primes
by sieve. Again, it comes down to “what is a number”? Here, numbers
can be stones, or arrangement of beads on abacus, it's hardware! As
another example, say sorting. To begin with, you have to have some
something to sort, that's hardware.

Another way to see this is that, for a given computing problem, there
are infinite number of algorithms to achieve the same thing. Some,
will be better ones, requiring less steps, or requiring less storage.
All these are concrete manipulation issues, and the thing being
manipulated, ultimately we have to call it hardware. So, when hardware
changes, say from one cpu to multi-cpu, there's no way for the
algorithm to magically change and adopt the changed hardware. If you
need a algorithm that is catered to the new hardware, you need a new
algorithm.

One might think that there might be algorithm Omega, such that it
takes input of old hardware O, new hardware N, and a algorithm A, and
output a algorithm B, such that B has the same behavior as A, but N+B
performs better than O+A. This is like asking for Strong AI.

One thing we often forgot is that algorithms ultimately translates to
manipulating hardware. In a modern digital computer, that means
software algorithms ultimately becomes machine instructions in CPU,
which manipulate the 1s and 0s in register, or electricity voltage in
transisters.

In a more mundane point of view, a automatic system for software to
work on multi-processors is a problem of breaking a problem into
discrete units (so that they can be computed in parallel). The problem
of finding a algorithm is entirely different from the problem of
breaking a problem into distinct units. The problem of breaking a
problem into distinct units is a entire new branch of mathematics. For
example, let's say factoring. Factoring is a well defined mathematical
problem. There are millions algorithms to do it, each class has
different properties such as number of steps or storage units.
However, if we look at these algorithms from the point of view of
distinct units, it's a new perspective on classification of
algorithms. Software are in general too ill-defined and fuzzy and
complex, the software we use on personal computers such as email,
browsers, games, don't even have mathematical models. They don't even
have mathematical models of their inputs and outputs. To talk about
automatic system of unitizing software, would be more like a AI
fantasy. Roughly, a term that describes this aspect of research is
Parallel computing.

In the Wikipedia article, it talks about types of parallelism: Bit-
level parallelism, Instruction-level parallelism, Data parallelism,
Task parallelism. Then it also discusses hardware aspects classes:
multicore, symmetric multiprocessing, distributed computing, cluster,
grid. The subjects mentioned there more close to this essay are “data
parallelism” and “task parallelism”. However, neither are high level
enough as i discussed here. The issue discussed here would be like
“algorithmic parallelism”. Indeed, Wikipedia mentioned “Automatic
parallelization”, which is exactly what i'm talking about here. Quote:

Automatic parallelization of a sequential program by a compiler is
the holy grail of parallel computing. Despite decades of work by
compiler researchers, automatic parallelization has had only limited
success.[40]

Mainstream parallel programming languages remain either explicitly
parallel or (at best) partially implicit, in which a programmer gives
the compiler directives for parallelization. A few fully implicit
parallel programming languages exist — SISAL, Parallel Haskell, and
(for FPGAs) Mitrion-C.

It says “A few fully implicit parallel programming languages exist”.
If you are a comp lang nutcase, don't get carried away by what those
words might seem to mean.

(Note: Wikipedia has a dedicated article on the subject: Automatic
parallelization)

Xah
∑ http://xahlee.org/


Roedy Green
2009-06-04 18:35:03 UTC
Permalink
• Why Must Software Be Rewritten For Multi-Core Processors?
Threads have been part of Java since Day 1. Using threads complicates
your code, but even with a single core processor, they can improve
performance, particularly if you are doing something like combing
multiple websites.

The nice thing about Java is whether you are on a single core
processor or a 256 CPU machine (We got to run our Athena Integer Java
spreadsheet engine on such a beast), does not concern your code.

You just have to make sure your threads don't interfere with each
other, and Java/the OS, handle exploiting all the CPUs available.
--
Roedy Green Canadian Mind Products
http://mindprod.com

Never discourage anyone... who continually makes progress, no matter how slow.
~ Plato 428 BC died: 348 BC at age: 80
Kaz Kylheku
2009-06-04 19:37:59 UTC
Permalink
["Followup-To:" header set to comp.lang.lisp.]
Post by Roedy Green
• Why Must Software Be Rewritten For Multi-Core Processors?
Threads have been part of Java since Day 1.
Unfortunately, not sane threads designed by people who actually understand
multithreading.
Post by Roedy Green
The nice thing about Java is whether you are on a single core
processor or a 256 CPU machine (We got to run our Athena Integer Java
spreadsheet engine on such a beast), does not concern your code.
You are dreaming if you think that there are any circumstances (other than
circumstances in which performance doesn't matter) in which you don't have to
concern yourself about the difference between a uniprocessor and a 256 CPU
machine.
Seamus MacRae
2009-06-05 17:12:37 UTC
Permalink
Kaz Kylheku wrote:
[snip]

I see you're no longer confining yourself to the thread from hell, and
now randomly crashing other threads comp.lang.java.programmer to
badmouth Java and crosspost into cll.
Post by Kaz Kylheku
Unfortunately, not sane threads designed by people who actually understand
multithreading.
Then let me enlighten you! We have had some nice concurrency support
over here in Java-land since Java 5, and especially Java 6, what with
java.util.concurrent, nonblocking streams in NIO, and
SwingWorker/SwingUtilities.
Vend
2009-06-05 17:26:20 UTC
Permalink
Post by Xah Lee
• Why Must Software Be Rewritten For Multi-Core Processors?
Threads have been part of Java since Day 1.  Using threads complicates
your code, but even with a single core processor, they can improve
performance, particularly if you are doing something like combing
multiple websites.
The nice thing about Java is whether you are on a single core
processor or a 256 CPU machine (We got to run our Athena Integer Java
spreadsheet engine on such a beast), does not concern your code.
You just have to make sure your threads don't interfere with each
other, and Java/the OS, handle exploiting all the CPUs available.
You need to decompose your problem in 256 independent tasks.

It can be trivial for some problems and difficult or perhaps
impossible for some others.
Kaz Kylheku
2009-06-05 18:15:00 UTC
Permalink
Post by Vend
Post by Xah Lee
• Why Must Software Be Rewritten For Multi-Core Processors?
Threads have been part of Java since Day 1.  Using threads complicates
your code, but even with a single core processor, they can improve
performance, particularly if you are doing something like combing
multiple websites.
The nice thing about Java is whether you are on a single core
processor or a 256 CPU machine (We got to run our Athena Integer Java
spreadsheet engine on such a beast), does not concern your code.
You just have to make sure your threads don't interfere with each
other, and Java/the OS, handle exploiting all the CPUs available.
You need to decompose your problem in 256 independent tasks.
It can be trivial for some problems and difficult or perhaps
impossible for some others.
Even for problems where it appears trivial, there can be hidden
issues, like false cache coherency communication where no actual
sharing is taking place. Or locks that appear to have low contention and
negligible performance impact on ``only'' 8 processors suddenly turn into
bottlenecks. Then there is NUMA. A given address in memory may be
RAM attached to the processor accessing it, or to another processor,
with very different access costs.
Roedy Green
2009-06-05 23:26:37 UTC
Permalink
On Fri, 5 Jun 2009 18:15:00 +0000 (UTC), Kaz Kylheku
Post by Kaz Kylheku
Even for problems where it appears trivial, there can be hidden
issues, like false cache coherency communication where no actual
sharing is taking place. Or locks that appear to have low contention and
negligible performance impact on ``only'' 8 processors suddenly turn into
bottlenecks. Then there is NUMA. A given address in memory may be
RAM attached to the processor accessing it, or to another processor,
with very different access costs.
Could what you are saying be summed up by saying, "The more threads
you have the more important it is to keep your threads independent,
sharing as little data as possible."
--
Roedy Green Canadian Mind Products
http://mindprod.com

Never discourage anyone... who continually makes progress, no matter how slow.
~ Plato 428 BC died: 348 BC at age: 80
Vend
2009-06-06 00:49:38 UTC
Permalink
Post by Roedy Green
On Fri, 5 Jun 2009 18:15:00 +0000 (UTC), Kaz Kylheku
Post by Kaz Kylheku
Even for problems where it appears trivial, there can be hidden
issues, like false cache coherency communication where no actual
sharing is taking place. Or locks that appear to have low contention and
negligible performance impact on ``only'' 8 processors suddenly turn into
bottlenecks. Then there is NUMA. A given address in memory may be
RAM attached to the processor accessing it, or to another processor,
with very different access costs.
Could what you are saying be summed up by saying, "The more threads
you have the more important it is to keep your threads independent,
sharing as little data as possible."
Besides technical issues such as cache conflicts and synchronization
latencies, there are more theoretical issues of task decomposability.
It seems it is not always feasible to decompose an algorithm into
subprograms that can be executed in parallel.
Raymond Wiker
2009-06-06 08:59:06 UTC
Permalink
Post by Roedy Green
On Fri, 5 Jun 2009 18:15:00 +0000 (UTC), Kaz Kylheku
Post by Kaz Kylheku
Even for problems where it appears trivial, there can be hidden
issues, like false cache coherency communication where no actual
sharing is taking place. Or locks that appear to have low contention and
negligible performance impact on ``only'' 8 processors suddenly turn into
bottlenecks. Then there is NUMA. A given address in memory may be
RAM attached to the processor accessing it, or to another processor,
with very different access costs.
Could what you are saying be summed up by saying, "The more threads
you have the more important it is to keep your threads independent,
sharing as little data as possible."
Absolutely... not a new observation, either, as it follows
directly from Amdahl's law.
George Neuner
2009-06-06 19:46:51 UTC
Permalink
On Fri, 05 Jun 2009 16:26:37 -0700, Roedy Green
Post by Roedy Green
On Fri, 5 Jun 2009 18:15:00 +0000 (UTC), Kaz Kylheku
Post by Kaz Kylheku
Even for problems where it appears trivial, there can be hidden
issues, like false cache coherency communication where no actual
sharing is taking place. Or locks that appear to have low contention and
negligible performance impact on ``only'' 8 processors suddenly turn into
bottlenecks. Then there is NUMA. A given address in memory may be
RAM attached to the processor accessing it, or to another processor,
with very different access costs.
Could what you are saying be summed up by saying, "The more threads
you have the more important it is to keep your threads independent,
sharing as little data as possible."
And therein lies the problem of leveraging many cores. There is a lot
of potential parallelism in programs (even in Java :) that is lost
because it is too fine a grain for threads. Even the lightest weight
user space ("green") threads need a few hundred instructions, minimum,
to amortize the cost of context switching.

Add to that the fact that programmers have shown themselves, on
average, to be remarkably bad at figuring out what _should_ be done in
parallel - as opposed to what _can_ be done - and you've got a clear
indicator that threads, as we know them, are not scalable except under
a limited set of conditions.

George
John Thingstad
2009-06-06 20:47:32 UTC
Permalink
På Sat, 06 Jun 2009 21:46:51 +0200, skrev George Neuner
Post by George Neuner
On Fri, 05 Jun 2009 16:26:37 -0700, Roedy Green
Add to that the fact that programmers have shown themselves, on
average, to be remarkably bad at figuring out what _should_ be done in
parallel - as opposed to what _can_ be done - and you've got a clear
indicator that threads, as we know them, are not scalable except under
a limited set of conditions.
George
I find the dataflow model of concurrency on Oz to be interesting and to
address many of the issues you just mentioned.
See in particular: 'Dataflow variables and declarative concurrency' and
onward.
http://en.wikipedia.org/wiki/Oz_(programming_language)


---------------------
John Thingstad
Paul Rubin
2009-06-07 02:58:55 UTC
Permalink
Post by George Neuner
Even the lightest weight
user space ("green") threads need a few hundred instructions, minimum,
to amortize the cost of context switching.
I thought the definition of green threads was that multiplexing them
doesn't require context switches.
Jeff M.
2009-06-07 03:08:29 UTC
Permalink
Post by Paul Rubin
Post by George Neuner
Even the lightest weight
user space ("green") threads need a few hundred instructions, minimum,
to amortize the cost of context switching.
I thought the definition of green threads was that multiplexing them
doesn't require context switches.
There's always a context switch. It's just whether or not you are
switching in/out a virtual stack and registers for the context or the
hardware stack/registers.

Jeff M.
Paul Rubin
2009-06-07 06:56:52 UTC
Permalink
Post by Jeff M.
Post by George Neuner
Even the lightest weight
user space ("green") threads need a few hundred instructions, minimum,
to amortize the cost of context switching....
There's always a context switch. It's just whether or not you are
switching in/out a virtual stack and registers for the context or the
hardware stack/registers.
I don't see the hundreds of instructions in that case.

http://shootout.alioth.debian.org/u32q/benchmark.php?test=threadring&lang=ghc&id=3

shows GHC doing 50 million lightweight thread switches in 8.47
seconds, passing a token around a thread ring. Almost all of that is
probably spent acquiring and releasing the token's lock as the token
is passed from one thread to another. That simply doesn't leave time
for hundreds of instructions per switch.
Scott David Daniels
2009-06-07 12:18:48 UTC
Permalink
Post by Paul Rubin
Post by Jeff M.
Post by George Neuner
Even the lightest weight
user space ("green") threads need a few hundred instructions, minimum,
to amortize the cost of context switching....
There's always a context switch. It's just whether or not you are
switching in/out a virtual stack and registers for the context or the
hardware stack/registers.
I don't see the hundreds of instructions in that case.
http://shootout.alioth.debian.org/u32q/benchmark.php?test=threadring&lang=ghc&id=3
shows GHC doing 50 million lightweight thread switches in 8.47
seconds, passing a token around a thread ring. Almost all of that is
probably spent acquiring and releasing the token's lock as the token
is passed from one thread to another. That simply doesn't leave time
for hundreds of instructions per switch.
Remember, he said, "to amortize the cost of the context switch," not to
perform the switch itself. I stutter-stepped there myself. One not
completely obvious place is in blowing the instruction and likely the
data) cache. I suspect it is tough to measure when that cost applies,
but I expect he's likely right, except on artificial benchmarks, and
the nub of the problem is not on the benchmarks. There is something
to be said for the good old daays when you looked up the instruction
timings that you used in a little document for your machine, and could
know the cost of any loop. We are faster now, but part of the cost of
that speed is that timing is a black art. Perhaps modern operating
systems need the syyem call that was implemented in the Stanfor AI Lab's
operating system -- phase of the moon.

--Scott David Daniels
***@Acm.Org
Lew
2009-06-07 15:16:46 UTC
Permalink
Post by Scott David Daniels
the nub of the problem is not on the benchmarks. There is something
to be said for the good old daays when you looked up the instruction
timings that you used in a little document for your machine, and could
know the cost of any loop. We are faster now, but part of the cost of
that speed is that timing is a black art.
Those good old days never existed. Those manuals never accounted for things
that affected timing even then, like memory latency or refresh time. SRAM
cache made things worse, since the published timings never mentioned
cache-miss delays. Though memory cache might seem a recent innovation, it's
been around a while. It would be challenging to find any published timing
since the commercialization of computers that would actually tell the cost of
any loop.

Things got worse when chips like the '86 family acquired multiple instructions
for doing loops, still worse when pre-fetch pipelines became deeper and wider,
absolutely Dark Art due to multi-level memory caches becoming universal, and
throw-your-hands-up-and-leave-for-the-corner-bar with multiprocessor NUMA
systems. OSes and high-level languages complicate the matter - you never know
how much time slice you'll get or how your source got compiled or optimized by
run-time.

So the good old days are a matter of degree and self-deception - it was easier
to fool ourselves then that we could at least guess timings proportionately if
not absolutely, but things definitely get more unpredictable over evolution.
--
Lew
Scott David Daniels
2009-06-07 15:46:28 UTC
Permalink
Post by Lew
Post by Scott David Daniels
the nub of the problem is not on the benchmarks. There is something
to be said for the good old daays when you looked up the instruction
timings that you used in a little document for your machine, and could
know the cost of any loop. We are faster now, but part of the cost of
that speed is that timing is a black art.
Those good old days never existed. Those manuals never accounted for
things that affected timing even then, like memory latency or refresh
time.
Well, as Gilbert and Sullivan wrote:
- What, never?
- No, never!
- What, "Never"?
- Well, hardly ever.
Look up the LGP-30. It was quite predictable. It has been a while.

--Scott David Daniels
***@Acm.Org
Jon Harrop
2009-06-07 18:43:25 UTC
Permalink
Post by Scott David Daniels
Post by Lew
Post by Scott David Daniels
the nub of the problem is not on the benchmarks. There is something
to be said for the good old daays when you looked up the instruction
timings that you used in a little document for your machine, and could
know the cost of any loop. We are faster now, but part of the cost of
that speed is that timing is a black art.
Those good old days never existed. Those manuals never accounted for
things that affected timing even then, like memory latency or refresh
time.
- What, never?
- No, never!
- What, "Never"?
- Well, hardly ever.
Look up the LGP-30. It was quite predictable. It has been a while.
Same for early ARMs.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Ken T.
2009-06-07 22:11:11 UTC
Permalink
Post by Lew
So the good old days are a matter of degree and self-deception - it was
easier to fool ourselves then that we could at least guess timings
proportionately if not absolutely, but things definitely get more
unpredictable over evolution.
As I recall I could get exact timings on my 6502 based Commodore 64. The
issues you speak of simply weren't issues.
--
Ken T.
http://www.electricsenator.net

Never underestimate the power of stupid people in large groups.
Jon Harrop
2009-06-08 01:39:40 UTC
Permalink
Post by Ken T.
Post by Lew
So the good old days are a matter of degree and self-deception - it was
easier to fool ourselves then that we could at least guess timings
proportionately if not absolutely, but things definitely get more
unpredictable over evolution.
As I recall I could get exact timings on my 6502 based Commodore 64. The
issues you speak of simply weren't issues.
Let's not forget Elite for the 6502 exploiting predictable performance in
order to switch graphics modes partway down the vsync!
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Ken T.
2009-06-08 04:28:20 UTC
Permalink
Post by Jon Harrop
Post by Ken T.
Post by Lew
So the good old days are a matter of degree and self-deception - it
was easier to fool ourselves then that we could at least guess timings
proportionately if not absolutely, but things definitely get more
unpredictable over evolution.
As I recall I could get exact timings on my 6502 based Commodore 64.
The issues you speak of simply weren't issues.
Let's not forget Elite for the 6502 exploiting predictable performance
in order to switch graphics modes partway down the vsync!
That actually didn't require predictable timing. You could tell the
video chip to send you an interrupt when it got to a given scan line. I
used this myself.

Elite was cool though. I wasted many hours playing that game.
--
Ken T.
http://www.electricsenator.net

Duct tape is like the force. It has a light side, and a dark side,
and it holds the universe together ...
-- Carl Zwanzig
r***@mpi-sws.org
2009-06-08 08:05:44 UTC
Permalink
Post by Jon Harrop
Let's not forget Elite for the 6502 exploiting predictable performance
in order to switch graphics modes partway down the vsync!
That actually didn't require predictable timing.  You could tell the
video chip to send you an interrupt when it got to a given scan line.  I
used this myself.  
I don't know what Elite did, but I know for sure that it was a common
trick on the Atari ST to switch color palettes or graphics mode at a
fixed point *in each single scan line* to get more colors, or display
graphics on the screen borders. That required "synchronous
programming", i.e. counting clock cycles of machine instructions such
that for every point in the program you knew exactly where the
electron ray would be.

The Atari ST had an M68000 with exactly 8 MHz, which made this
possible. There were no caches in those times, and clock cycles were
entirely predictable.
Paul Wallich
2009-06-08 14:33:36 UTC
Permalink
Post by r***@mpi-sws.org
Post by Ken T.
Post by Jon Harrop
Let's not forget Elite for the 6502 exploiting predictable performance
in order to switch graphics modes partway down the vsync!
That actually didn't require predictable timing. You could tell the
video chip to send you an interrupt when it got to a given scan line. I
used this myself.
I don't know what Elite did, but I know for sure that it was a common
trick on the Atari ST to switch color palettes or graphics mode at a
fixed point *in each single scan line* to get more colors, or display
graphics on the screen borders. That required "synchronous
programming", i.e. counting clock cycles of machine instructions such
that for every point in the program you knew exactly where the
electron ray would be.
The Atari ST had an M68000 with exactly 8 MHz, which made this
possible. There were no caches in those times, and clock cycles were
entirely predictable.
The usual trick for these machines was an exact multiple of the NTSC
"color clock", which was approx 3.58 MHz. The 8-bit atari video games
and home computers all used this technique, as did the C-64/128.
68000-based machines (such as the ST and the Amiga) could not only
exploit that synchrony, they could also (this was the days before memory
wall) exploit the fact that a 680x0 typically accessed memory only once
every 4 clock cycles to do DMA from the same memory when the CPU wasn't
using it. High display resolutions would lock the processor out of RAM
except during blanking intervals. (Talk about contention and hot spots.)

Figuring out how to reuse resources most effectively was pretty much the
same as the register-allocation problem for compilers, and was sometimes
solved using the same kinds of graph-coloring algorithms...
Jeff M.
2009-06-07 14:23:56 UTC
Permalink
Post by Jeff M.
Post by George Neuner
Even the lightest weight
user space ("green") threads need a few hundred instructions, minimum,
to amortize the cost of context switching....
There's always a context switch. It's just whether or not you are
switching in/out a virtual stack and registers for the context or the
hardware stack/registers.
I don't see the hundreds of instructions in that case.  
http://shootout.alioth.debian.org/u32q/benchmark.php?test=threadring&...
shows GHC doing 50 million lightweight thread switches in 8.47
seconds, passing a token around a thread ring.  Almost all of that is
probably spent acquiring and releasing the token's lock as the token
is passed from one thread to another.  That simply doesn't leave time
for hundreds of instructions per switch.
Who said there has to be? Sample code below (just to get the point
across):

struct context {
vir_reg pc, sp, bp, ... ;
object* stack;

// ...

context* next;
};

struct vm {
context* active_context;
};

void switch_context(vm* v)
{
// maybe GC v->active_context before switching

v->active_context = v->active_context->next;
}

Also, there isn't "hundreds of instructions" with multiplexing,
either. It's all done in hardware. Take a look at the disassembly for
any application: one that uses native threads on a platform that
supports preemption. You won't see any instructions anywhere in the
program that perform a context switch. If you did that would be
absolutely horrible. Imagine if the compiler did something like this:

while(1)
{
// whatever
}

do_context_switch_here();

That would suck. ;-)

That's not to imply that there isn't a cost; there's always a cost.
The example above just goes to show that for green threads, the cost
[of the switch] can be reduced down to a single pointer assignment.

Jeff M.
Jon Harrop
2009-06-07 15:21:58 UTC
Permalink
Post by Paul Rubin
Post by Jeff M.
Post by George Neuner
Even the lightest weight
user space ("green") threads need a few hundred instructions,
minimum, to amortize the cost of context switching....
There's always a context switch. It's just whether or not you are
switching in/out a virtual stack and registers for the context or the
hardware stack/registers.
I don't see the hundreds of instructions in that case.
http://shootout.alioth.debian.org/u32q/benchmark.php?test=threadring&lang=ghc&id=3
Post by Paul Rubin
shows GHC doing 50 million lightweight thread switches in 8.47
seconds, passing a token around a thread ring. Almost all of that is
probably spent acquiring and releasing the token's lock as the token
is passed from one thread to another. That simply doesn't leave time
for hundreds of instructions per switch.
And Haskell is not exactly fast...
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Jon Harrop
2009-06-07 15:20:56 UTC
Permalink
Post by George Neuner
On Fri, 05 Jun 2009 16:26:37 -0700, Roedy Green
Post by Roedy Green
On Fri, 5 Jun 2009 18:15:00 +0000 (UTC), Kaz Kylheku
Post by Kaz Kylheku
Even for problems where it appears trivial, there can be hidden
issues, like false cache coherency communication where no actual
sharing is taking place. Or locks that appear to have low contention and
negligible performance impact on ``only'' 8 processors suddenly turn into
bottlenecks. Then there is NUMA. A given address in memory may be
RAM attached to the processor accessing it, or to another processor,
with very different access costs.
Could what you are saying be summed up by saying, "The more threads
you have the more important it is to keep your threads independent,
sharing as little data as possible."
And therein lies the problem of leveraging many cores. There is a lot
of potential parallelism in programs (even in Java :) that is lost
because it is too fine a grain for threads.
That will always be true so it conveys no useful information to the
practitioner.
Post by George Neuner
Even the lightest weight
user space ("green") threads need a few hundred instructions, minimum,
to amortize the cost of context switching.
Work items in Cilk are much faster than that.
Post by George Neuner
Add to that the fact that programmers have shown themselves, on
average, to be remarkably bad at figuring out what _should_ be done in
parallel - as opposed to what _can_ be done - and you've got a clear
indicator that threads, as we know them, are not scalable except under
a limited set of conditions.
Parallelism is inherently not scalable. I see no merit in speculating about
the ramifications of "average" programmers alleged inabilities.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Jon Harrop
2009-06-07 15:14:56 UTC
Permalink
Post by Roedy Green
On Fri, 5 Jun 2009 18:15:00 +0000 (UTC), Kaz Kylheku
Post by Kaz Kylheku
Even for problems where it appears trivial, there can be hidden
issues, like false cache coherency communication where no actual
sharing is taking place. Or locks that appear to have low contention and
negligible performance impact on ``only'' 8 processors suddenly turn into
bottlenecks. Then there is NUMA. A given address in memory may be
RAM attached to the processor accessing it, or to another processor,
with very different access costs.
Could what you are saying be summed up by saying, "The more threads
you have the more important it is to keep your threads independent,
sharing as little data as possible."
I see no problem with mutable shared state.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Arved Sandstrom
2009-06-07 15:35:01 UTC
Permalink
Post by Jon Harrop
Post by Roedy Green
On Fri, 5 Jun 2009 18:15:00 +0000 (UTC), Kaz Kylheku
Post by Kaz Kylheku
Even for problems where it appears trivial, there can be hidden
issues, like false cache coherency communication where no actual
sharing is taking place. Or locks that appear to have low contention and
negligible performance impact on ``only'' 8 processors suddenly turn into
bottlenecks. Then there is NUMA. A given address in memory may be
RAM attached to the processor accessing it, or to another processor,
with very different access costs.
Could what you are saying be summed up by saying, "The more threads
you have the more important it is to keep your threads independent,
sharing as little data as possible."
I see no problem with mutable shared state.
In which case, Jon, you're in a small minority.

AHS
Jon Harrop
2009-06-07 18:41:43 UTC
Permalink
Post by Arved Sandstrom
Post by Jon Harrop
I see no problem with mutable shared state.
In which case, Jon, you're in a small minority.
No. Most programmers still care about performance and performance means
mutable state.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Joshua Cranmer
2009-06-07 19:35:34 UTC
Permalink
Post by Jon Harrop
No. Most programmers still care about performance and performance means
mutable state.
[ Citation needed ].

Most programmers I've met could care less about performance.
--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth
Jon Harrop
2009-06-07 21:38:33 UTC
Permalink
Post by Joshua Cranmer
Post by Jon Harrop
No. Most programmers still care about performance and performance means
mutable state.
[ Citation needed ].
Most programmers I've met could care less about performance.
Then they have no need for parallelism in the first place.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Arved Sandstrom
2009-06-07 20:19:16 UTC
Permalink
Post by Jon Harrop
Post by Arved Sandstrom
Post by Jon Harrop
I see no problem with mutable shared state.
In which case, Jon, you're in a small minority.
No. Most programmers still care about performance and performance means
mutable state.
Quite apart from performance and mutable state, I believe we were
talking about mutable _shared_ state. And this is something that gets a
_lot_ of people into trouble.

AHS
Jon Harrop
2009-06-07 21:29:06 UTC
Permalink
Post by Arved Sandstrom
Post by Jon Harrop
Post by Arved Sandstrom
Post by Jon Harrop
I see no problem with mutable shared state.
In which case, Jon, you're in a small minority.
No. Most programmers still care about performance and performance means
mutable state.
Quite apart from performance and mutable state, I believe we were
talking about mutable _shared_ state. And this is something that gets a
_lot_ of people into trouble.
Nonsense. Scientists have been writing parallel programs for decades using
shared state extensively without whining about it. Databases are mutable
shared state but millions of database programmers solve real problems every
day without whining about it.

Use your common sense and you can write efficient parallel programs today
with little difficulty. I do.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Matti Nyk~nen
2009-06-09 07:36:42 UTC
Permalink
Post by Jon Harrop
Databases are mutable
shared state but millions of database programmers solve real problems every
day without whining about it.
This one is in fact a *counterexample* to Jon's claim, since the
database engine serializes concurrent transactions: that is, the
engine schedules the transactions so that they look like they had been
executed independently, one after the the other. To uphold this
illusion, the engine may even decide to abort and restart perfectly
valid transactions. In other words, a database makes shared state look
like it isn't, incurring a hefty performance penalty.

Indeed, isn't Software Transactional Memory (STM) as in Haskell
essentially a RAM database with optimistic concurrency control as its
scheduling principle?
Post by Jon Harrop
Use your common sense and you can write efficient parallel programs today
with little difficulty. I do.
So you are using my common sense? So that's why I keep running out! 8^)
Jon Harrop
2009-06-09 16:23:52 UTC
Permalink
Post by Matti Nyk~nen
Post by Jon Harrop
Databases are mutable
shared state but millions of database programmers solve real problems
every day without whining about it.
This one is in fact a *counterexample* to Jon's claim, since the
database engine serializes concurrent transactions: that is, the
engine schedules the transactions so that they look like they had been
executed independently, one after the the other.
Once the database programmer has manually encoded atomic sequences of
operations into transactions.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
George Neuner
2009-06-09 20:04:51 UTC
Permalink
Post by Matti Nyk~nen
Indeed, isn't Software Transactional Memory (STM) as in Haskell
essentially a RAM database with optimistic concurrency control as its
scheduling principle?
The effect is as if you have a RAM database, but the implementation is
not as complex.

STM doesn't necessarily organize the in-memory data in any particular
way (as a DBMS might make tables, for example). STM just needs to
maintain a log of memory addresses accessed and the values read from
and/or written to them. When a transaction tries to commit, the log
of values originally read from memory is compared to the current
values. If the read values are the same, the writes are committed to
memory and the transaction completes successfully. If the read values
differ, the transaction must be re-executed or aborted because the
computation was performed with outdated values.

George
Jeff M.
2009-06-07 21:41:39 UTC
Permalink
Post by Arved Sandstrom
Post by Jon Harrop
Post by Arved Sandstrom
Post by Jon Harrop
I see no problem with mutable shared state.
In which case, Jon, you're in a small minority.
No. Most programmers still care about performance and performance means
mutable state.
Quite apart from performance and mutable state, I believe we were
talking about mutable _shared_ state. And this is something that gets a
_lot_ of people into trouble.
Mutable shared state gets _bad_ (err.. perhaps "inexperienced" would
be a better adjective) programmers - who don't know what they are
doing - in trouble. There are many problem domains that either benefit
greatly from mutable shared states or can't [easily] be done without
them. Unified memory management being an obvious example... there are
many more. Unshared state has its place. Immutable state has its
place. Shared immutable state has its place. Shared mutable place has
its place.

Jeff M.
Jon Harrop
2009-06-07 21:51:48 UTC
Permalink
Post by Jeff M.
Post by Arved Sandstrom
Post by Jon Harrop
Post by Arved Sandstrom
Post by Jon Harrop
I see no problem with mutable shared state.
In which case, Jon, you're in a small minority.
No. Most programmers still care about performance and performance means
mutable state.
Quite apart from performance and mutable state, I believe we were
talking about mutable _shared_ state. And this is something that gets a
_lot_ of people into trouble.
Mutable shared state gets _bad_ (err.. perhaps "inexperienced" would
be a better adjective) programmers - who don't know what they are
doing - in trouble. There are many problem domains that either benefit
greatly from mutable shared states or can't [easily] be done without
them. Unified memory management being an obvious example... there are
many more. Unshared state has its place. Immutable state has its
place. Shared immutable state has its place. Shared mutable place has
its place.
Exactly. I don't believe that shared mutable state is any harder to do
correctly than the next solution and it is almost always the most efficient
solution and the sole purpose of writing parallel programs to leverage
multicores is performance.

A bad developer can screw up anything. I see no reason to think that shared
mutable state is any more fragile than the next thing a bad developer can
screw up.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Patricia Shanahan
2009-06-07 21:04:49 UTC
Permalink
Post by Jon Harrop
Post by Arved Sandstrom
Post by Jon Harrop
I see no problem with mutable shared state.
In which case, Jon, you're in a small minority.
No. Most programmers still care about performance and performance means
mutable state.
I don't see why that would affect whether one thinks there are problems.

In my opinion, shared mutable state has a lot of problems. It is also
sometimes the best design for performance reasons.

Patricia
Lew
2009-06-07 21:45:13 UTC
Permalink
Post by Patricia Shanahan
Post by Arved Sandstrom
Post by Jon Harrop
I see no problem with mutable shared state.
In which case, Jon, you're in a small minority.
In my opinion, shared mutable state has a lot of problems. It is also
sometimes the best design for performance reasons.
As Dr. Jon pointed out upthread, one can write decent code with mutable shared
state. It is also true that mutable state presents a lot of problems -
potential problems, ones that can be solved, but not ones that can be solved
thoughtlessly. On the flip side, one can write a tremendous amount of
effective multi-threaded code involving shared mutable state with attention to
a few rules of thumb, like always synchronize access and don't use different
monitors to do so.

Unlike some environments (e.g., database management systems), Java's tools to
manage concurrency are explicit and low level. The programmer's job is to
make sure those tools are used correctly to avoid problems. As long as they
do that, then there is no special problem with shared mutable state.

There is, however, a cost. Certain things must happen slower when you share
mutable state, than when you share immutable state or don't share state.
Certain things must happen when you share mutable state, regardless of speed,
because without them your code doesn't work. For some reason, concurrent
programming is an area often not well understood by a significant percentage
of workaday programmers. When problems do arise, they tend to be
probabilistic in nature and vary widely with system characteristics like
attempted load.

So the meeting ground is, yes, concurrent mutable state can present problems
if not properly managed. Properly managing such is not necessarily a huge
burden, but it must be borne. When done properly, shared mutable state will
not present problems in production.
--
Lew
Jon Harrop
2009-06-07 22:03:49 UTC
Permalink
Post by Lew
As Dr. Jon pointed out upthread, one can write decent code with mutable shared
state. It is also true that mutable state presents a lot of problems -
potential problems, ones that can be solved, but not ones that can be solved
thoughtlessly. On the flip side, one can write a tremendous amount of
effective multi-threaded code involving shared mutable state with
attention to a few rules of thumb, like always synchronize access and
don't use different monitors to do so.
Unlike some environments (e.g., database management systems), Java's tools to
manage concurrency are explicit and low level. The programmer's job is to
make sure those tools are used correctly to avoid problems. As long as
they do that, then there is no special problem with shared mutable state.
There is, however, a cost. Certain things must happen slower when you
share mutable state, than when you share immutable state or don't share
state. Certain things must happen when you share mutable state, regardless
of speed,
because without them your code doesn't work. For some reason, concurrent
programming is an area often not well understood by a significant percentage
of workaday programmers. When problems do arise, they tend to be
probabilistic in nature and vary widely with system characteristics like
attempted load.
So the meeting ground is, yes, concurrent mutable state can present problems
if not properly managed. Properly managing such is not necessarily a huge
burden, but it must be borne. When done properly, shared mutable state
will not present problems in production.
I agree entirely but my statements were about parallelism and not
concurrency. Parallel and concurrent programming have wildly different
characteristics and solutions. I don't believe shared mutable state is
overly problematic in the context of parallelism. Indeed, I think it is
usually the best solution in that context.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Lew
2009-06-07 22:03:48 UTC
Permalink
Post by Jon Harrop
I agree entirely but my statements were about parallelism and not
concurrency. Parallel and concurrent programming have wildly different
characteristics and solutions. I don't believe shared mutable state is
overly problematic in the context of parallelism. Indeed, I think it is
usually the best solution in that context.
Interesting distinction. Would it be fair to compare concurrent programming
to the bricks used to build the parallel program's edifice?
--
Lew
Arved Sandstrom
2009-06-07 23:21:14 UTC
Permalink
Post by Lew
Post by Jon Harrop
I agree entirely but my statements were about parallelism and not
concurrency. Parallel and concurrent programming have wildly different
characteristics and solutions. I don't believe shared mutable state is
overly problematic in the context of parallelism. Indeed, I think it is
usually the best solution in that context.
Interesting distinction. Would it be fair to compare concurrent
programming to the bricks used to build the parallel program's edifice?
Way too much of a fine distinction. While they are in fact different,
the point of concurrent programming is to structure programs as a group
of computations, which can be executed in parallel (however that might
actually be done depending on how many processors there are). Parallel
computing means to carry out many computations simultaneously. These are
interleaved definitions. And they are *not* wildly different.

If you talk about shared mutable state, it is not as easy to use as Dr
Harrop seems to think it is. Maybe in his experience it has been, but in
general it's no trivial thing to manage. Lew, you probably summarized it
best a few posts upstream.

AHS
Jon Harrop
2009-06-07 23:59:52 UTC
Permalink
Post by Arved Sandstrom
Post by Lew
Interesting distinction. Would it be fair to compare concurrent
programming to the bricks used to build the parallel program's edifice?
Way too much of a fine distinction. While they are in fact different,
the point of concurrent programming is to structure programs as a group
of computations, which can be executed in parallel (however that might
actually be done depending on how many processors there are).
No. Concurrent programming is about interleaving computations in order to
reduce latency. Nothing to do with parallelism.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Arved Sandstrom
2009-06-08 01:42:08 UTC
Permalink
Post by Jon Harrop
Post by Arved Sandstrom
Post by Lew
Interesting distinction. Would it be fair to compare concurrent
programming to the bricks used to build the parallel program's edifice?
Way too much of a fine distinction. While they are in fact different,
the point of concurrent programming is to structure programs as a group
of computations, which can be executed in parallel (however that might
actually be done depending on how many processors there are).
No. Concurrent programming is about interleaving computations in order to
reduce latency. Nothing to do with parallelism.
Jon, I do concurrent programming all the time, as do most of my peers.
Way down on the list of why we do it is the reduction of latency.

AHS
Jon Harrop
2009-06-08 02:42:32 UTC
Permalink
Post by Arved Sandstrom
Post by Jon Harrop
Post by Arved Sandstrom
Post by Lew
Interesting distinction. Would it be fair to compare concurrent
programming to the bricks used to build the parallel program's edifice?
Way too much of a fine distinction. While they are in fact different,
the point of concurrent programming is to structure programs as a group
of computations, which can be executed in parallel (however that might
actually be done depending on how many processors there are).
No. Concurrent programming is about interleaving computations in order to
reduce latency. Nothing to do with parallelism.
Jon, I do concurrent programming all the time, as do most of my peers.
Way down on the list of why we do it is the reduction of latency.
What is higher on the list?
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Arved Sandstrom
2009-06-10 02:08:36 UTC
Permalink
Post by Jon Harrop
Post by Arved Sandstrom
Post by Jon Harrop
Post by Arved Sandstrom
Post by Lew
Interesting distinction. Would it be fair to compare concurrent
programming to the bricks used to build the parallel program's edifice?
Way too much of a fine distinction. While they are in fact different,
the point of concurrent programming is to structure programs as a group
of computations, which can be executed in parallel (however that might
actually be done depending on how many processors there are).
No. Concurrent programming is about interleaving computations in order to
reduce latency. Nothing to do with parallelism.
Jon, I do concurrent programming all the time, as do most of my peers.
Way down on the list of why we do it is the reduction of latency.
What is higher on the list?
Correctness.

I'm not being facetious. I write applications that run on application
servers, and from time to time I have had to write various special
purpose servers. This kind of programming is all about managing
concurrent execution of computations. The overarching concern is
reliability and correct function. For many corporate situations, even
with hundreds of users, the actual load at any instant is low enough
that the various servers involved are nowhere close to being stressed
out - performance is a secondary issue.

AHS
Jon Harrop
2009-06-10 14:44:05 UTC
Permalink
Post by Arved Sandstrom
Post by Jon Harrop
Post by Arved Sandstrom
Post by Jon Harrop
No. Concurrent programming is about interleaving computations in order
to reduce latency. Nothing to do with parallelism.
Jon, I do concurrent programming all the time, as do most of my peers.
Way down on the list of why we do it is the reduction of latency.
What is higher on the list?
Correctness.
I'm not being facetious. I write applications that run on application
servers, and from time to time I have had to write various special
purpose servers. This kind of programming is all about managing
concurrent execution of computations. The overarching concern is
reliability and correct function. For many corporate situations, even
with hundreds of users, the actual load at any instant is low enough
that the various servers involved are nowhere close to being stressed
out - performance is a secondary issue.
In other words, without concurrency the latency would be so high that you
would consider the program to be wrong. However you cut it, the real reason
is latency.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Paul Rubin
2009-06-10 15:22:27 UTC
Permalink
Post by Jon Harrop
Post by Arved Sandstrom
I'm not being facetious. I write applications that run on application
servers, and from time to time I have had to write various special
purpose servers. This kind of programming is all about managing
concurrent execution of computations. The overarching concern is
reliability and correct function. For many corporate situations, even
with hundreds of users, the actual load at any instant is low enough
that the various servers involved are nowhere close to being stressed
out - performance is a secondary issue.
In other words, without concurrency the latency would be so high
that you would consider the program to be wrong. However you cut it,
the real reason is latency.
I don't think that follows, if there is two-way communication and
dependency between the servers, combined with lack of control over
when any particular server decides to initiate an outgoing request.
Stuff may have to happen concurrently to avoid complete deadlock.
Arved Sandstrom
2009-06-10 23:55:00 UTC
Permalink
Post by Jon Harrop
Post by Arved Sandstrom
Post by Jon Harrop
Post by Arved Sandstrom
Post by Jon Harrop
No. Concurrent programming is about interleaving computations in order
to reduce latency. Nothing to do with parallelism.
Jon, I do concurrent programming all the time, as do most of my peers.
Way down on the list of why we do it is the reduction of latency.
What is higher on the list?
Correctness.
I'm not being facetious. I write applications that run on application
servers, and from time to time I have had to write various special
purpose servers. This kind of programming is all about managing
concurrent execution of computations. The overarching concern is
reliability and correct function. For many corporate situations, even
with hundreds of users, the actual load at any instant is low enough
that the various servers involved are nowhere close to being stressed
out - performance is a secondary issue.
In other words, without concurrency the latency would be so high that you
would consider the program to be wrong. However you cut it, the real reason
is latency.
For a certain group of applications and user loads I would concede that
point, yes.

For quite a few other situations, you could queue up user requests and
execute them in order, finishing each before proceeding to the next, and
users wouldn't even notice. I wrote a J2SE server a few months ago, to
solve a very specific problem associated with an application running on
a J2EE server, that could handle dozens of users per second using this
strategy. It didn't write it that way because doing so is perverse, but
I could have.

AHS
Jeff M.
2009-06-10 14:50:37 UTC
Permalink
Post by Arved Sandstrom
Post by Jon Harrop
Post by Arved Sandstrom
Jon, I do concurrent programming all the time, as do most of my peers.
Way down on the list of why we do it is the reduction of latency.
What is higher on the list?
Correctness.
IMO, that response is a bit of a cop-out. Correctness is _always_ most
important, no matter what application you are creating; without it,
you don't have a job and the company you work for goes out of
business.

But, assuming that your program works and does what it's supposed to,
I agree with Jon that performance needs to be right near the top of
the list of concerns. Why? Performance isn't about looking good as a
programmer, or having fun making a function run in 15 cycles instead
of 24, or coming up with some neat bit packing scheme so that your app
now only uses 20K instead of 200K. Performance is - pure and simple -
about one thing only: money.

Programs that use more memory require more money for the hardware of
every user. Programs that run slower eat more time per day. If you
have 100,000 users, all doing an operation once per day that takes 20
seconds, being able to shave 5 seconds off that saves 5.78 man-days of
work. Hell, for some applications, that 20 seconds is just startup
time spent at a splash screen. Just imagine if every Google search
took even 5 seconds to resolve, how much time would be wasted every
day around the world - ignoring the fact that Google wouldn't exist if
that were the case ;-). Obviously Google engineers work incredibly
hard every day to ensure correct results, but performance better be
right up there at the top of the list as well.

Jeff M.
Matthias Blume
2009-06-10 15:10:03 UTC
Permalink
Post by Jeff M.
Post by Arved Sandstrom
Post by Jon Harrop
Post by Arved Sandstrom
Jon, I do concurrent programming all the time, as do most of my peers.
Way down on the list of why we do it is the reduction of latency.
What is higher on the list?
Correctness.
IMO, that response is a bit of a cop-out. Correctness is _always_ most
important, no matter what application you are creating; without it,
you don't have a job and the company you work for goes out of
business.
But, assuming that your program works and does what it's supposed to,
I agree with Jon that performance needs to be right near the top of
the list of concerns. Why? Performance isn't about looking good as a
programmer, or having fun making a function run in 15 cycles instead
of 24, or coming up with some neat bit packing scheme so that your app
now only uses 20K instead of 200K. Performance is - pure and simple -
about one thing only: money.
Programmer time is vastly more expensive than CPU time, so the
money argument often leads to slow ("low performance") solutions as long
as they are "good enough" because developing a faster solution would
mean spending more valuable programmer time at a cost that cannot
be recovered over the life cycle of the product in question.

That being said, there are plenty of situations where performance
obviously does matter a great deal -- as you correctly pointed out.
(It all depends on the above mentioned "product in question" and the
nature of its life cycle.)

Matthias
Jon Harrop
2009-06-10 19:42:09 UTC
Permalink
Post by Matthias Blume
Post by Jeff M.
But, assuming that your program works and does what it's supposed to,
I agree with Jon that performance needs to be right near the top of
the list of concerns. Why? Performance isn't about looking good as a
programmer, or having fun making a function run in 15 cycles instead
of 24, or coming up with some neat bit packing scheme so that your app
now only uses 20K instead of 200K. Performance is - pure and simple -
about one thing only: money.
Programmer time is vastly more expensive than CPU time, so the
money argument often leads to slow ("low performance") solutions as long
as they are "good enough" because developing a faster solution would
mean spending more valuable programmer time at a cost that cannot
be recovered over the life cycle of the product in question.
In the context of commercial software, the money to fund developers to
improve performance comes from the huge marketing budget because
performance is usually more about marketing than anything else.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Seamus MacRae
2009-06-10 17:49:30 UTC
Permalink
Post by Jeff M.
Post by Arved Sandstrom
Post by Jon Harrop
Post by Arved Sandstrom
Jon, I do concurrent programming all the time, as do most of my peers.
Way down on the list of why we do it is the reduction of latency.
What is higher on the list?
Correctness.
IMO, that response is a bit of a cop-out. Correctness is _always_ most
important, no matter what application you are creating; without it,
you don't have a job and the company you work for goes out of
business.
And when, exactly, did Microsoft go out of business? I hadn't heard it
mentioned in the news. :)
Jeff M.
2009-06-10 18:07:12 UTC
Permalink
Post by Seamus MacRae
Post by Jeff M.
Post by Arved Sandstrom
Post by Jon Harrop
Post by Arved Sandstrom
Jon, I do concurrent programming all the time, as do most of my peers.
Way down on the list of why we do it is the reduction of latency.
What is higher on the list?
Correctness.
IMO, that response is a bit of a cop-out. Correctness is _always_ most
important, no matter what application you are creating; without it,
you don't have a job and the company you work for goes out of
business.
And when, exactly, did Microsoft go out of business? I hadn't heard it
mentioned in the news. :)
Touche. :)

Jeff M.
Dimiter "malkia" Stanev
2009-06-10 18:26:24 UTC
Permalink
Post by Jeff M.
Post by Arved Sandstrom
Post by Jon Harrop
Post by Arved Sandstrom
Jon, I do concurrent programming all the time, as do most of my peers.
Way down on the list of why we do it is the reduction of latency.
What is higher on the list?
Correctness.
IMO, that response is a bit of a cop-out. Correctness is _always_ most
important, no matter what application you are creating; without it,
you don't have a job and the company you work for goes out of
business.
PC / Video Games definitely fall out of the correctness. As long as the
game does not crash your XBOX/PS3/Whatever for certain amount of time,
and behaves well then, it's fine.

Bugs are already part of the "genre".

In reality you can't ship on time, there are always BUGS :)

Most important thing in games is (at least for large percent of them)
speed of graphics - fluid 60fps, or stable 30fps.
Post by Jeff M.
But, assuming that your program works and does what it's supposed to,
I agree with Jon that performance needs to be right near the top of
the list of concerns. Why? Performance isn't about looking good as a
programmer, or having fun making a function run in 15 cycles instead
of 24, or coming up with some neat bit packing scheme so that your app
now only uses 20K instead of 200K. Performance is - pure and simple -
about one thing only: money.
Programs that use more memory require more money for the hardware of
every user. Programs that run slower eat more time per day. If you
have 100,000 users, all doing an operation once per day that takes 20
seconds, being able to shave 5 seconds off that saves 5.78 man-days of
work. Hell, for some applications, that 20 seconds is just startup
time spent at a splash screen. Just imagine if every Google search
took even 5 seconds to resolve, how much time would be wasted every
day around the world - ignoring the fact that Google wouldn't exist if
that were the case ;-). Obviously Google engineers work incredibly
hard every day to ensure correct results, but performance better be
right up there at the top of the list as well.
Jeff M.
Arved Sandstrom
2009-06-11 00:06:08 UTC
Permalink
Post by Jeff M.
Post by Arved Sandstrom
Post by Jon Harrop
Post by Arved Sandstrom
Jon, I do concurrent programming all the time, as do most of my peers.
Way down on the list of why we do it is the reduction of latency.
What is higher on the list?
Correctness.
IMO, that response is a bit of a cop-out. Correctness is _always_ most
important, no matter what application you are creating; without it,
you don't have a job and the company you work for goes out of
business.
But, assuming that your program works and does what it's supposed to,
I agree with Jon that performance needs to be right near the top of
the list of concerns. Why? Performance isn't about looking good as a
programmer, or having fun making a function run in 15 cycles instead
of 24, or coming up with some neat bit packing scheme so that your app
now only uses 20K instead of 200K. Performance is - pure and simple -
about one thing only: money.
Programs that use more memory require more money for the hardware of
every user. Programs that run slower eat more time per day. If you
have 100,000 users, all doing an operation once per day that takes 20
seconds, being able to shave 5 seconds off that saves 5.78 man-days of
work. Hell, for some applications, that 20 seconds is just startup
time spent at a splash screen. Just imagine if every Google search
took even 5 seconds to resolve, how much time would be wasted every
day around the world - ignoring the fact that Google wouldn't exist if
that were the case ;-). Obviously Google engineers work incredibly
hard every day to ensure correct results, but performance better be
right up there at the top of the list as well.
Jeff M.
Point taken, but I primarily work on internal government and corporate
applications. Might be hundreds or thousands of users at any given time,
but not typically tens or hundreds of thousands. Hundreds or thousands
of users translates to at least an order of magnitude less
"simultaneous" users. Ops people that I talk to who monitor apps I've
worked on, or similar apps, rarely report any such where the application
itself is presenting a sluggish aspect to users because it's stressing
out processors, or consuming gargantuan memory.

Typically when latency problems come up, it's because the app has to
talk to other servers - databases, LDAP, print servers etc. In other
words, the network is coming into play. In fact when we do work on
performance, it's typically about minimizing calls to the database or
other services.

AHS

Piet van Oostrum
2009-06-08 08:35:34 UTC
Permalink
By the way, there is a series of articles about concurrency on ACM Queue
which may be interesting for those participating in or just following
this discussion:

http://queue.acm.org/listing.cfm?item_topic=Concurrency&qc_type=theme_list&filter=Concurrency&page_title=Concurrency

Here is one introductory paragraph from one of the articles:

Parallel programming poses many new challenges to the developer, one of
which is synchronizing concurrent access to shared memory by multiple
threads. Programmers have traditionally used locks for synchronization,
but lock-based synchronization has well-known pitfalls. Simplistic
coarse-grained locking does not scale well, while more sophisticated
fine-grained locking risks introducing deadlocks and data races.
Furthermore, scalable libraries written using fine-grained locks cannot
be easily composed in a way that retains scalability and avoids deadlock
and data races.
--
Piet van Oostrum <***@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: ***@vanoostrum.org
Seamus MacRae
2009-06-08 18:00:41 UTC
Permalink
Post by Piet van Oostrum
By the way, there is a series of articles about concurrency on ACM Queue
which may be interesting for those participating in or just following
http://queue.acm.org/listing.cfm?item_topic=Concurrency&qc_type=theme_list&filter=Concurrency&page_title=Concurrency
Parallel programming poses many new challenges to the developer, one of
which is synchronizing concurrent access to shared memory by multiple
threads. Programmers have traditionally used locks for synchronization,
but lock-based synchronization has well-known pitfalls. Simplistic
coarse-grained locking does not scale well, while more sophisticated
fine-grained locking risks introducing deadlocks and data races.
Furthermore, scalable libraries written using fine-grained locks cannot
be easily composed in a way that retains scalability and avoids deadlock
and data races.
Is that the one about transactional memory?
Piet van Oostrum
2009-06-09 06:00:12 UTC
Permalink
Post by Piet van Oostrum
By the way, there is a series of articles about concurrency on ACM Queue
which may be interesting for those participating in or just following
http://queue.acm.org/listing.cfm?item_topic=Concurrency&qc_type=theme_list&filter=Concurrency&page_title=Concurrency
Parallel programming poses many new challenges to the developer, one of
which is synchronizing concurrent access to shared memory by multiple
threads. Programmers have traditionally used locks for synchronization,
but lock-based synchronization has well-known pitfalls. Simplistic
coarse-grained locking does not scale well, while more sophisticated
fine-grained locking risks introducing deadlocks and data races.
Furthermore, scalable libraries written using fine-grained locks cannot
be easily composed in a way that retains scalability and avoids deadlock
and data races.
SM> Is that the one about transactional memory?
Yes
--
Piet van Oostrum <***@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: ***@vanoostrum.org
Jon Harrop
2009-06-08 00:22:00 UTC
Permalink
Post by Lew
Post by Jon Harrop
I agree entirely but my statements were about parallelism and not
concurrency. Parallel and concurrent programming have wildly different
characteristics and solutions. I don't believe shared mutable state is
overly problematic in the context of parallelism. Indeed, I think it is
usually the best solution in that context.
Interesting distinction. Would it be fair to compare concurrent
programming to the bricks used to build the parallel program's edifice?
Concurrent programming certainly underpins the foundations of almost all
parallel programs. Not least at the level of the OS scheduling the threads
than run the parallel programs. However, that knowledge is probably more
confusing than helpful here.

In essence, concurrent programming is concerned with reducing latency (e.g.
evading blocking) by interleaving computations whereas parallel programming
is concerned with maximizing throughput by performing computations at the
same time.

Historically, concurrency has been of general interest on single core
machines in the context of operating systems and IO and has become more
important recently due to the ubiquity of web programming. Parallelism was
once only important to computational scientists programming shared-memory
supercomputers and enterprise developers programming distributed-memory
clusters but the advent of multicore machines on the desktop and in the
games console has pushed parallelism into the lime light for ordinary
developers when performance is important.

Solutions for concurrent and parallel programming are also wildly different.
Concurrent programming typically schedules work items that are expected to
block on a shared queue for a pool of dozens or even hundreds of threads.
Parallel programming typically schedules work items that are expected to
not block on wait-free work-stealing queues for a pool of "n" threads
where "n" is the number of cores. Solutions for concurrent programming
(such as the .NET thread pool and asynchronous workflows in F#) can be used
as a poor man's solution for parallel programming but the overheads are
huge because they were not designed for this purpose so performance is much
worse than necessary. Solutions for parallel programming (e.g. Cilk, the
Task Parallel Library) are virtually useless for concurrent programming
because you quickly end up with all "n" threads blocked and the whole
program stalls.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Patricia Shanahan
2009-06-08 00:31:26 UTC
Permalink
Jon Harrop wrote:
...
Post by Jon Harrop
Historically, concurrency has been of general interest on single core
machines in the context of operating systems and IO and has become more
important recently due to the ubiquity of web programming. Parallelism was
once only important to computational scientists programming shared-memory
supercomputers and enterprise developers programming distributed-memory
clusters but the advent of multicore machines on the desktop and in the
games console has pushed parallelism into the lime light for ordinary
developers when performance is important.
...

Parallelism has also been important, for a long time, to multiprocessor
operating system developers. I got my first exposure to parallel
programming, in the 1980's, working on NCR's VRX operating system.

Patricia
toby
2009-06-09 17:47:11 UTC
Permalink
Post by Jon Harrop
Post by Arved Sandstrom
Post by Jon Harrop
I see no problem with mutable shared state.
In which case, Jon, you're in a small minority.
No. Most programmers still care about performance
Frequently when they shouldn't.
Post by Jon Harrop
and performance means
mutable state.
Hm, not sure Erlangers would wholly agree.
Post by Jon Harrop
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.http://www.ffconsultancy.com/?u
Jon Harrop
2009-06-09 18:59:41 UTC
Permalink
Post by toby
Post by Jon Harrop
Post by Arved Sandstrom
Post by Jon Harrop
I see no problem with mutable shared state.
In which case, Jon, you're in a small minority.
No. Most programmers still care about performance
Frequently when they shouldn't.
I disagree. A lot of software is still far too slow because the programmers
failed to pay attention to performance. Blogspot in Firefox being one
example: it can barely keep up with my typing!
Post by toby
Post by Jon Harrop
and performance means mutable state.
Hm, not sure Erlangers would wholly agree.
Erlang is about concurrency. This thread is about parallelism.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
George Neuner
2009-06-09 20:07:07 UTC
Permalink
On Tue, 9 Jun 2009 10:47:11 -0700 (PDT), toby
Post by toby
Post by Jon Harrop
performance means mutable state.
Hm, not sure Erlangers would wholly agree.
Erlang uses quite a bit of mutable state behind the scenes ... the
programmers just don't see it.

George
Dimiter "malkia" Stanev
2009-06-09 21:28:54 UTC
Permalink
Post by George Neuner
Erlang uses quite a bit of mutable state behind the scenes ... the
programmers just don't see it.
George
Heh... "The CPUs use quite a bit of mutable state behind the scenes ...
the programmers just don't see it."

Actually with CPU they see it more, than... say Erlang (that's why you
need to use fences/barriers/locked access here and there).
Scott Burson
2009-06-05 21:24:26 UTC
Permalink
Post by Xah Lee
• Why Must Software Be Rewritten For Multi-Core Processors?
 http://xahlee.org/UnixResource_dir/writ/multi-core_software.html
plain text version follows.
--------------------------------------------------
Why Must Software Be Rewritten For Multi-Core Processors?
Xah Lee, 2009-06-04
I had a revelation today, namely, that it is necessary to rewrite
software to use multi-processor in order to benefit from it.
This may sound stupid, but is a revelation to me. For the past decade,
the question has been on my mind, about why should software needs to
be rewritten to take advantage of multi-processors. Because, in my
mind, i thought that software are at some fundamental level just
algorithms, and algorithms, have nothing to do with hardware
implementation aspects such as number of processors. I always felt,
that those talks about the need or difficulty of rewriting software
for multi-processor (or multi-core these days) must be the product of
idiocy of industrial imperative coding monkies. In particular, some
languages such as java, the way they deal with it, seems to me
extremely stupid. e.g. the concept of threads. In my mind, there
should be a layer between the software and the hardware, such as the
operating system, or the compiler, that should be able to
automatically handle or compile the software so that it FULLY use the
multi-processors when present. In short, i thought that a algorithm
really has nothing to do with hardware details.
I never really thought hard about this issue, but today, since i got a
quad-core PC, so i looked into the issue, and thought about it, and i
realized the answer. The gist is that, algorithm, fundamentally means
manipulating some hardware, in fact, algorithm is a step by step
instruction about some form of hardware, even the hardware may be
abstract or virtual. For example, let's say the simplest case of 1+1.
It is a algorithm, but where is the hardware? You may say it's
abstract concept, or it being a mathematical model. What you call 1+1
depends on the context, but in our context, those numbers are the
hardware. To see this, lets say more complex example of listing primes
by sieve. Again, it comes down to “what is a number”? Here, numbers
can be stones, or arrangement of beads on abacus, it's hardware! As
another example, say sorting. To begin with, you have to have some
something to sort, that's hardware.
Another way to see this is that, for a given computing problem, there
are infinite number of algorithms to achieve the same thing. Some,
will be better ones, requiring less steps, or requiring less storage.
All these are concrete manipulation issues, and the thing being
manipulated, ultimately we have to call it hardware. So, when hardware
changes, say from one cpu to multi-cpu, there's no way for the
algorithm to magically change and adopt the changed hardware. If you
need a algorithm that is catered to the new hardware, you need a new
algorithm.
One might think that there might be algorithm Omega, such that it
takes input of old hardware O, new hardware N, and a algorithm A, and
output a algorithm B, such that B has the same behavior as A, but N+B
performs better than O+A. This is like asking for Strong AI.
One thing we often forgot is that algorithms ultimately translates to
manipulating hardware. In a modern digital computer, that means
software algorithms ultimately becomes machine instructions in CPU,
which manipulate the 1s and 0s in register, or electricity voltage in
transisters.
In a more mundane point of view, a automatic system for software to
work on multi-processors is a problem of breaking a problem into
discrete units (so that they can be computed in parallel). The problem
of finding a algorithm is entirely different from the problem of
breaking a problem into distinct units. The problem of breaking a
problem into distinct units is a entire new branch of mathematics. For
example, let's say factoring. Factoring is a well defined mathematical
problem. There are millions algorithms to do it, each class has
different properties such as number of steps or storage units.
However, if we look at these algorithms from the point of view of
distinct units, it's a new perspective on classification of
algorithms. Software are in general too ill-defined and fuzzy and
complex, the software we use on personal computers such as email,
browsers, games, don't even have mathematical models. They don't even
have mathematical models of their inputs and outputs. To talk about
automatic system of unitizing software, would be more like a AI
fantasy. Roughly, a term that describes this aspect of research is
Parallel computing.
In the Wikipedia article, it talks about types of parallelism: Bit-
level parallelism, Instruction-level parallelism, Data parallelism,
multicore, symmetric multiprocessing, distributed computing, cluster,
grid. The subjects mentioned there more close to this essay are “data
parallelism” and “task parallelism”. However, neither are high level
enough as i discussed here. The issue discussed here would be like
“algorithmic parallelism”. Indeed, Wikipedia mentioned “Automatic
    Automatic parallelization of a sequential program by a compiler is
the holy grail of parallel computing. Despite decades of work by
compiler researchers, automatic parallelization has had only limited
success.[40]
    Mainstream parallel programming languages remain either explicitly
parallel or (at best) partially implicit, in which a programmer gives
the compiler directives for parallelization. A few fully implicit
parallel programming languages exist — SISAL, Parallel Haskell, and
(for FPGAs) Mitrion-C.
It says “A few fully implicit parallel programming languages exist”.
If you are a comp lang nutcase, don't get carried away by what those
words might seem to mean.
(Note: Wikipedia has a dedicated article on the subject: Automatic
parallelization)
  Xah
∑http://xahlee.org/

Continue reading on narkive:
Loading...