Discussion:
Functional programming and large objects
kj
2010-10-25 00:28:41 UTC
I don't use functional programming much in my work, and in fact I
know little about it. But I find the idea of controlling and
minimizing side effects attractive. (It is my understanding that
this is one of the features of functional programming.)

I am currently in the early stages of implementing a data
management/analysis/visualization system. This system is supposed
to make it easy for users to manage and manipulate large datasets
(typically dense multidimensional arrays of measurements).

I would like to design the system so that these large data objects
are immutable once created (the reason for this is that I think it
will be easier to implement a system that maintains data integrity
if I minimize the mutability of these objects). Making these
typically large objects immutable means that all manipulations of
these objects (e.g. extracting data subsets, transforming datasets,
combining datasets) would all require creating new objects, rather
than transforming existing objects in-place.

One major concern, of course, is that such a scheme would require
a lot of memory. At minimum it would require pretty aggressive
garbage collection. What other techniques are used in functional
programming to cope with such heightened demands on memory?

TIA!

~kj
Ertugrul Söylemez
2010-10-25 02:40:18 UTC
Post by kj
I would like to design the system so that these large data objects are
immutable once created [...] Making these typically large objects
immutable means that all manipulations of these objects (e.g. extracting
data subsets, transforming datasets, combining datasets) would all
require creating new objects, rather than transforming existing objects
in-place.
One major concern, of course, is that such a scheme would require a
lot of memory. At minimum it would require pretty aggressive garbage
collection. What other techniques are used in functional programming
to cope with such heightened demands on memory?
Of course in functional programming you can also use mutable data
structures, and in fact, if you want this to be an array you have little
choice here. Immutable arrays will work, but as you correctly noted,
that will blow the memory.

If your language is suitably pure and lazy, the easiest way to get this
without blowing the memory is to use another data structure. For
example, in Haskell there is Map for dictionaries. Since Haskell is
pure, there is no problem with functions returning new Maps, because
only the changes take new memory and the unused parts of the structure
are garbage-collected quickly.
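
To make the sharing concrete, here is a rough sketch (assuming the
containers package; the numbers are only for illustration):

    import qualified Data.Map as M

    -- Build a large map once, then "modify" it: each update returns a
    -- new Map that shares all but O(log n) nodes with the old one, so
    -- keeping old versions alive costs almost nothing extra.
    demo :: IO ()
    demo = do
        let m0 = M.fromList [(k, k * k) | k <- [1 .. 1000000 :: Int]]
            m1 = M.insert 42 0 m0   -- new map; only the path to key 42 is fresh
            m2 = M.delete 7 m1      -- again, only the changed path is new
        print (M.lookup 42 m0, M.lookup 42 m1)   -- m0 is untouched
        print (M.size m2)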

In non-lazily evaluated languages I would probably use mutable data
structures.


Greets,
Ertugrul
--
nightmare = unsafePerformIO (getWrongWife >>= sex)
http://ertes.de/
Jon Harrop
2010-11-01 22:36:08 UTC
Post by Ertugrul Söylemez
Of course in functional programming you can also use mutable data
structures, and in fact, if you want this to be an array you have little
choice here. Immutable arrays will work, but as you correctly noted,
that will blow the memory.
If your language is suitably pure and lazy, the easiest way to get this
without blowing the memory is to use another data structure. For
example, in Haskell there is Map for dictionaries. Since Haskell is
pure, there is no problem with functions returning new Maps, because
only the changes take new memory and the unused parts of the structure
are garbage-collected quickly.
Actually the most common purely functional data structures including Map
benefit relatively little from non-strict evaluation and you will find
equivalent collections bundled with almost all strictly-evaluated functional
languages. For example, the Map modules in SML, OCaml and F#.

There are some more advanced data structures (e.g. real-time queues) that
require laziness in parts to achieve optimal asymptotic efficiency. For more
information, read Okasaki's excellent monograph "Purely functional data
structures".

Cheers,
Jon.
Ertugrul Söylemez
2010-11-06 16:24:50 UTC
Post by Jon Harrop
Actually the most common purely functional data structures including
Map benefit relatively little from non-strict evaluation and you will
find equivalent collections bundled with almost all strictly-evaluated
functional languages. For example, the Map modules in SML, OCaml and
F#.
There is a difference between lazy evaluation and non-strict semantics.
And yes, Map benefits a lot from the former, not so much from the
latter. I use Map a lot to manipulate large databases, usually together
with a concurrent thread that manages them. It has the expected
asymptotic behaviour, even though it's immutable.
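
As a rough sketch of that pattern (invented names, assuming the stm and
containers packages): a single manager thread owns all updates, and
readers just take whatever immutable snapshot is current, which can
never change underneath them.

    import Control.Concurrent (forkIO)
    import Control.Concurrent.STM
    import Control.Monad (forever, void)
    import qualified Data.Map as M

    type DB = M.Map String Int

    -- The manager thread applies update requests one at a time,
    -- publishing a new immutable Map after each one.
    startManager :: TChan (DB -> DB) -> TVar DB -> IO ()
    startManager requests db =
        void . forkIO . forever . atomically $ do
            f <- readTChan requests   -- wait for the next update request
            m <- readTVar db
            writeTVar db (f m)        -- install the new snapshot

    -- A reader sees one consistent Map; later updates never disturb it.
    snapshot :: TVar DB -> IO DB
    snapshot = readTVarIO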


Greets,
Ertugrul
--
nightmare = unsafePerformIO (getWrongWife >>= sex)
http://ertes.de/
Dirk Thierbach
2010-10-25 08:27:23 UTC
Post by kj
I don't use functional programming much in my work, and in fact I
know little about it. But I find the idea of controlling and
minimizing side effects attractive. (It is my understanding that
this is one of the features of functional programming.)
You can of course also do that in an imperative language.
Post by kj
I am currently in the early stages of implementing a data
management/analysis/visualization system. This system is supposed
to make it easy for users to manage and manipulate large datasets
(typically dense multidimensional arrays of measurements).
I would like to design the system so that these large data objects
are immutable once created [...] Making these typically large
objects immutable means that all manipulations of these objects
(e.g. extracting data subsets, transforming datasets, combining
datasets) would all require creating new objects, rather than
transforming existing objects in-place.
One major concern, of course, is that such a scheme would require
a lot of memory. At minimum it would require pretty aggressive
garbage collection.
Yes. Functional programs typically generate a lot of short-lived
objects and fewer long-lived ones, so functional language
implementations tend to use generational garbage collectors.

Not sure if this applies to your situation.
Post by kj
What other techniques are used in functional programming to cope
with such heightened demands on memory?
Sharing. Immutability means that the original object won't be
changed, so if you construct a new object, you can refer safely
to the original object (or parts of it).

For example, you could implement simple subset or transformation
operations on the fly, without allocating new memory for them.
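
For instance, a "view" can be nothing more than index arithmetic over
the original data, so subsets and elementwise transforms allocate
essentially nothing (a minimal sketch with made-up names):

    -- A dataset view: an extent plus an index function over immutable data.
    data View = View { extent :: Int, index :: Int -> Double }

    -- O(1) subset: no data is copied, only the index arithmetic changes.
    subset :: Int -> Int -> View -> View
    subset off len v = View len (\i -> index v (off + i))

    -- O(1) elementwise transform: work happens only when elements are read.
    transform :: (Double -> Double) -> View -> View
    transform f v = View (extent v) (f . index v)

    -- Materialise only when a real contiguous copy is needed.
    materialise :: View -> [Double]
    materialise v = [ index v i | i <- [0 .. extent v - 1] ]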

Or you could split up your measurement arrays into chunks, have them
garbage collected separately, and share chunks between objects where
appropriate.

Again, not sure if this is useful in your situation.

- Dirk
Torben Ægidius Mogensen
2010-10-25 09:08:46 UTC
Post by kj
I would like to design the system so that these large data objects
are immutable once created
One major concern, of course, is that such a scheme would require
a lot of memory.
There are several answers to this:

- Enforce that only one copy of the large data is live at any one time.
Haskell uses monads for this, but you can also use uniqueness types (like
Clean does). Basically, this allows the implementation to keep only
one copy that is destructively updated. Conceptually, you are making
a new data structure while retaining the old, but since the old data
structure is dead, you can use the same space for the new data
structure (and only modify the changed parts).

- Use data structures that allow easy sharing of parts. Instead of
representing an array like a large contiguous block of memory,
represent it as a tree structure or as an array of pointers to
smaller arrays. When you make a new updated copy of the data
structure, share as much as possible. For example, if you represent
a 1000-element array of integers as a 10-element array of pointers to
10-element arrays of pointers to 10-element arrays of integers
(basically a 10x10x10 array), you only need to copy 29 values when
making a random update, compared to copying 999 elements if you
represent the array as contiguous memory. The cost is that access to
a random element requires three memory loads instead of one. (A rough
sketch of this chunking idea appears right after this list.)

- Use an update log (a second sketch appears further below). Basically,
you keep the original array unchanged
and let the new copy be a pointer to the old array and a list of
changes to this. When accessing an element, first look through the
update log and then to the original structure. An update just adds
an element to the update log. By keeping the update log as a list,
update can be done in constant time, but access will take time
proportional to the number of updates. If the update log gets too
long, you can make a new updated copy of the data structure.
Basically, every time you follow a pointer in the update log (whether
for reading or writing), you make a new copy with probability 1/n,
where n is the size of the array. This way you amortize the copying
over (on average) n accesses and make the average access time linear
in the size (though with a bad worst case and a high constant
factor).

- Use an inverted update log. This is like above, but when you make an
update, you update the actual array and add the update log to the old
copy. This requires an extra indirection, but exploits the
observation that there will typically be more accesses to the new
copy than to the old.
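
A rough sketch of the chunked representation from the second point (two
levels instead of three, lists instead of arrays, and with invented
names, but the sharing works the same way): an update copies one chunk
plus the spine and shares every other chunk with the old version.

    chunkSize :: Int
    chunkSize = 10

    -- A spine of chunks; a real system would use arrays for the spine
    -- and the chunks, but the sharing behaviour is the same.
    newtype Chunked a = Chunked [[a]]

    fromListC :: [a] -> Chunked a
    fromListC = Chunked . chunksOf
      where
        chunksOf [] = []
        chunksOf ys = take chunkSize ys : chunksOf (drop chunkSize ys)

    indexC :: Chunked a -> Int -> a
    indexC (Chunked cs) i = (cs !! (i `div` chunkSize)) !! (i `mod` chunkSize)

    -- Copies one chunk and the spine; all other chunks are shared
    -- between the old and the new value.
    updateC :: Int -> a -> Chunked a -> Chunked a
    updateC i x (Chunked cs) =
        Chunked [ if ci == q then replaceAt r x c else c
                | (ci, c) <- zip [0 ..] cs ]
      where
        (q, r) = i `divMod` chunkSize
        replaceAt j y ys = take j ys ++ y : drop (j + 1) ys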

By combining the methods above, you can make the worst-case access and
update time O(log(n)) and the average access time O(1), though with
fairly substantial constant factors for both.
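
And a matching sketch of the update-log idea from the third point
(again invented names, using Data.Array for the base): updates are O(1)
conses, reads scan the log first, and a compaction step folds the log
back into a fresh array when it gets too long (triggered, for example,
by the 1/n coin flip described above).

    import Data.Array (Array, (!), (//))

    -- The "new" version is the old immutable array plus a list of
    -- pending (index, value) changes, newest first.
    data Logged a = Logged (Array Int a) [(Int, a)]

    -- O(1) update: just prepend to the log.
    updateL :: Int -> a -> Logged a -> Logged a
    updateL i x (Logged base changes) = Logged base ((i, x) : changes)

    -- Reads scan the log first (newest change wins), then the base array.
    readL :: Logged a -> Int -> a
    readL (Logged base changes) i = maybe (base ! i) id (lookup i changes)

    -- Fold the log back into a fresh base array so reads are O(1) again;
    -- oldest change is applied first, so the newest one wins.
    compact :: Logged a -> Logged a
    compact (Logged base changes) = Logged (foldr apply base changes) []
      where
        apply (i, x) arr = arr // [(i, x)]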

Torben