On Fri, May 9, 2008 at 11:00 PM, Till Varoquaux wrote:
> First of all, let's try to stop the squabbling and have some actual
> discussions with actual content (trolling is very tempting and I am the
> first to fall for it). OCaml is extremely nice but not perfect. Other
> languages have other tradeoffs, and INRIA is not here to fulfill all
> our desires.
>
> On Fri, May 9, 2008 at 9:40 PM, Gerd Stolpmann wrote:
> >
> > On Friday, 09.05.2008, 19:10 +0100, Jon Harrop wrote:
> >> On Friday 09 May 2008 12:12:00 Gerd Stolpmann wrote:
> >> > I think the parallelism capabilities are already excellent. We have
> >> > been able to implement the application backend of Wink's people
> >> > search in O'Caml, and it is of course a highly parallel system of
> >> > programs. This is not the same class raytracers or desktop
> >> > parallelism fall into - this is highly professional supercomputing.
> >> > I'm talking about a cluster of ~20 computers with something like
> >> > 60 CPUs.
> >> >
> >> > Of course, we did not use multithreading very much. We are relying
> >> > on multi-processing (both "fork"ed style and separately started
> >> > programs), and multiplexing (i.e. application-driven
> >> > micro-threading). I especially like the latter: Doing multiplexing
> >> > in O'Caml is fun, and a substitute for most applications of
> >> > multithreading. For example, you want to query multiple remote
> >> > servers in parallel: Very easy with multiplexing, whereas the
> >> > multithreaded counterpart would quickly run into scalability
> >> > problems (threads are heavy-weight, and need a lot of resources).
> >>
> >> If OCaml is good for concurrency on distributed systems that is
> >> great, but it is completely different to CPU-bound parallelism on
> >> multicores.
> >
> > You sound like somebody who tries to sell hardware :-)
> >
> > Well, our algorithms are quite easy to parallelize. I don't see a
> > difference in whether they are CPU-bound or disk-bound - we also have
> > lots of CPU-bound stuff, and the parallelization strategies are the
> > same.
> >
> > The important thing is whether the algorithm can be formulated in a
> > way so that state mutations are rare, or can at least be done in a
> > "cache-friendly" way. Such algorithms exist for a lot of problems. I
> > don't know which problems you want to solve, but it sounds as if they
> > were special problems. Like for most industries, most of our problems
> > are simply "do the same for N objects" where N is very large, and
> > sometimes "sort data", also for large N.
> >
> >> > In our case, the mutable data structures that count are on disk.
> >> > Everything else is only temporary state.
> >>
> >> Exactly. That is a completely different kettle of fish to writing
> >> high performance numerical codes for scientific computing.
> >
> > I don't understand. Relying on disk for sharing state is a big
> > problem for us, but unavoidable. Disk is slow memory with a very
> > special timing. Experience shows that even accessing state over the
> > network is cheaper than over disk. Often, we end up designing our
> > algorithms around the disk access characteristics. Compared to that,
> > access to RAM-backed state over the network is fast and easy.
>
> shm_open shares memory through file descriptors and, under
> linux/glibc, this is done using /dev/shm. You can mmap this as a
> bigarray and, voila, shared memory. This is quite nice for numerical
> computation, plus you get closures etc... in your forks.
> Oh and COW on modern OS's makes this very cheap.

Yes, that's the kind of approach I like (a rough sketch of the shared
Bigarray plus fork pattern is at the end of this message).

- Do not forget to do a Gc.compact before forking, to avoid collecting
  the same unreachable data in each fork.

- For sharing complex data, you can marshal into a shared Bigarray. If
  the speed of Marshal becomes a bottleneck, a specialized Marshal that
  skips most of the checks and the byte-oriented, compact serialization
  that extern.c currently does could speed things up.

- A means for inter-process synchronization/communication is still
  needed. A userland solution using a shared-memory consensus algorithm
  (which would probably require some C or assembly for atomic
  operations) could be cheap.

-- Berke
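
P.S. A minimal, illustrative sketch of the pattern referred to above:
map a file under /dev/shm as a float Bigarray, Gc.compact, then fork
workers that each fill a disjoint slice of the shared array. It assumes
a recent OCaml where Unix.map_file is available (older releases exposed
Bigarray.Array1.map_file for the same job); the /dev/shm path, worker
count and array size are arbitrary choices for the example.

  let n = 1_000_000

  let () =
    (* Backing file in /dev/shm so the mapping lives in RAM. *)
    let path = "/dev/shm/ocaml_shared_demo" in
    let fd = Unix.openfile path [Unix.O_RDWR; Unix.O_CREAT] 0o600 in
    (* Shared mapping: writes in the children are visible to the parent.
       map_file grows the backing file to the requested size. *)
    let a =
      Bigarray.array1_of_genarray
        (Unix.map_file fd Bigarray.float64 Bigarray.c_layout true [| n |])
    in
    (* Compact before forking so each child does not re-trace the same
       unreachable data. *)
    Gc.compact ();
    let nworkers = 4 in
    let chunk = n / nworkers in
    for w = 0 to nworkers - 1 do
      match Unix.fork () with
      | 0 ->
          (* Child: fill its own slice of the shared array, then exit. *)
          let lo = w * chunk and hi = (w + 1) * chunk - 1 in
          for i = lo to hi do a.{i} <- 2.0 *. float_of_int i done;
          exit 0
      | _pid -> ()
    done;
    (* Parent: wait for all children, then read the shared results. *)
    for _ = 1 to nworkers do ignore (Unix.wait ()) done;
    Printf.printf "a.{42} = %f\n" a.{42};
    Unix.close fd;
    Sys.remove path

Because the mapping is shared, the children's writes reach the parent
without any copying, and COW keeps the fork itself cheap; the only
synchronization here is the final Unix.wait.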