caml-list - the Caml user's mailing list
From: Gerd Stolpmann <info@gerd-stolpmann.de>
To: Edgar Friendly <thelema314@gmail.com>
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] SMP multithreading
Date: Tue, 16 Nov 2010 18:04:02 +0100	[thread overview]
Message-ID: <1289927042.16005.176.camel@thinkpad> (raw)
In-Reply-To: <4CE228CA.3030503@gmail.com>

On Monday, 15.11.2010, at 22:46 -0800, Edgar Friendly wrote:
> On 11/15/2010 09:27 AM, Wolfgang Draxinger wrote:
> > Hi,
> >
> > I've just read
> > http://caml.inria.fr/pub/ml-archives/caml-list/2002/11/64c14acb90cb14bedb2cacb73338fb15.en.html
> > in particular this paragraph:
> > | What about hyperthreading?  Well, I believe it's the last convulsive
> > | movement of SMP's corpse :-)  We'll see how it goes market-wise.  At
> > | any rate, the speedups announced for hyperthreading in the Pentium 4
> > | are below a factor of 1.5; probably not enough to offset the overhead
> > | of making the OCaml runtime system thread-safe.
> >
> > This reads just like "640k ought to be enough for everyone". Multicore
> > systems are the standard today. Even the cheapest consumer machines
> > come with at least two cores. One can easily get 6-core machines today.
> >
> > Still thinking SMP was a niche and was dying?
> >
> > So, what're the developments regarding SMP multithreading OCaml?
> >
> >
> > Cheers
> >
> > Wolfgang
> >
> At the risk of feeding a (possibly unintentional) troll, I'd like to 
> share some possibly new thoughts on this ever-living topic.
> 
> It looks like high-performance computing of the near future will be 
> built out of many machines (message passing), each with many cores 
> (SMP).  One could use message passing for all communication in such a 
> system, but a hybrid approach might be best for this architecture, with 
> use of shared memory within each box and message passing between.  Of 
> course the best choice depends strongly on the particular task.
> 
> In the long run, it'll likely be a combination of a few large, powerful 
> cores (Intel-CPU style w/ the capability to run a single thread as fast 
> as possible) with many many smaller compute engines (GPGPUs or the like, 
> optimized for power and area, closely coupled with memory) that provides 
> the highest performance density.
> 
> The question of how to program such an architecture seems as if it's 
> being answered without the functional community's input. What can we 
> contribute?

Yes, that's generally the right question. Current hardware is a kind of
experiment: vendors have taken the multicore path mainly because it is
currently the easiest way to increase the performance potential,
although it is questionable whether non-server applications can really
benefit from it (server applications are left aside here because
parallelization is relatively easy to achieve for them). Future
hardware will probably differ even more, but it is still unclear which
design paths will be taken. It could be manycores (many CPUs with
non-uniform memory), or it could be specialized compute units. Maybe
we'll again see a separation of the consumer and datacenter markets,
with the former optimizing for numeric simulation workloads (i.e.
games) and the latter for high-throughput data paths and parallel CPU
power. The problem is that all of this is speculation.

There are some things we can do to improve the situation (and some
ideas that are not realistic):

      * A probably not-so-difficult improvement would be better message
        passing between independent but local processes. I've started
        an experiment with such a mechanism, Netcamlbox
        (http://projects.camlcity.org/projects/dl/ocamlnet-3.0.3/doc/html-main/Netcamlbox.html),
        which exploits the fact that GC-managed memory has a well-known
        structure. With more help from the GC this could be made even
        better (safer, fewer corner cases). (A small process-based
        message-passing sketch follows after this list.)
      * We need more frameworks for parallel programming. I'm currently
        developing Plasma, a Map/Reduce framework. Using a framework
        has the big advantage that the whole program is structured so
        that it profits from parallelization, and that developers who
        have no background in parallelization can be trained to use it.
        There are probably more algorithmic schemes for which this is
        possible. (A toy illustration of the Map/Reduce programming
        model follows after this list.)
      * I have a lot of doubts whether FP languages will ever run well
        on SMP machines with a larger number of cores. The problem is
        the relatively high memory allocation rate: the GC has to work
        a lot harder than in imperative languages. The OC4MC project
        uses thread-local minor heaps for exactly this reason. Probably
        this is not enough, and one even needs thread-local major heaps
        (plus a third generation for values accessed by several
        threads). All in all, you could get the same effect by
        instantiating the OCaml runtime several times (if that were
        possible), letting each runtime run in its own thread, and
        providing some extra functionality for passing values between
        threads and for sharing values. This would not be exactly the
        SMP model, but it would allow a number of parallelization
        techniques, and it is probably future-proof since it encourages
        message passing over sharing. This is certainly worth
        experimenting with.
      * One can also tackle the problem from the multi-processing side:
        provide better mechanisms for message passing (see above) and
        for value sharing. (That's probably the path I'll follow for my
        own experiments: no modifications of the runtime, just tricks
        with the OS.)
      * As somebody mentioned "implicit parallelization": don't expect
        anything from this. Even if a good compiler finds ways to
        parallelize 20% of the code (which would already be a lot), the
        runtime effect would be marginal: the other 80% still runs at
        normal speed and dominates the overall runtime. This is
        essentially Amdahl's law (see the small calculation after this
        list). The point is that such compiler-driven improvements are
        only local optimizations. To get good parallelization results
        you need to restructure the design of the program - well, maybe
        compiler 2.0 can do this some day, but that is not in sight.
      * Looking for more "automatic" speedups: I would rather look at
        parallelizing parts of the GC (e.g. a parallel sweep phase),
        but this will probably run into the memory bandwidth limit
        quickly. Maybe using 2 cores for the GC would give an
        improvement, while more cores would gain nothing extra. At
        least it's worth an experiment.
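
To make the message-passing points above more concrete, here is a
minimal sketch of handing an OCaml value from a worker process back to
its parent. It deliberately does not use Netcamlbox (which places
values in shared memory); instead it copies the value through a pipe
with the standard Unix and Marshal modules, and the helper name
run_in_child is mine, purely for illustration:

  (* Minimal sketch: a forked child computes a value and sends it to
     the parent over a pipe via Marshal. The value is copied; a
     shared-memory mechanism like Netcamlbox avoids this copy, which
     this sketch does not attempt. Link with unix.cma. *)
  let run_in_child (f : unit -> 'a) : 'a =
    let (rd, wr) = Unix.pipe () in
    match Unix.fork () with
    | 0 ->
        (* child: compute, marshal the result, and exit *)
        Unix.close rd;
        let oc = Unix.out_channel_of_descr wr in
        Marshal.to_channel oc (f ()) [];
        close_out oc;
        exit 0
    | pid ->
        (* parent: read the marshalled result, then reap the child *)
        Unix.close wr;
        let ic = Unix.in_channel_of_descr rd in
        let result = Marshal.from_channel ic in
        close_in ic;
        ignore (Unix.waitpid [] pid);
        result

  let () =
    let squares =
      run_in_child (fun () -> List.map (fun x -> x * x) [1; 2; 3])
    in
    List.iter (Printf.printf "%d ") squares;
    print_newline ()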
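
For the framework point, the shape of a Map/Reduce program can be shown
in a few lines. This is only a sequential toy that illustrates the
programming model an application has to adopt; it says nothing about
how Plasma itself distributes the work, and all function names below
are made up for the example:

  (* Toy, sequential Map/Reduce: the application only supplies the
     map and reduce functions; distributing the pieces over processes
     or machines is the framework's job. *)
  module StrMap = Map.Make (String)

  (* map: turn one input record into (key, value) pairs *)
  let map_word_count (words : string list) : (string * int) list =
    List.map (fun w -> (w, 1)) words

  (* reduce: combine all values that share the same key *)
  let reduce_word_count (_word : string) (counts : int list) : int =
    List.fold_left ( + ) 0 counts

  let map_reduce map reduce input =
    let add_pair acc (k, v) =
      let old = try StrMap.find k acc with Not_found -> [] in
      StrMap.add k (v :: old) acc
    in
    let grouped =
      List.fold_left
        (fun acc record -> List.fold_left add_pair acc (map record))
        StrMap.empty input
    in
    StrMap.mapi reduce grouped

  let () =
    let result =
      map_reduce map_word_count reduce_word_count
        [ ["to"; "be"; "or"; "not"; "to"; "be"]; ["to"; "do"; "is"] ]
    in
    StrMap.iter (Printf.printf "%s: %d\n") result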
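
And to back up the remark about implicit parallelization: Amdahl's law
says that if a fraction p of the runtime is parallelized over n cores,
the overall speedup is at most 1 / ((1 - p) + p / n). For p = 0.2 this
is about 1.18 on 4 cores and can never exceed 1.25, however many cores
are used (amdahl_speedup is just a name I chose for the calculation):

  (* Upper bound on the speedup when a fraction p of the runtime is
     parallelized over n cores (Amdahl's law). *)
  let amdahl_speedup ~p ~n = 1.0 /. ((1.0 -. p) +. p /. float_of_int n)

  let () =
    List.iter
      (fun n ->
         Printf.printf "%4d cores: %.2fx\n" n (amdahl_speedup ~p:0.2 ~n))
      [2; 4; 8; 1000]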

Gerd


> 
> E.
> 


-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------

