From: Eray Ozkural
To: Jon Harrop
Cc: caml-list@yquem.inria.fr
Date: Wed, 17 Nov 2010 06:27:14 +0200
Subject: Re: [Caml-list] SMP multithreading

Oh well, I'm not so surprised that fine-grained task parallelism with (presumably) a dynamic load-balancing strategy doesn't get much speedup. Doing HPC in Haskell is a bit like writing parallel programs in Java: you might as well use a C-64 and Commodore BASIC. And yes, some people do use Java with MPI; Java people have benchmarks too :) But for some reason I had difficulty using Java and Haskell even on medium-sized problems.

On the other hand, a lot more can be achieved with a parallelizing compiler that uses profiling and static analysis. As I said, good results can be achieved even in C; I've seen it done, so I know it's doable for OCaml, it's just a difficult kind of compiler to write. The functional features would expose more concurrency.

At any rate, implicit parallelism isn't the same thing as a parallelizing compiler; it's better, because you would be working with primitives that the compiler knows to its heart. That combines the best of both worlds, I think, because parallelizing compilers obviously work best on the easier kinds of parallelism. Such a compiler can pull more tricks than many assume, but it would still not replace a parallel algorithm designer: you don't really expect to feed quicksort to a compiler that applies a number of heuristic transformations and get hypercube quicksort out. Still, for many problems these compilers can be made to generate pretty good code. The best part is that once you have one you can apply it to every program; it's one of the cheapest ways to get speedup, so I'd say it's worthwhile for OCaml right now. Just not the way GHC does it.
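To make "the easier kinds of parallelism" concrete, here is a minimal sketch of the sort of code an automatic parallelizer can handle almost mechanically: a map whose iterations are completely independent. The name par_map and the fixed two-way split are made up for illustration, and the stock threads library used here does not run OCaml code on two cores at once under the current runtime, so take it as an illustration of the decomposition rather than a way to get real speedup:

(* Hypothetical two-way parallel map over an array.  The iterations are
   independent, so the index range can be split freely; a parallelizing
   compiler would be expected to discover such a split from dependence
   analysis rather than have it written by hand.  Build against the
   threads library that ships with OCaml.  Note that the current runtime
   holds a master lock, so the two workers do not actually execute OCaml
   code simultaneously; the point here is only the shape of the code. *)
let par_map f a =
  let n = Array.length a in
  if n = 0 then [||]
  else begin
    let out = Array.make n (f a.(0)) in
    let worker lo hi =
      for i = lo to hi - 1 do out.(i) <- f a.(i) done
    in
    let mid = n / 2 in
    let t = Thread.create (fun () -> worker 0 mid) () in
    worker mid n;                   (* second half in the current thread *)
    Thread.join t;
    out
  end

let () =
  let a = Array.init 1_000_000 float_of_int in
  let b = par_map (fun x -> sqrt x +. sin x) a in
  Printf.printf "b.(42) = %f\n" b.(42)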
On Wed, Nov 17, 2010 at 5:47 AM, Jon Harrop <jonathandeanharrop@googlemail.com> wrote:

> Granularity and cache complexity are the reasons why not. If you find
> anything and everything that can be done in parallel and parallelize it,
> then you generally obtain only slowdowns. An essential trick is to exploit
> locality via mutation but, of course, purely functional programming sucks
> at that by design, not least because it is striving to abstract that
> concept away.
>
> I share your dream but I doubt it will ever be realized.
>
> Cheers,
>
> Jon.
>
> From: caml-list-bounces@yquem.inria.fr [mailto:caml-list-bounces@yquem.inria.fr] On Behalf Of Eray Ozkural
> Sent: 16 November 2010 23:05
> To: Gerd Stolpmann
> Cc: caml-list@yquem.inria.fr
> Subject: Re: [Caml-list] SMP multithreading
>
> On Wed, Nov 17, 2010 at 12:13 AM, Gerd Stolpmann <info@gerd-stolpmann.de> wrote:
>
> > On Tuesday, 16.11.2010, 22:35 +0200, Eray Ozkural wrote:
> > >
> > > On Tue, Nov 16, 2010 at 7:04 PM, Gerd Stolpmann wrote:
> > > > On Monday, 15.11.2010, 22:46 -0800, Edgar Friendly wrote:
> > > >
> > > >     * As somebody mentioned "implicit parallelization": Don't expect
> > > >       anything from this. Even if a good compiler finds ways to
> > > >       parallelize 20% of the code (which would be a lot), the runtime
> > > >       effect would be marginal. 80% of the code is run at normal
> > > >       speed (hopefully) and dominates the runtime behavior. The point
> > > >       is that such compiler-driven code improvements are only local
> > > >       optimizations. For getting good parallelization results you
> > > >       need to restructure the design of the program - well, maybe
> > > >       compiler 2.0 can do this at some time, but this is not in sight.
> > >
> > > I think you are underestimating parallelizing compilers.
> >
> > I was more citing Amdahl's law, and did not want to criticize any effort
> > in this area. It is more about the usefulness for the majority of
> > problems: how useful is the best parallelizing compiler if only a small
> > part of the program _can_ actually benefit from it? Think about it. If
> > you are not working in an area where many subroutines can be sped up,
> > you will consider this way of parallelizing a waste of time. And this is
> > still true for the majority of problems. Also, for many problems that
> > can be tackled, the scalability is very limited.
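To put numbers on the 20% figure: Amdahl's law bounds the overall speedup by 1 / ((1 - p) + p/n) when a fraction p of the running time is parallelized over n cores, so p = 0.2 caps the speedup at 1.25x no matter how many cores are added. A tiny OCaml snippet, included only to make that arithmetic concrete:

(* Amdahl's law: speedup when a fraction p of the running time is
   parallelized perfectly over n cores. *)
let amdahl p n = 1. /. ((1. -. p) +. p /. float_of_int n)

let () =
  List.iter
    (fun n ->
       Printf.printf "p = 0.20, %4d cores -> %.2fx speedup\n" n (amdahl 0.2 n))
    [ 2; 4; 8; 1024 ]
(* prints 1.11x, 1.18x, 1.21x and 1.25x: the serial 80% dominates *)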
> What makes you think only 20% of the code can be parallelized? I've
> worked on such a compiler project, and there were plenty of opportunities
> for parallelization in ordinary C code, let alone in a functional
> language; implicit parallelism would work wonders there. Of course you
> may think whatever you were thinking, but I know that a high degree of
> parallelism can be achieved through a functional language; I can't tell
> you much more than that, though. If you think that only a very small
> portion of the code can be parallelized, you probably do not appreciate
> the kinds of static and dynamic analysis those compilers perform. Of
> course, if you are thinking of applying it to some office or e-mail
> application, this might not be the case; automatic parallelization
> strategies work best when applied to a computationally intensive program.
>
> The really limiting factor for current functional languages would be
> their reliance on inherently sequential primitives like list processing,
> which may in some cases limit the compiler to pipeline parallelism alone
> (something non-trivial to do by hand, actually). Instead they would have
> to get their basic forms from high-level parallel languages (which might
> mean rewriting a lot of things); the programs could then look like
> something out of a category theory textbook. But I think even without
> modification you could get a lot of speedup from OCaml code by applying
> state-of-the-art automatic parallelization techniques. By the way, what
> matters is how much of the serial work is parallelized, not how much of
> the code length. Even if the parallelism is not obvious, the analysis can
> find out which iterations are parallelizable and which variables carry
> which kinds of dependencies, memory dependencies included. There is a
> public perception that automatic parallelization does not work, which is
> wrong; what is true is that the popular compiler vendors do not
> understand much in this regard - all I've seen in popular products are
> lame attempts to generate vector instructions.
>
> That is not to say that implicit parallelization of current code can
> replace parallel algorithm design (which is AI-complete). Rather, I
> think, implicit parallelization is one of the things that will help
> parallel computing people: it lets them work at a more abstract level,
> exposing more concurrency through functional forms while avoiding
> low-level communication code. High-level explicit parallelism is also
> quite favorable, as there may be many situations where, say, dynamic
> load-balancing approaches are suitable. The approach of HPF might be
> relevant here, with the programmer making annotations to guide the
> distribution of data structures and the rest inferred from the code.
>
> So whoever says that isn't possible probably hasn't read up much in the
> computer architecture community. Even expert PL researchers may be misled
> here, as they have made the quoted remark, or similar remarks, about
> multi-core/SMP architectures. It was known for a very long time (20+
> years) that clock speeds would hit a wall and that we would then have to
> spend more chip area; this is true regardless of the underlying
> architecture and technology. Mother nature is parallel; it is the
> sequence that is an abstraction. And what is more natural than expressing
> a solution to a problem in terms of functions and sets? The kind of
> non-leaky abstractions required can be provided by a type-safe functional
> language, and then the compiler has a lot more to work with than it does
> for C code, although analysis can reveal much even about C. With the
> right heuristics it might decide on good data and task distribution
> strategies, translating, say, a set constructor notation into an
> efficient parallel algorithm - why not?
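The "set constructor notation" point is easy to make concrete. A set-builder expression such as { f x | x in s, p x } is just a filter followed by a map, and each element is tested and mapped independently of the others; given purity of p and f, a compiler is free to partition s across cores. A minimal sketch of the sequential specification it would have to preserve; the function names here are made up for illustration:

(* { f x | x in s, p x } written as filter-then-map.  Per-element
   independence is exactly the property a parallelizer needs in order to
   split s into chunks and process them on separate cores. *)
let image f p s = List.map f (List.filter p s)

(* small helper to build the example input *)
let rec range a b = if a >= b then [] else a :: range (a + 1) b

let () =
  (* squares of the even numbers below 20 *)
  let result = image (fun x -> x * x) (fun x -> x mod 2 = 0) (range 0 20) in
  List.iter (Printf.printf "%d ") result;
  print_newline ()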
> What I say applies to parallelization based on analysis; for functional
> languages, other parallelization strategies are of course also possible.
> I suppose low-level task parallelization and dynamic load-balancing
> strategies (where functions or closures are distributed) are the most
> popular, the reasoning being that there can be many independent function
> evaluations in flight concurrently. I wonder if this has been attempted
> at all for OCaml? The impurity of the language can be dealt with easily.
> And is there anyone who might make suggestions to somebody who would like
> to work on automatic parallelization of OCaml code? I actually think it
> might be fun to try some of the simpler strategies and see what they
> yield on current hardware platforms like multi-core and GPU. Ideally we
> would like good automatic parallelization not just of, say, linear
> algebra code, but also of cool stuff like functional data structures.
> Recursive procedures can be mapped to threads; it might match the best
> hand-written parallelizations, actually, and you'd get platform
> independence as a bonus if the compiler is well-designed :)
>
> Best,
>
> --
> Eray Ozkural, PhD candidate. Comp. Sci. Dept., Bilkent University, Ankara
> http://groups.yahoo.com/group/ai-philosophy
> http://myspace.com/arizanesil http://myspace.com/malfunct

--
Eray Ozkural, PhD candidate. Comp. Sci. Dept., Bilkent University, Ankara
http://groups.yahoo.com/group/ai-philosophy
http://myspace.com/arizanesil http://myspace.com/malfunct