From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <jonathandeanharrop@googlemail.com>
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105])
	by yquem.inria.fr (Postfix) with ESMTP id 4AC1BBC57
	for <caml-list@yquem.inria.fr>; Wed, 17 Nov 2010 04:47:39 +0100 (CET)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AlcDAALf4kxKfVK0kGdsb2JhbACBfpIRgVWEUgGIAwgVAQEBAQkJDAcRAx+mPot6AQWOBQEEhUs
X-IronPort-AV: E=Sophos;i="4.59,209,1288566000"; 
   d="scan'208,217";a="78733500"
Received: from mail-wy0-f180.google.com ([74.125.82.180])
  by mail4-smtp-sop.national.inria.fr with ESMTP; 17 Nov 2010 04:47:38 +0100
Received: by wya21 with SMTP id 21so1519875wya.39
        for <caml-list@yquem.inria.fr>; Tue, 16 Nov 2010 19:47:38 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=googlemail.com; s=gamma;
        h=domainkey-signature:received:received:from:to:cc:references
         :in-reply-to:subject:date:organization:message-id:mime-version
         :content-type:x-mailer:thread-index:content-language;
        bh=6wEH7KcfxLbWsJtZ3uV9NToPcM4F0eoQEwqVJXZWFqY=;
        b=Tt7UIZPJd75o2+CCKR706GaDCY4iOYk1CIYXe36WIPGBcg1An++VOVs1iIoxBy7ubi
         D4ly7YX6mbsQtegD1B3kaICXYNFHzZXLncaauUxnXI5/A6v1li6nYNL8spCjiRFSsoTj
         c1TFChnPEQI7OadJA5bc44K06OS1JvLX/siyE=
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=googlemail.com; s=gamma;
        h=from:to:cc:references:in-reply-to:subject:date:organization
         :message-id:mime-version:content-type:x-mailer:thread-index
         :content-language;
        b=OIbwu2uwDkgWdKWJb+iEKJ7/MdKetIYQYytMiX4EGE0OuJzH3UCSHrTOOdmPek/4Mu
         WUF0RWF4lG4pxJfU3Tf80yIzpdCYiTmUWZz5Y4hJW/ojJkdQB1aH8Mc/GRGekdIjs+06
         CPSv4sZjwf6OJaJn8E9udm4X2WpfBvhn9z544=
Received: by 10.227.152.70 with SMTP id f6mr8646659wbw.28.1289965656706;
        Tue, 16 Nov 2010 19:47:36 -0800 (PST)
Received: from WinEight ([87.112.168.151])
        by mx.google.com with ESMTPS id f14sm1269278wbe.14.2010.11.16.19.47.34
        (version=SSLv3 cipher=RC4-MD5);
        Tue, 16 Nov 2010 19:47:35 -0800 (PST)
From: Jon Harrop <jonathandeanharrop@googlemail.com>
To: "'Eray Ozkural'" <examachine@gmail.com>
Cc: <caml-list@yquem.inria.fr>
References: <20101115182737.42b8dcae@loki.yggdrasil.draxit.de>	<4CE228CA.3030503@gmail.com> <1289927042.16005.176.camel@thinkpad>	<AANLkTim2r0S7ZSmXFK87-u2PKRqvyuiBkH5rQfp7ju_C@mail.gmail.com>	<1289945605.16005.205.camel@thinkpad> <AANLkTimgyL2xzUnuSzf-upSyjicffC-6-rgSK=EfxLTj@mail.gmail.com>
In-Reply-To: <AANLkTimgyL2xzUnuSzf-upSyjicffC-6-rgSK=EfxLTj@mail.gmail.com>
Subject: RE: [Caml-list] SMP multithreading
Date: Wed, 17 Nov 2010 03:47:14 -0000
Organization: Flying Frog Consultancy
Message-ID: <02be01cb860a$208f7aa0$61ae6fe0$@com>
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_NextPart_000_02BF_01CB860A.208F7AA0"
X-Mailer: Microsoft Office Outlook 12.0
Thread-Index: AcuF4sLzuDqjoHmjSRqt9gVw+re+egAJqMjg
Content-Language: en-gb
X-Spam: no; 0.00; granularity:01 locality:01 mutation:01 cheers:01 eray:01 ozkural:01 gerd:01 stolpmann:01 gerd:01 stolpmann:01 eray:01 ozkural:01 compiler:01 runtime:01 runtime:01 

This is a multi-part message in MIME format.

------=_NextPart_000_02BF_01CB860A.208F7AA0
Content-Type: text/plain;
	charset="US-ASCII"
Content-Transfer-Encoding: 7bit

Granularity and cache complexity are the reasons why not. If you find
anything and everything that can be done in parallel and parallelize it then
you generally obtain only slowdowns. An essential trick is to exploit
locality via mutation but, of course, purely functional programming sucks at
that by design not least because it is striving to abstract that concept
away.

 
I share your dream but I doubt it will ever be realized.

 
Cheers,

Jon.

 
From: caml-list-bounces@yquem.inria.fr
[mailto:caml-list-bounces@yquem.inria.fr] On Behalf Of Eray Ozkural
Sent: 16 November 2010 23:05
To: Gerd Stolpmann
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] SMP multithreading

 
On Wed, Nov 17, 2010 at 12:13 AM, Gerd Stolpmann <info@gerd-stolpmann.de>
wrote:

Am Dienstag, den 16.11.2010, 22:35 +0200 schrieb Eray Ozkural:

>
>
> On Tue, Nov 16, 2010 at 7:04 PM, Gerd Stolpmann
> <info@gerd-stolpmann.de> wrote:
>         Am Montag, den 15.11.2010, 22:46 -0800 schrieb Edgar
>         Friendly:
>              * As somebody mentioned "implicit parallelization": Don't
>         expect
>                anything from this. Even if a good compiler finds ways
>         to
>                parallelize 20% of the code (which would be a lot), the
>         runtime
>                effect would be marginal. 80% of the code is run at
>         normal speed
>                (hopefully) and dominates the runtime behavior. The
>         point is
>                that such compiler-driven code improvements are only
>         local
>                optimizations. For getting good parallelization results
>         you need
>                to restructure the design of the program - well, maybe
>                compiler2.0 can do this at some time, but this is not
>         in sight.
>
> I think you are underestimating parallelizing compilers.

I was more citing Amdahl's law, and did not want to criticize any effort
in this area. It's more the usefulness for the majority of problems. How
useful is the best parallelizing compiler if only a small part of the
program _can_ actually benefit from it? Think about it. If you are not
working in an area where many subroutines can be sped up, you consider
this way of parallelizing as a waste of time. And this is still true for
the majority of problems. Also, for many problems that can be tackled,
the scalability is very limited.
  

What makes you think only 20% of the code can be parallelized? I've worked
in such a compiler project, and there were way too many opportunities for
parallelization in ordinary C code, let alone a functional language;
implicit parallelism would work wonders there. Of course you may think
whatever you were thinking, but I know that a high degree of parallelism can
be achieved through a functional language. I can't tell you much more
though. If you think that a very small portion of the code can be
parallelized you probably do not appreciate the kinds of static and dynamic
analysis those compilers perform. Of course if you are thinking of applying
it to some office or e-mail application, this might not be the case, but
automatic parallelization strategies would work best when you apply them to
a computationally intensive program.

The really limiting factor for current functional languages would be their
reliance on inherently sequential primitives like list processing, which may
in some cases limit the compiler to only pipeline parallelism (something
non-trivial to do by hand actually). Instead they would have to get their
basic forms from high-level parallel PLs (which might mean rewriting a lot
of things). The programs could look like something out of a category theory
textbook then. But I think even without modification you could get a lot of
speedup from ocaml code by applying state-of-the-art automatic
parallelization techniques. By the way, how much of the serial work is
parallelized that's what matters rather than the code length that is. Even
if the parallelism is not so obvious, the analysis can find out which
iterations are parallelizable, which variables have which kinds of
dependencies, memory dependencies, etc. etc. In the public, there is this
perception that automatic parallelization does not work, which is wrong,
while it is true that the popular compiler vendors do not understand much in
this regard, all I've seen (in popular products) are lame attempts to
generate vector instructions....

That is not to say, that implicit parallelization of current code can
replace parallel algorithm design (which is AI-complete). Rather, I think,
implicit parallelization is one of the things that will help parallel
computing people: by having them work at a more abstract level, exposing
more concurrency through functional forms, yet avoiding writing low-level
comms code. High-level explicit parallelism is also quite favorable, as
there may be many situations where, say, dynamic load-balancing approaches
are suitable. The approach of HPF might be relevant here, with the
programmer making annotations to guide the distribution of data structures,
and the rest inferred from code.

So whoever says that isn't possible, probably hasn't read up much in the
computer architecture community. Probably even expert PL researchers may be
misled here, as they have made the quoted remark or similar remarks about
multi-core/SMP  architectures. It was known for a very long time (20+ years)
that the clock speeds would hit a wall and then we'd have to expend more
area. This is true regardless of the underlying architecture/technology
actually. Mother nature is parallel. It is the sequence that is an
abstraction. And I wonder what is more natural than expressing a solution to
a problem in terms of functions and sets? The kind of non-leaky abstractions
required can be provided by a type-safe functional language and then the
compiler will have a lot more to work on than C code, although analysis can
reveal much about even C code. With the correct heuristics, it might decide
on good data and task distribution strategies, translating say a set
constructor notation to an efficient parallel algorithm, why not??

What I say applies to parallelization based on analysis, of course for
functional languages, other parallelization strategies are also possible, I
suppose low-level task parallelization and dynamic load-balancing strategies
(where functions or closures are distributed) are the most popular (the
reasoning being there can be many independent function evaluations
concurrently), I wonder if it has been attempted at all for ocaml? The
impurity of the language can be dealt with easily. And is there anyone who
might make suggestions to somebody who would like to work on automatic
parallelization of ocaml code? I actually think it might be fun to try some
of the simpler strategies and see how they yield on current hardware
platforms like multi-core and GPU. We ideally wouldn't like just good
automatic parallelization of, say, linear algebra code, but also cool stuff
like functional data structures. Recursive procedures can be mapped to
threads, it might match best hand-written parallelizations, actually, and
you'd get platform independence as a bonus if the compiler is well-designed
:) 

Best,


-- 
Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University, Ankara
http://groups.yahoo.com/group/ai-philosophy
http://myspace.com/arizanesil http://myspace.com/malfunct


------=_NextPart_000_02BF_01CB860A.208F7AA0
Content-Type: text/html;
	charset="US-ASCII"
Content-Transfer-Encoding: quoted-printable

<html xmlns:v=3D"urn:schemas-microsoft-com:vml" =
xmlns:o=3D"urn:schemas-microsoft-com:office:office" =
xmlns:w=3D"urn:schemas-microsoft-com:office:word" =
xmlns:m=3D"http://schemas.microsoft.com/office/2004/12/omml" =
xmlns=3D"http://www.w3.org/TR/REC-html40"><head><meta =
http-equiv=3DContent-Type content=3D"text/html; =
charset=3Dus-ascii"><meta name=3DGenerator content=3D"Microsoft Word 12 =
(filtered medium)"><style><!--
/* Font Definitions */
@font-face
	{font-family:"Cambria Math";
	panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
	{font-family:Calibri;
	panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
	{font-family:Tahoma;
	panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{margin:0cm;
	margin-bottom:.0001pt;
	font-size:12.0pt;
	font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
	{mso-style-priority:99;
	color:blue;
	text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
	{mso-style-priority:99;
	color:purple;
	text-decoration:underline;}
span.EmailStyle17
	{mso-style-type:personal-reply;
	font-family:"Calibri","sans-serif";
	color:#1F497D;}
.MsoChpDefault
	{mso-style-type:export-only;}
@page WordSection1
	{size:612.0pt 792.0pt;
	margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
	{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext=3D"edit" spidmax=3D"1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext=3D"edit">
<o:idmap v:ext=3D"edit" data=3D"1" />
</o:shapelayout></xml><![endif]--></head><body lang=3DEN-GB link=3Dblue =
vlink=3Dpurple><div class=3DWordSection1><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>Granularity and cache complexity are the reasons why not. If you find =
anything and everything that can be done in parallel and parallelize it =
then you generally obtain only slowdowns. An essential trick is to =
exploit locality via mutation but, of course, purely functional =
programming sucks at that by design not least because it is striving to =
abstract that concept away.<o:p></o:p></span></p><p =
class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'><o:p>&nbsp;</o:p></span></p><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>I share your dream but I doubt it will ever be =
realized.<o:p></o:p></span></p><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'><o:p>&nbsp;</o:p></span></p><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>Cheers,<o:p></o:p></span></p><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>Jon.<o:p></o:p></span></p><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'><o:p>&nbsp;</o:p></span></p><div =
style=3D'border:none;border-left:solid blue 1.5pt;padding:0cm 0cm 0cm =
4.0pt'><div><div style=3D'border:none;border-top:solid #B5C4DF =
1.0pt;padding:3.0pt 0cm 0cm 0cm'><p class=3DMsoNormal><b><span =
lang=3DEN-US =
style=3D'font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span>=
</b><span lang=3DEN-US =
style=3D'font-size:10.0pt;font-family:"Tahoma","sans-serif"'> =
caml-list-bounces@yquem.inria.fr =
[mailto:caml-list-bounces@yquem.inria.fr] <b>On Behalf Of </b>Eray =
Ozkural<br><b>Sent:</b> 16 November 2010 23:05<br><b>To:</b> Gerd =
Stolpmann<br><b>Cc:</b> caml-list@yquem.inria.fr<br><b>Subject:</b> Re: =
[Caml-list] SMP multithreading<o:p></o:p></span></p></div></div><p =
class=3DMsoNormal><o:p>&nbsp;</o:p></p><p class=3DMsoNormal =
style=3D'margin-bottom:12.0pt'><o:p>&nbsp;</o:p></p><div><p =
class=3DMsoNormal>On Wed, Nov 17, 2010 at 12:13 AM, Gerd Stolpmann =
&lt;<a =
href=3D"mailto:info@gerd-stolpmann.de">info@gerd-stolpmann.de</a>&gt; =
wrote:<o:p></o:p></p><p class=3DMsoNormal>Am Dienstag, den 16.11.2010, =
22:35 +0200 schrieb Eray Ozkural:<o:p></o:p></p><div><p =
class=3DMsoNormal style=3D'margin-bottom:12.0pt'>&gt;<br>&gt;<br>&gt; On =
Tue, Nov 16, 2010 at 7:04 PM, Gerd Stolpmann<br>&gt; &lt;<a =
href=3D"mailto:info@gerd-stolpmann.de">info@gerd-stolpmann.de</a>&gt; =
wrote:<br>&gt; &nbsp; &nbsp; &nbsp; &nbsp; Am Montag, den 15.11.2010, =
22:46 -0800 schrieb Edgar<br>&gt; &nbsp; &nbsp; &nbsp; &nbsp; =
Friendly:<br>&gt; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;* As =
somebody mentioned &quot;implicit parallelization&quot;: Don't<br>&gt; =
&nbsp; &nbsp; &nbsp; &nbsp; expect<br>&gt; &nbsp; &nbsp; &nbsp; &nbsp; =
&nbsp; &nbsp; &nbsp; &nbsp;anything from this. Even if a good compiler =
finds ways<br>&gt; &nbsp; &nbsp; &nbsp; &nbsp; to<br>&gt; &nbsp; &nbsp; =
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;parallelize 20% of the code =
(which would be a lot), the<br>&gt; &nbsp; &nbsp; &nbsp; &nbsp; =
runtime<br>&gt; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; =
&nbsp;effect would be marginal. 80% of the code is run at<br>&gt; &nbsp; =
&nbsp; &nbsp; &nbsp; normal speed<br>&gt; &nbsp; &nbsp; &nbsp; &nbsp; =
&nbsp; &nbsp; &nbsp; &nbsp;(hopefully) and dominates the runtime =
behavior. The<br>&gt; &nbsp; &nbsp; &nbsp; &nbsp; point is<br>&gt; =
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;that such =
compiler-driven code improvements are only<br>&gt; &nbsp; &nbsp; &nbsp; =
&nbsp; local<br>&gt; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; =
&nbsp;optimizations. For getting good parallelization results<br>&gt; =
&nbsp; &nbsp; &nbsp; &nbsp; you need<br>&gt; &nbsp; &nbsp; &nbsp; &nbsp; =
&nbsp; &nbsp; &nbsp; &nbsp;to restructure the design of the program - =
well, maybe<br>&gt; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; =
&nbsp;compiler2.0 can do this at some time, but this is not<br>&gt; =
&nbsp; &nbsp; &nbsp; &nbsp; in sight.<br>&gt;<br>&gt; I think you are =
underestimating parallelizing compilers.<o:p></o:p></p></div><p =
class=3DMsoNormal>I was more citing Amdahl's law, and did not want to =
criticize any effort<br>in this area. It's more the usefulness for the =
majority of problems. How<br>useful is the best parallelizing compiler =
if only a small part of the<br>program _can_ actually benefit from it? =
Think about it. If you are not<br>working in an area where many =
subroutines can be sped up, you consider<br>this way of parallelizing as =
a waste of time. And this is still true for<br>the majority of problems. =
Also, for many problems that can be tackled,<br>the scalability is very =
limited.<br>&nbsp; <o:p></o:p></p><div><p class=3DMsoNormal><br>What =
makes you think only 20% of the code can be parallelized? I've worked in =
such a compiler project, and there were way too many opportunities for =
parallelization in ordinary C code, let alone a functional language; =
implicit parallelism would work wonders there. Of course you may think =
whatever you were thinking, but I know that a high degree of parallelism =
can be achieved through a functional language. I can't tell you much =
more though. If you think that a very small portion of the code can be =
parallelized you probably do not appreciate the kinds of static and =
dynamic analysis those compilers perform. Of course if you are thinking =
of applying it to some office or e-mail application, this might not be =
the case, but automatic parallelization strategies would work best when =
you apply them to a computationally intensive program.<br><br>The really =
limiting factor for current functional languages would be their reliance =
on inherently sequential primitives like list processing, which may in =
some cases limit the compiler to only pipeline parallelism (something =
non-trivial to do by hand actually). Instead they would have to get =
their basic forms from high-level parallel PLs (which might mean =
rewriting a lot of things). The programs could look like something out =
of a category theory textbook then. But I think even without =
modification you could get a lot of speedup from ocaml code by applying =
state-of-the-art automatic parallelization techniques. By the way, how =
much of the serial work is parallelized that's what matters rather than =
the code length that is. Even if the parallelism is not so obvious, the =
analysis can find out which iterations are parallelizable, which =
variables have which kinds of dependencies, memory dependencies, etc. =
etc. In the public, there is this perception that automatic =
parallelization does not work, which is wrong, while it is true that the =
popular compiler vendors do not understand much in this regard, all I've =
seen (in popular products) are lame attempts to generate vector =
instructions....<br><br>That is not to say, that implicit =
parallelization of current code can replace parallel algorithm design =
(which is AI-complete). Rather, I think, implicit parallelization is one =
of the things that will help parallel computing people: by having them =
work at a more abstract level, exposing more concurrency through =
functional forms, yet avoiding writing low-level comms code. High-level =
explicit parallelism is also quite favorable, as there may be many =
situations where, say, dynamic load-balancing approaches are suitable. =
The approach of HPF might be relevant here, with the programmer making =
annotations to guide the distribution of data structures, and the rest =
inferred from code.<br><br>So whoever says that isn't possible, probably =
hasn't read up much in the computer architecture community. Probably =
even expert PL researchers may be misled here, as they have made the =
quoted remark or similar remarks about multi-core/SMP&nbsp; =
architectures. It was known for a very long time (20+ years) that the =
clock speeds would hit a wall and then we'd have to expend more area. =
This is true regardless of the underlying architecture/technology =
actually. Mother nature is parallel. It is the sequence that is an =
abstraction. And I wonder what is more natural than expressing a =
solution to a problem in terms of functions and sets? The kind of =
non-leaky abstractions required can be provided by a type-safe =
functional language and then the compiler will have a lot more to work =
on than C code, although analysis can reveal much about even C code. =
With the correct heuristics, it might decide on good data and task =
distribution strategies, translating say a set constructor notation to =
an efficient parallel algorithm, why not??<br><br>What I say applies to =
parallelization based on analysis, of course for functional languages, =
other parallelization strategies are also possible, I suppose low-level =
task parallelization and dynamic load-balancing strategies (where =
functions or closures are distributed) are the most popular (the =
reasoning being there can be many independent function evaluations =
concurrently), I wonder if it has been attempted at all for ocaml? The =
impurity of the language can be dealt with easily. And is there anyone =
who might make suggestions to somebody who would like to work on =
automatic parallelization of ocaml code? I actually think it might be =
fun to try some of the simpler strategies and see how they yield on =
current hardware platforms like multi-core and GPU. We ideally wouldn't =
like just good automatic parallelization of, say, linear algebra code, =
but also cool stuff like functional data structures. Recursive =
procedures can be mapped to threads, it might match best hand-written =
parallelizations, actually, and you'd get platform independence as a =
bonus if the compiler is well-designed :) =
<br><br>Best,<o:p></o:p></p></div></div><p class=3DMsoNormal =
style=3D'margin-bottom:12.0pt'><br>-- <br>Eray Ozkural, PhD =
candidate.&nbsp; Comp. Sci. Dept., Bilkent University, Ankara<br><a =
href=3D"http://groups.yahoo.com/group/ai-philosophy">http://groups.yahoo.=
com/group/ai-philosophy</a><br><a =
href=3D"http://myspace.com/arizanesil">http://myspace.com/arizanesil</a> =
<a =
href=3D"http://myspace.com/malfunct">http://myspace.com/malfunct</a><o:p>=
</o:p></p></div></div></body></html>
------=_NextPart_000_02BF_01CB860A.208F7AA0--