From: "Steve Johnson" <scj@yaccman.com>
To: "Nelson H. F. Beebe", tuhs@minnie.tuhs.org
Date: Wed, 27 Jun 2018 09:00:16 -0700
Subject: Re: [TUHS] PDP-11 legacy, C, and modern architectures

I agree that C is a bad language for parallelism, and, like it or not, that's what today's hardware is giving us -- not speed, but many independent processors. But I'd argue that its problem isn't that it is not low-level, but that it is not high-level enough. A language like MATLAB, whose basic data object is an N-dimensional tensor, can make impressive use of parallel hardware.

Consider matrix multiplication. Multiplying two NxN arrays to get another NxN array is a classic data-parallel problem -- each value in the result matrix is completely independent of every other one. In theory, we could dedicate a processor to each output element and would not need any cache coherency or locking mechanism -- just let them go at it; the trickiest part is deciding when you are finished.
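To make that concrete, here is a minimal sketch in C, with an OpenMP pragma standing in for "a processor per output element" purely for illustration (the function name is made up; nothing here is specific to any one mechanism):

    /* Every c[i*n+j] depends only on row i of a and column j of b, so all
       n*n (i,j) iterations can run concurrently with no locking.  The
       implicit barrier at the end of the parallel loop is the "deciding
       when you are finished" part.  Compiled without -fopenmp, the pragma
       is ignored and the code simply runs serially. */
    void matmul(long n, const double *a, const double *b, double *c)
    {
        #pragma omp parallel for collapse(2)
        for (long i = 0; i < n; i++)
            for (long j = 0; j < n; j++) {
                double sum = 0.0;
                for (long k = 0; k < n; k++)
                    sum += a[i*n + k] * b[k*n + j];
                c[i*n + j] = sum;
            }
    }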

The reason we know we are data parallel is not because of any feature of the language -- it's because of the mathematical structure of the problem. While it's easy to write a matrix multiply function in C (as it is in most languages), just the fact that the arguments are pointers is enough to make the data parallelism invisible from within the function. You can bolt on additional features that, in effect, tell the compiler it should treat the inputs as independent and non-overlapping, but this is just the tip of the iceberg -- real parallel problems see this in spades.
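In C, the best-known such bolted-on feature is the C99 restrict qualifier -- a promise to the compiler that the pointers never alias, which is exactly the fact the types themselves cannot express:

    /* C99 restrict: a promise that a, b, and c never overlap.  Without
       it, the compiler must assume a store through c might change a or
       b, so the parallelism that is obvious from the math stays hidden
       inside the function. */
    void matmul_r(long n, const double *restrict a,
                  const double *restrict b, double *restrict c)
    {
        for (long i = 0; i < n; i++)
            for (long j = 0; j < n; j++) {
                double sum = 0.0;
                for (long k = 0; k < n; k++)
                    sum += a[i*n + k] * b[k*n + j];
                c[i*n + j] = sum;
            }
    }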

The other hardware factor that comes into play is that hardware, especially memory, has physical limits on what it can do. So the "ideal" matrix multiply with a processor for each output element would suffer because many of the processors would be trying to read the same memory at the same time. Some would be bound to fail, requiring the ability to queue requests and restart them, as well as to pause a processor until its data was available. (Note that, in this and many other cases, we don't need cache coherency, because the input data is not changing while we are using it.) The obvious way around this is to divide the memory into many small memories that sit close to the processors, so that memory access is not the bottleneck.
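The familiar software analogue of "many small memories close to the processors" is tiling: each worker copies the blocks it is about to use into a small private buffer, and the repeated reads then hit the private copy instead of contending for the shared array. A sketch, where TS and the function name are made up and n is assumed to be a multiple of TS to keep it short:

    #define TS 32  /* tile edge; in hardware, sized to fit the local memory */

    /* Tiled multiply: ta and tb play the role of the small memories close
       to the processors.  Each input tile is fetched from shared memory
       once, then reused TS times from the private copy. */
    void matmul_tiled(long n, const double *a, const double *b, double *c)
    {
        double ta[TS][TS], tb[TS][TS];           /* private "local memories" */

        for (long bi = 0; bi < n; bi += TS)
            for (long bj = 0; bj < n; bj += TS) {
                double acc[TS][TS] = {{0}};
                for (long bk = 0; bk < n; bk += TS) {
                    for (long i = 0; i < TS; i++)       /* fetch both tiles once */
                        for (long k = 0; k < TS; k++) {
                            ta[i][k] = a[(bi + i)*n + (bk + k)];
                            tb[i][k] = b[(bk + i)*n + (bj + k)];
                        }
                    for (long i = 0; i < TS; i++)       /* compute from the copies */
                        for (long j = 0; j < TS; j++)
                            for (long k = 0; k < TS; k++)
                                acc[i][j] += ta[i][k] * tb[k][j];
                }
                for (long i = 0; i < TS; i++)           /* write the finished tile */
                    for (long j = 0; j < TS; j++)
                        c[(bi + i)*n + (bj + j)] = acc[i][j];
            }
    }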

And this is where C (and Python) fall shortest. The idea that there is one memory space of semi-infinite size, into which all pointers point and in which all variables live, almost forces attempts at parallelism to be expensive and performance-killing. And yet, because of C's limited, "low-level" approach to data, we are stuck. Being able to declare that something is a tensor that will be unchanging while it is used, that can be distributed across many small memories to prevent data bottlenecks when reading and writing, and that can be changed only in limited and controlled ways is the key to unlocking serious performance.

Steve

PS: for some further thoughts, see https://wavecomp.ai/blog/auto-hardware-and-ai


