From: "Steve Johnson" <scj@yaccman.com>
To: "Nelson H. F. Beebe", tuhs@minnie.tuhs.org
Date: Wed, 27 Jun 2018 09:00:16 -0700
Subject: Re: [TUHS] PDP-11 legacy, C, and modern architectures

I agree that C is a bad language for parallelism, and, like it or not, that's what today's hardware is giving us -- not speed, but many independent processors. But I'd argue that its problem isn't that it is not low-level, but that it is not high-level enough. A language like MATLAB, whose basic data object is an N-dimensional tensor, can make impressive use of parallel hardware.

Consider matrix multiplication. Multiplying two NxN arrays to get another NxN array is a classic data-parallel problem -- each value in the result matrix is completely independent of every other one. In theory, we could dedicate a processor to each output element and would not need any cache coherency or locking mechanism -- just let them go at it; the trickiest part is deciding when you are finished.
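To make that concrete, here is a minimal sketch in C, with an OpenMP pragma standing in for "a processor per output element" purely for illustration (the function name is made up; nothing here is specific to any one mechanism):

    /* Every c[i*n+j] depends only on row i of a and column j of b, so all
       n*n (i,j) iterations can run concurrently with no locking.  The
       implicit barrier at the end of the parallel loop is the "deciding
       when you are finished" part.  Compiled without -fopenmp, the pragma
       is ignored and the code simply runs serially. */
    void matmul(long n, const double *a, const double *b, double *c)
    {
        #pragma omp parallel for collapse(2)
        for (long i = 0; i < n; i++)
            for (long j = 0; j < n; j++) {
                double sum = 0.0;
                for (long k = 0; k < n; k++)
                    sum += a[i*n + k] * b[k*n + j];
                c[i*n + j] = sum;
            }
    }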

The reason we know we are data parallel is not because of any feature of the language -- it's because of the mathematical structure of the problem. While it's easy to write a matrix multiply function in C (as it is in most languages), just the fact that the arguments are pointers is enough to make the data parallelism invisible from within the function. You can bolt on additional features that, in effect, tell the compiler it should treat the inputs as independent and non-overlapping, but this is just the tip of the iceberg -- real parallel problems see this in spades.
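In C, the best-known such bolted-on feature is the C99 restrict qualifier -- a promise to the compiler that the pointers never alias, which is exactly the fact the types themselves cannot express:

    /* C99 restrict: a promise that a, b, and c never overlap.  Without
       it, the compiler must assume a store through c might change a or
       b, so the parallelism that is obvious from the math stays hidden
       inside the function. */
    void matmul_r(long n, const double *restrict a,
                  const double *restrict b, double *restrict c)
    {
        for (long i = 0; i < n; i++)
            for (long j = 0; j < n; j++) {
                double sum = 0.0;
                for (long k = 0; k < n; k++)
                    sum += a[i*n + k] * b[k*n + j];
                c[i*n + j] = sum;
            }
    }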

The other hardware factor that comes into play is that hardware, especially memory, has physical limits on what it can do. So the "ideal" matrix multiply with a processor for each output element would suffer because many of the processors would be trying to read the same memory at the same time. Some would be bound to fail, requiring the ability to queue requests and restart them, as well as to pause a processor until its data was available. (Note that, in this and many other cases, we don't need cache coherency, because the input data is not changing while we are using it.) The obvious way around this is to divide the memory into many small memories that sit close to the processors, so that memory access is not the bottleneck.
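The familiar software analogue of "many small memories close to the processors" is tiling: each worker copies the blocks it is about to use into a small private buffer, and the repeated reads then hit the private copy instead of contending for the shared array. A sketch, where TS and the function name are made up and n is assumed to be a multiple of TS to keep it short:

    #define TS 32  /* tile edge; in hardware, sized to fit the local memory */

    /* Tiled multiply: ta and tb play the role of the small memories close
       to the processors.  Each input tile is fetched from shared memory
       once, then reused TS times from the private copy. */
    void matmul_tiled(long n, const double *a, const double *b, double *c)
    {
        double ta[TS][TS], tb[TS][TS];           /* private "local memories" */

        for (long bi = 0; bi < n; bi += TS)
            for (long bj = 0; bj < n; bj += TS) {
                double acc[TS][TS] = {{0}};
                for (long bk = 0; bk < n; bk += TS) {
                    for (long i = 0; i < TS; i++)       /* fetch both tiles once */
                        for (long k = 0; k < TS; k++) {
                            ta[i][k] = a[(bi + i)*n + (bk + k)];
                            tb[i][k] = b[(bk + i)*n + (bj + k)];
                        }
                    for (long i = 0; i < TS; i++)       /* compute from the copies */
                        for (long j = 0; j < TS; j++)
                            for (long k = 0; k < TS; k++)
                                acc[i][j] += ta[i][k] * tb[k][j];
                }
                for (long i = 0; i < TS; i++)           /* write the finished tile */
                    for (long j = 0; j < TS; j++)
                        c[(bi + i)*n + (bj + j)] = acc[i][j];
            }
    }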

And this is where C (and Python) fall shortest. The idea that there is one memory space of semi-infinite size, into which all pointers point and in which all variables live, almost forces attempts at parallelism to be expensive and performance-killing. And yet, because of C's limited, "low-level" approach to data, we are stuck. Being able to declare that something is a tensor that will be unchanging while it is used, that can be distributed across many small memories to prevent data bottlenecks when reading and writing, and that can be changed only in limited and controlled ways is the key to unlocking serious performance.

Steve

PS: for some further thoughts, see https://wavecomp.ai/blog/auto-hardware-and-ai


