From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <Y8sUnihzhzTBOuKMJUnuV0DUZEqHb223xyoxXTmq-eMAe4HFZLgce38hxypW1K9UozOjAJxyXIpwzsWCfnZCRXTXictF--9hPEM__lviJ9A=@protonmail.com>
 <d90865c0-7c1c-c726-83c2-a7114e31bf19@aueb.gr> <20230319134701.3A262220F7@orac.inputplus.co.uk>
 <CAKzdPgxYPS3-XAJh2bzBRxSeJPO4tyietcNY+eyUvfYGR8fHow@mail.gmail.com> <20230322022526.GI3779@mcvoy.com>
In-Reply-To: <20230322022526.GI3779@mcvoy.com>
From: Rob Pike <robpike@gmail.com>
Date: Wed, 22 Mar 2023 13:52:16 +1100
Message-ID: <CAKzdPgyZEwESciO4HwQ0yGLQMX9PPTdQM40SK605ZPtJzozMGQ@mail.gmail.com>
To: Larry McVoy <lm@mcvoy.com>
Content-Type: multipart/alternative; boundary="0000000000001ed17a05f774401e"
CC: tuhs@tuhs.org
X-Mailman-Version: 3.3.6b1
Precedence: list
Subject: [TUHS] Re: Bell Foreign-Language UNIX Efforts
List-Id: The Unix Heritage Society mailing list <tuhs.tuhs.org>
Archived-At: <https://www.tuhs.org/mailman3/hyperkitty/list/tuhs@tuhs.org/message/NI6WPNPUYHGUZ7BHZBZBAIW424VTVTGZ/>
List-Archive: <https://www.tuhs.org/mailman3/hyperkitty/list/tuhs@tuhs.org/>
List-Help: <mailto:tuhs-request@tuhs.org?subject=help>
List-Owner: <mailto:tuhs-owner@tuhs.org>
List-Post: <mailto:tuhs@tuhs.org>
List-Subscribe: <mailto:tuhs-join@tuhs.org>
List-Unsubscribe: <mailto:tuhs-leave@tuhs.org>

--0000000000001ed17a05f774401e
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Thanks for your support but C89 didn't specify an encoding. In classic
committee fashion, it refused to take a stand about anything that might
limit adoption. The problem was that the API it offered was clumsy and made
encoding errors hard to ignore. (Grepping a file for a string, do you
really care if there is an irrelevant binary blob in the middle that isn't
kosher UTF-8?) Also, it provided no support for printing "wide" characters.
This is all covered in the paper cited above.*

The original UTF was compatible with ASCII but not robust if there was an
alignment problem, and also used printable ASCII characters in multibyte
sequences. You could find a '/' inside a Cyrillic character encoding, which
broke Unix badly. That's why FSS-UTF, File System Safe UTF, was the name
given to Prosser's variant.

It's wrong to give us credit for properties we didn't introduce. But UTF-8
is more regular, simpler to encode and decode, and more robust than its
predecessors. Most important, it did introduce the self-synchronization
property, which was the key that opened the door for us at X/Open.

-rob

* In a classic Usenix whoops, the paper had an appendix that described
UTF-8's encoding rigorously, but that was dropped when it was published in
the conference proceedings. Perhaps that's why the RFC got in the mix and
started some of the confusion about its origin.


On Wed, Mar 22, 2023 at 1:25=E2=80=AFPM Larry McVoy <lm@mcvoy.com> wrote:

> The brilliance of UTF-8 was to encode ASCII as is.  That seems obvious in
> retrospect but as Rob says, the multibyte crud in C89 was just awful,
> and that was the answer at the time.  Fitting ASCII in as is meant
> that all of the Unix utilities, sed, grep, awk, etc, had close to no
> performance hit if you were processing ascii.  That's pretty cool when
> you get that and you can process Japanese et al as well.
>
> I kind of cringe when I say it is brilliant to not break what exists
> already, to me, that's just part of what you do as an engineer.  But
> history has shown that not breaking stuff, fitting the new into the
> old, is brilliant.  So kudos to Rob and Ken for doing that (but truth
> be told, I'd be stunned if they didn't, they are great engineers).
>
> On Mon, Mar 20, 2023 at 07:27:34AM +1100, Rob Pike wrote:
> > As my mail quoted in
> > https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt says,
> > Ken worked out a new packing that avoided all the problems with the
> > existing ones. He didn't alter Prosser's encoding. UTF-8, as it was
> > later called, was not based on anything but it was deeply informed by
> > a couple of years of work coming to grips with the problem of
> > programming with multibyte characters. What Prosser did do, and what
> > we - all of us - are very grateful for, is start the conversation
> > about replacing UTF with something practical.
> >
> > (Speaking of design by committee, the multibyte stuff in C89 was
> > atrocious, and I heard was done in committee to get someone, perhaps
> > the Japanese, to sign off.)
> >
> > Regarding Windows, Nathan Myhrvold visited Bell Labs around this time,
> > and we tried to talk to him about this, but he wasn't interested,
> > claiming they had it all worked out. We later learned what he meant,
> > and lamented. Not the only time someone wasn't open to hear an idea
> > that might be worth hearing, but an educational one.
> >
> > It's important historically to understand how all the forces came
> > together that day. The world was ready for a solution to
> > international text, the proposed character set was acceptable to most
> > but the ASCII compatibility issues were unbearable, the proposed
> > solution to that was noxious, various committees were starting to
> > solve the problem in committee, leading to technical briefs of
> > varying quality, none right, and somehow a phone call was made one
> > afternoon to a couple of people who had been thinking and working
> > these issues for ages, one of whom was a genius. And it all worked
> > out, which is truly unusual.
> >
> > -rob
>
> --
> ---
> Larry McVoy           Retired to fishing
> http://www.mcvoy.com/lm/boat
>

--0000000000001ed17a05f774401e
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div class=3D"gmail_default" style=3D"font-family:arial,sa=
ns-serif">Thanks for your support but C89 didn&#39;t specify an encoding. I=
n classic committee fashion, it refused to take a stand about anything that=
 might limit adoption. The problem was that the API it offered was clumsy a=
nd made encoding errors hard to ignore. (Grepping a file for a string, do y=
ou really care if there is an irrelevant binary blob in the middle that isn=
&#39;t kosher UTF-8?) Also, it provided no support for printing &quot;wide&=
quot; characters. This is all covered in the paper cited above.*</div><div =
class=3D"gmail_default" style=3D"font-family:arial,sans-serif"><br></div><d=
iv class=3D"gmail_default" style=3D"font-family:arial,sans-serif">The origi=
nal UTF was compatible with ASCII but not robust if there was an alignment =
problem, and also used printable ASCII characters in multibyte sequences. Y=
ou could find a &#39;/&#39; inside a Cyrillic character encoding, which bro=
ke Unix badly. That&#39;s why FSS-UTF, File System Safe UTF, was the name =
given to Prosser&#39;s variant.</div><div class=3D"gmail_default" style=3D=
"font-fam=
ily:arial,sans-serif"><br></div><div class=3D"gmail_default" style=3D"font-=
family:arial,sans-serif">It&#39;s wrong to give us credit for properties we=
 didn&#39;t introduce. But UTF-8 is more regular, simpler to encode and dec=
ode, and more robust than its predecessors. Most important, it did introduc=
e the self-synchronization property, which was the key that opened the =
door for us at X/Open.</div><div class=3D"gmail_default" style=3D=
"font-family:a=
rial,sans-serif"><br></div><div class=3D"gmail_default" style=3D"font-famil=
y:arial,sans-serif">-rob</div><div class=3D"gmail_default" style=3D"font-fa=
mily:arial,sans-serif"><br></div><div class=3D"gmail_default" style=3D"font=
-family:arial,sans-serif">* In a classic Usenix whoops, the paper had an ap=
pendix that described UTF-8&#39;s encoding rigorously, but that was dropped=
 when it was published in the conference proceedings. Perhaps that&#39;s wh=
y the RFC got in the mix and started some of the confusion about its origin=
.</div><div class=3D"gmail_default" style=3D"font-family:arial,sans-serif">=
<br></div></div><br><div class=3D"gmail_quote"><div dir=3D"ltr" class=3D"gm=
ail_attr">On Wed, Mar 22, 2023 at 1:25=E2=80=AFPM Larry McVoy &lt;<a href=
=3D"mailto:lm@mcvoy.com">lm@mcvoy.com</a>&gt; wrote:<br></div><blockquote c=
lass=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px soli=
d rgb(204,204,204);padding-left:1ex">The brilliance of UTF-8 was to encode =
ASCII as is.=C2=A0 That seems obvious in<br>
retrospect but as Rob says, the multibyte crud in C89 was just awful,<br>
and that was the answer at the time.=C2=A0 Fitting ASCII in as is meant<br>
that all of the Unix utilities, sed, grep, awk, etc, had close to no<br>
performance hit if you were processing ascii.=C2=A0 That&#39;s pretty cool =
when<br>
you get that and you can process Japanese et al as well.<br>
<br>
I kind of cringe when I say it is brilliant to not break what exists<br>
already, to me, that&#39;s just part of what you do as an engineer.=C2=A0 B=
ut<br>
history has shown that not breaking stuff, fitting the new into the <br>
old, is brilliant.=C2=A0 So kudos to Rob and Ken for doing that (but truth<=
br>
be told, I&#39;d be stunned if they didn&#39;t, they are great engineers).<=
br>
<br>
On Mon, Mar 20, 2023 at 07:27:34AM +1100, Rob Pike wrote:<br>
&gt; As my mail quoted in<br>
&gt; <a href=3D"https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt" rel=
=3D"noreferrer" target=3D"_blank">https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8=
-history.txt</a> says,<br>
&gt; Ken worked out a new packing that avoided all the problems with the<br=
>
&gt; existing ones. He didn&#39;t alter Prosser&#39;s encoding. UTF-8, as i=
t was later<br>
&gt; called, was not based on anything but it was deeply informed by a coup=
le of<br>
&gt; years of work coming to grips with the problem of programming with<br>
&gt; multibyte characters. What Prosser did do, and what we - all of us - a=
re<br>
&gt; very grateful for, is start the conversation about replacing UTF with<=
br>
&gt; something practical.<br>
&gt; <br>
&gt; (Speaking of design by committee, the multibyte stuff in C89 was atroc=
ious,<br>
&gt; and I heard was done in committee to get someone, perhaps the Japanese=
, to<br>
&gt; sign off.)<br>
&gt; <br>
&gt; Regarding Windows, Nathan Myhrvold visited Bell Labs around this time,=
 and<br>
&gt; we tried to talk to him about this, but he wasn&#39;t interested, clai=
ming they<br>
&gt; had it all worked out. We later learned what he meant, and lamented. N=
ot<br>
&gt; the only time someone wasn&#39;t open to hear an idea that might be wo=
rth<br>
&gt; hearing, but an educational one.<br>
&gt; <br>
&gt; It&#39;s important historically to understand how all the forces came =
together<br>
&gt; that day. The world was ready for a solution to international text, th=
e<br>
&gt; proposed character set was acceptable to most but the ASCII compatibil=
ity<br>
&gt; issues were unbearable, the proposed solution to that was noxious, var=
ious<br>
&gt; committees were starting to solve the problem in committee, leading to=
<br>
&gt; technical briefs of varying quality, none right, and somehow a phone c=
all<br>
&gt; was made one afternoon to a couple of people who had been thinking and=
<br>
&gt; working these issues for ages, one of whom was a genius. And it all wo=
rked<br>
&gt; out, which is truly unusual.<br>
&gt; <br>
&gt; -rob<br>
<br>
-- <br>
---<br>
Larry McVoy=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Retired to fishing=C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 <a href=3D"http://www.mcvoy.com/lm/boat" re=
l=3D"noreferrer" target=3D"_blank">http://www.mcvoy.com/lm/boat</a><br>
</blockquote></div>

--0000000000001ed17a05f774401e--