From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.1 required=5.0 tests=DKIM_SIGNED,DKIM_VALID,
	DKIM_VALID_AU,MAILING_LIST_MULTI autolearn=ham autolearn_force=no
	version=3.4.4
Received: (qmail 8654 invoked from network); 22 Mar 2023 23:34:11 -0000
Received: from minnie.tuhs.org (50.116.15.146)
  by inbox.vuxu.org with ESMTPUTF8; 22 Mar 2023 23:34:11 -0000
Received: from minnie.tuhs.org (localhost [IPv6:::1])
	by minnie.tuhs.org (Postfix) with ESMTP id F0ABA41391;
	Thu, 23 Mar 2023 09:34:04 +1000 (AEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=tuhs.org; s=dkim;
	t=1679528045; h=from:from:reply-to:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:list-id:list-help:
	 list-owner:list-unsubscribe:list-subscribe:list-post;
	bh=zi+d8A4EJVY1gAiY51Ab/zTE+td75n1/VsKJhYUyodo=;
	b=DkxjKmrURCbDSDhIH+zVi1+iegjLfrodCaTVy6beS/prrPEdsZaRgxRyfWVafr8Xi6oh5e
	t+e+vEjoADvHyjMpFxIxbkaZYqG/z00RWoJKZeP7LBgsgNJe114spizjdF2XBZY5YfM36P
	vrcUEUOQkkk3Wwlf9rZC9qy5slm+KHc=
Received: from mail-4319.protonmail.ch (mail-4319.protonmail.ch [185.70.43.19])
	by minnie.tuhs.org (Postfix) with ESMTPS id BFAF34138E
	for <tuhs@tuhs.org>; Thu, 23 Mar 2023 09:33:59 +1000 (AEST)
Date: Wed, 22 Mar 2023 23:33:46 +0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=protonmail.com;
	s=protonmail3; t=1679528037; x=1679787237;
	bh=zi+d8A4EJVY1gAiY51Ab/zTE+td75n1/VsKJhYUyodo=;
	h=Date:To:From:Cc:Subject:Message-ID:In-Reply-To:References:
	 Feedback-ID:From:To:Cc:Date:Subject:Reply-To:Feedback-ID:
	 Message-ID:BIMI-Selector;
	b=vfjAAPThvNms69Ub2yxtHXZStNiJjTJl2SCKQuHOd6kAGcJC4kIFKGQu7t3c+ax+E
	 wWgtHReLqP/xcQ58Ely5IlePIZT7PpI7Kby1Y0ekldrTQU9TWxDwDPSoYvrOKyLsfh
	 Lzsc2RHVPzaakFWMZfO3OoK7uP71RH/Blj4nyOCO4+FCzlFHQVtUa7y5l7owLHBX2x
	 1EqytNid0POUzHiUwu6dJns2Ph4OGJjzp0qD79DM5/Rhh8OmZBibgs7dQOu4mWbljW
	 oc1ro5x42p+cqqpZOZ/Mm6hmWiKNNjPEDlnRd2XmYsQ393NMdv8yeJauHdSQjEYztd
	 rAMED6NDeUpaA==
To: Steffen Nurpmeso <steffen@sdaoden.eu>
Message-ID: <6CpABuCD3_HRWqKPC0cMvF19rgKZ0dztDXADlAsHnwpanC5-1pKJB7lscbhJOwKtjLLjypoW7e-Gp51Mc4UKizn1pGl4FHcgu2KjPhpJ2k0=@protonmail.com>
In-Reply-To: <20230322223307.S67m0%steffen@sdaoden.eu>
References: <Y8sUnihzhzTBOuKMJUnuV0DUZEqHb223xyoxXTmq-eMAe4HFZLgce38hxypW1K9UozOjAJxyXIpwzsWCfnZCRXTXictF--9hPEM__lviJ9A=@protonmail.com> <CAKzdPgyZEwESciO4HwQ0yGLQMX9PPTdQM40SK605ZPtJzozMGQ@mail.gmail.com> <de9a2a47-3fad-c15b-b2f2-80d555aa9955@mehdix.org> <CAKzdPgx=9NOShRa+AWvKMVMYLv=fg8cPPGPCVm3dWkaH8hHgdw@mail.gmail.com> <202303220740.32M7eprr032005@freefriends.org> <CAA1C+h2_AVitY2E8B5YKpdjsJ6eEtDy=1S=GobVpnLiT9TgSSA@mail.gmail.com> <CAA1C+h2KuP95QjvmnMCCgDsCbmXXU-bwnmTyryrH6KTdj7XjFg@mail.gmail.com> <CAKzdPgwYPxK9oYemG5-vPgRR7mSfj_qkjD5-iJnLffP-23PUaQ@mail.gmail.com> <20230322223307.S67m0%steffen@sdaoden.eu>
Feedback-ID: 35591162:user:proton
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Message-ID-Hash: 43TVM5XSMWT7PCVK5GKLVCMP3Y5IXWHY
X-Message-ID-Hash: 43TVM5XSMWT7PCVK5GKLVCMP3Y5IXWHY
X-MailFrom: segaloco@protonmail.com
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: tuhs@tuhs.org
X-Mailman-Version: 3.3.6b1
Precedence: list
Subject: [TUHS] Re: Bell Foreign-Language UNIX Efforts
List-Id: The Unix Heritage Society mailing list <tuhs.tuhs.org>
Archived-At: <https://www.tuhs.org/mailman3/hyperkitty/list/tuhs@tuhs.org/message/43TVM5XSMWT7PCVK5GKLVCMP3Y5IXWHY/>
List-Archive: <https://www.tuhs.org/mailman3/hyperkitty/list/tuhs@tuhs.org/>
List-Help: <mailto:tuhs-request@tuhs.org?subject=help>
List-Owner: <mailto:tuhs-owner@tuhs.org>
List-Post: <mailto:tuhs@tuhs.org>
List-Subscribe: <mailto:tuhs-join@tuhs.org>
List-Unsubscribe: <mailto:tuhs-leave@tuhs.org>
From: segaloco via TUHS <tuhs@tuhs.org>
Reply-To: segaloco <segaloco@protonmail.com>

I've often pondered on the storage differential that non-ASCII languages
rack up.

Let's say one primarily stores documents in Japanese.  This puts you in the
2-bytes-per-character range with Shift JIS or UTF-16, and at 3 bytes per
character for kana and kanji in UTF-8.  If you go simply by character
count, the same number of characters takes two to three times as many
bytes as ASCII.  Of course, Japanese isn't the best example of this being a
problem: like many other non-phonetic scripts (and even with kana
syllables), it takes fewer characters to convey a thought, cutting the
character count for a complete sentence, even katakana/Hepburn stuff, by at
least half.  All in all, it may break even or come out ahead given
multi-syllable kanji.

A better example of scripts that would likely suffer data bloat would be
Hebrew or Arabic, although, being abjads with diacritics to represent vowel
sounds, they likewise land somewhere like Japanese kana, where a single
glyph represents what in the Latin alphabet would be at least two letters.
I would imagine Cyrillic users, for instance, do actually have to take the
storage hit (2 bytes per letter in UTF-8), since their entire script is
outside ASCII *and* it is a full alphabet, neither an abjad nor
logographic.  Can't say I've worked with much Cyrillic text, though.  That's
not even to mention scripts where a diacritic may be represented by a
separate combining code point that has to be combined with a base
character.
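
As a rough illustration of the byte counts involved, here is a small Python
sketch.  The sample phrases and the helper name byte_costs are made up for
the example, and the phrases are greetings rather than translations of equal
length, so this shows per-character cost, not a fair language-to-language
comparison.

```python
# Illustrative sketch: per-script storage cost under UTF-8 and UTF-16.
# The sample phrases are arbitrary picks, not a representative corpus.
samples = {
    "English": "Hello, world",
    "Russian": "Привет, мир",
    "Hebrew": "שלום עולם",
    "Japanese": "こんにちは世界",
}


def byte_costs(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes without a BOM)."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
    )


for name, text in samples.items():
    chars, utf8, utf16 = byte_costs(text)
    print(f"{name:9} chars={chars:2} utf8={utf8:2} utf16={utf16:2}")

# A diacritic stored as a separate combining code point costs extra:
precomposed = "\u00e9"   # e-acute as one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
print(byte_costs(precomposed), byte_costs(decomposed))
```

The Japanese sample comes out at 7 characters, 21 bytes in UTF-8 and 14 in
UTF-16, while the Russian one is 11 characters and 20 bytes in UTF-8.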

This is, of course, way off-topic for the list, so I don't want to start a
whole side-chain on it, but linguistic storage in computers has interested
me for a long time, especially in my reverse engineering research of old
games, looking at how different studios implemented various code pages for
non-ASCII scripts.  For example, I've seen plenty of older (8/16-bit)
Japanese games that obviously don't use UTF-8, due to overhead in
constrained console environments (or simply because they are older than
UTF-8), but also don't use Shift JIS or other known encodings, instead
opting for their own custom tables to map bytes, usually to kana, although
I haven't really peeked into any engines that use kanji.  Kanji was
uncommon, as video games were typically marketed towards children, who
weren't expected to know enough kanji to read complicated text.  You see
the same today with text in children's media in Japan, where the hiragana
reading of a given kanji is displayed adjacent to it (furigana).
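
To make the custom-table idea concrete, here is a minimal Python sketch.
The table layout is entirely hypothetical (byte values 1 and up assigned to
hiragana in Unicode order) and is not taken from any actual game; the names
encode_custom and decode_custom are likewise invented for the example.

```python
# Hypothetical single-byte kana table, NOT from any real game: byte values
# 1..86 are assigned to the hiragana U+3041..U+3096 in Unicode order.
HIRAGANA = [chr(cp) for cp in range(0x3041, 0x3097)]
DECODE = {index + 1: ch for index, ch in enumerate(HIRAGANA)}
ENCODE = {ch: byte for byte, ch in DECODE.items()}


def encode_custom(text):
    """Pack hiragana text into one byte per character."""
    return bytes(ENCODE[ch] for ch in text)


def decode_custom(data):
    """Map each byte back to its hiragana character."""
    return "".join(DECODE[byte] for byte in data)


msg = "こんにちは"
packed = encode_custom(msg)
print(len(packed), len(msg.encode("shift_jis")), len(msg.encode("utf-8")))
```

For this five-character string the custom table needs 5 bytes, against 10
in Shift JIS and 15 in UTF-8, which hints at why a kana-only game on a
constrained console might go this way.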

I think one resounding conclusion of this thread, though, is that we all
owe Rob and Ken (and colleagues) a great deal for nailing this matter down
in such a well-engineered way.  Long live UTF-8!

- Matt G.

------- Original Message -------
On Wednesday, March 22nd, 2023 at 3:33 PM, Steffen Nurpmeso <steffen@sdaoden.eu> wrote:


> Rob Pike wrote in
> CAKzdPgwYPxK9oYemG5-vPgRR7mSfj_qkjD5-iJnLffP-23PUaQ@mail.gmail.com:
>
> |The appendix version named it plain UTF, repurposing the extant name to the
> |new encoding. The -8 came later, as it is in these linked documents,
> |because some people wanted a UTF-7 and a UTF-16. Those people should be
> |punished.
>
> I agree, but please with a but.
>
> For one especially so since UTF-7 (that i like) then didn't make
> it all through, but only here and there.
> Ie, if it would have been used for anything mail and DNS related
> to keep 7-bit compat. Instead they introduced monstrosities like
> IDNA for DNS, mUTF-7 (locale charset -> UTF-16BE -> mUTF-7) etc.
>
> That i hated: IDNA. If they would have said we give up on
> backward compatibility around Y2K, and the old stuff grows out;
> and 255 bytes UTF-8 is surely enough for domain names for some
> time (even percent encoded) even for those encodings which need
> four byte for one codepoint, and it simply does not work before.
> Like so they introduced those backward incompatibilities that they
> wanted to avoid.
>
> I did oppose strongly in the past, but UTF-16 has merits for some
> languages as well as for coding, even though you have to be able
> to deal with surrogates, .. and with grapheme boundaries, if you
> are doing it right, so 1:many is there anyhow. I mean, wchar_t is
> often 32-bit, and then not even UTF-32, at least possibly. But
> still you have the 1:many, so it buys you nothing.
> All-UTF-8 is of course great imho. (Asian people may disagree.)
>
> --steffen
> |
> |Der Kragenbaer, The moon bear,
> |der holt sich munter he cheerfully and one by one
> |einen nach dem anderen runter wa.ks himself off
> |(By Robert Gernhardt)