From: segaloco via TUHS
Reply-To: segaloco
To: Steffen Nurpmeso
CC: tuhs@tuhs.org
Date: Wed, 22 Mar 2023 23:33:46 +0000
Subject: [TUHS] Re: Bell Foreign-Language UNIX Efforts
List-Id: The Unix Heritage Society mailing list

I've often pondered the storage differential that non-ASCII languages rack up.

Let's say one primarily stores documents in Japanese. This puts you in the 2-bytes-per-character range (more in UTF-8, where kana and most kanji take 3 bytes each). Going simply by character count, the same number of characters takes up at least twice as many bytes. Of course, Japanese isn't the best example of this being a problem: like many other non-phonetic scripts (and even with kana syllables), it takes fewer characters to convey a thought, cutting the character count for a complete sentence, even katakana/Hepburn stuff, at least in half. All in all, it may break even or even come out ahead given multi-syllable kanji.
A better example of scripts that would likely suffer data bloat would be Hebrew or Arabic, although, being abjads with diacritics to represent vowel sounds, you likewise land somewhere like Japanese kana, where a single glyph represents what in the Latin alphabet would be at least two letters. I would imagine Cyrillic users, for instance, do actually have to take the storage hit, since their entire script is outside ASCII *and* it is a full alphabet, neither an abjad nor logographic. Can't say I've worked with much Cyrillic text though. That's not even to mention scripts where a diacritic may be represented by a separate code point that has to be combined with another.

This is, of course, way off list, so I don't want to start a whole side-chain on it, but linguistic storage in computers has interested me for a long time, especially in my reverse engineering research of old games, looking at how different studios implemented various code pages for non-ASCII scripts. For example, I've seen plenty of older (8/16-bit) Japanese games that obviously don't use UTF-8, due to overhead in constrained console environments (or even being older than UTF-8), but also don't use Shift JIS or other known encodings, instead opting for their own custom code page to map bytes, usually to kana, although I haven't really peeked into any engines that use kanji. Kanji was uncommon in these games, as video games were typically marketed towards children who weren't expected to know enough kanji to read complicated text. You see the same today with text in children's media in Japan, where the hiragana reading for a given kanji is displayed adjacent to it (furigana).

I think one resounding conclusion of this thread, though, is that we all owe Rob and Ken (and colleagues) a great deal for nailing this matter down in such a well-engineered way. Long live UTF-8!

- Matt G.
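
P.S. A rough way to put numbers on the storage differential above is to encode a short word in each script and count bytes. This is only a minimal sketch in Python: the sample words (all meaning "water") and the `byte_report` helper are my own illustrative choices, not measurements from any real corpus, and a single word says nothing about how many characters a whole sentence needs.

```python
# Hypothetical helper: compare character count with encoded size.
def byte_report(text, encodings=("utf-8", "utf-16-le", "shift_jis")):
    report = {"chars": len(text)}
    for enc in encodings:
        try:
            report[enc] = len(text.encode(enc))
        except UnicodeEncodeError:
            # The script has no representation in this encoding.
            report[enc] = None
    return report

# Illustrative samples only; each is the word "water".
samples = {
    "English": "water",
    "Japanese (kanji)": "水",
    "Japanese (kana)": "みず",
    "Russian": "вода",
    "Hebrew": "מים",
}

for name, text in samples.items():
    print(name, byte_report(text))
```

For these samples the single kanji is 3 bytes in UTF-8 against 5 for the English word, while the Cyrillic and Hebrew words cost 2 bytes per letter, which is the storage hit described above.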
------- Original Message -------
On Wednesday, March 22nd, 2023 at 3:33 PM, Steffen Nurpmeso wrote:

> Rob Pike wrote in
> CAKzdPgwYPxK9oYemG5-vPgRR7mSfj_qkjD5-iJnLffP-23PUaQ@mail.gmail.com:
>
> |The appendix version named it plain UTF, repurposing the extant name to the
> |new encoding. The -8 came later, as it is in these linked documents,
> |because some people wanted a UTF-7 and a UTF-16. Those people should be
> |punished.
>
> I agree, but please with a but.
>
> For one especially so since UTF-7 (that i like) then didn't make
> it all through, but only here and there.
> Ie, if it would have been used for anything mail and DNS related
> to keep 7-bit compat. Instead they introduced monstrosities like
> IDNA for DNS, mUTF-7 (locale charset -> UTF-16BE -> mUTF-7) etc.
>
>
> That i hated: IDNA. If they would have said we give up on
> backward compatibility around Y2K, and the old stuff grows out;
> and 255 bytes UTF-8 is surely enough for domain names for some
> time (even percent encoded) even for those encodings which need
> four byte for one codepoint, and it simply does not work before.
> Like so they introduced those backward incompatibilities that they
> wanted to avoid.
>
> I did oppose strongly in the past, but UTF-16 has merits for some
> languages as well as for coding, even though you have to be able
> to deal with surrogates, .. and with grapheme boundaries, if you
> are doing it right, so 1:many is there anyhow. I mean, wchar_t is
> often 32-bit, and then not even UTF-32, at least possibly. But
> still you have the 1:many, so it buys you nothing.
> All-UTF-8 is of course great imho. (Asian people may disagree.)
>
> --steffen
> |
> |Der Kragenbaer,                The moon bear,
> |der holt sich munter           he cheerfully and one by one
> |einen nach dem anderen runter  wa.ks himself off
> |(By Robert Gernhardt)
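
The "1:many is there anyhow" point in the quoted message can be seen by counting code units per encoding form. A minimal sketch in Python, assuming nothing beyond the standard codecs; `code_units` is a hypothetical helper name used only for illustration.

```python
# Hypothetical helper: count code points and code units in each encoding form.
def code_units(text):
    return {
        "code points": len(text),
        "utf-8 bytes": len(text.encode("utf-8")),
        "utf-16 units": len(text.encode("utf-16-le")) // 2,
        "utf-32 units": len(text.encode("utf-32-le")) // 4,
    }

# Plain ASCII: one unit everywhere.
print(code_units("a"))

# A code point outside the Basic Multilingual Plane:
# UTF-16 needs a surrogate pair, UTF-8 needs four bytes.
print(code_units("\U0001F600"))

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: a single grapheme on
# screen, yet two code points even in UTF-32.
print(code_units("e\u0301"))
```

The last case is the one that matters for the argument: even a fixed-width 32-bit encoding leaves one visible character mapped to several units, so grapheme handling is needed whichever form is chosen.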