The Unix Heritage Society mailing list
From: segaloco via TUHS <tuhs@tuhs.org>
To: Steffen Nurpmeso <steffen@sdaoden.eu>
Cc: tuhs@tuhs.org
Subject: [TUHS] Re: Bell Foreign-Language UNIX Efforts
Date: Wed, 22 Mar 2023 23:33:46 +0000
Message-ID: <6CpABuCD3_HRWqKPC0cMvF19rgKZ0dztDXADlAsHnwpanC5-1pKJB7lscbhJOwKtjLLjypoW7e-Gp51Mc4UKizn1pGl4FHcgu2KjPhpJ2k0=@protonmail.com>
In-Reply-To: <20230322223307.S67m0%steffen@sdaoden.eu>

I've often pondered the storage differential that non-ASCII scripts rack up.

Let's say one primarily stores documents in Japanese.  This puts you in the 2-bytes-per-character range with encodings like Shift JIS or EUC-JP, and at 3 bytes per character for kana and kanji in UTF-8.  If you go simply by character count, the same number of characters takes up two to three times the number of bytes.  Of course, Japanese isn't the greatest case for this being a problem: like many other non-phonetic scripts (and even with kana syllables), it takes fewer characters to convey a thought, cutting the character count for a complete sentence, even katakana/Hepburn stuff, by at least half.  All in all, it may break even or come out ahead given multi-syllable kanji.  Better examples of scripts likely to suffer data bloat would be Hebrew or Arabic, although being abjads with diacritics to represent vowel sounds, they likewise land somewhere near Japanese kana, where a single glyph represents what would be at least two letters in the Latin alphabet.  I would imagine Cyrillic users, for instance, do actually have to take the storage hit, since their entire script is outside ASCII *and* it is a full alphabet, neither an abjad nor logographic.  Can't say I've worked with much Cyrillic text though.  That's not even to mention scripts where a diacritic may be represented by a separate code point that must be combined with a base character.
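
As a rough illustration of the byte counts discussed above, here is a minimal Python sketch.  The sample words and the helper name encoded_sizes are assumptions made up for this example, not anything from the thread; the codec names are the ones Python's standard library ships.

```python
import unicodedata

# Hypothetical sample words (all roughly "cat"), chosen only for illustration.
SAMPLES = {
    "English": "cat",
    "Japanese": "猫",
    "Russian": "кошка",
    "Hebrew": "חתול",
}


def encoded_sizes(text, encodings=("utf-8", "utf-16-le", "shift_jis")):
    """Return {encoding: byte length}, or None where the text cannot be encoded."""
    sizes = {}
    for enc in encodings:
        try:
            sizes[enc] = len(text.encode(enc))
        except UnicodeEncodeError:
            sizes[enc] = None  # this encoding has no mapping for the script
    return sizes


# A precomposed letter versus a base letter plus a combining diacritic.
composed = "\u00e9"                                  # "é" as one code point
decomposed = unicodedata.normalize("NFD", composed)  # "e" + U+0301

if __name__ == "__main__":
    for language, word in SAMPLES.items():
        print(language, len(word), "chars", encoded_sizes(word))
    print("composed:", len(composed.encode("utf-8")), "bytes in UTF-8")
    print("decomposed:", len(decomposed.encode("utf-8")), "bytes in UTF-8")
```

The single kanji costs 3 bytes in UTF-8 and 2 in Shift JIS while standing in for a three-letter English word, whereas the Cyrillic and Hebrew words pay 2 bytes for every letter in UTF-8.  The decomposed "é" shows the combining-diacritic case: two code points and 3 bytes instead of one code point and 2 bytes.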

This is, of course, way off topic for the list, so I don't want to start a whole side-chain on it, but linguistic storage in computers has interested me for a long time, especially in my reverse engineering research of old games, looking at how different studios implemented various code pages for non-ASCII scripts.  For example, I've seen plenty of older (8/16-bit) Japanese games that obviously don't use UTF-8, due to overhead in constrained console environments (or simply because they predate UTF-8), but also don't use Shift JIS or other known encodings, instead opting for their own custom code page to map bytes, usually to kana, although I haven't really peeked into any engines that use kanji.  Kanji was uncommon, as video games were typically marketed towards children who weren't expected to know enough kanji to read complicated text.  You see the same today in text associated with children's media in Japan, where the hiragana reading for a given kanji is displayed adjacent to it (furigana).
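
To make the custom code page idea concrete, here is a minimal Python sketch of a one-byte-per-kana text decoder.  Everything in it is hypothetical: the table layout, the 0xFF terminator, and the name decode_custom are invented for illustration and do not come from any particular game.

```python
# Hypothetical code page: byte value N maps to the Nth kana in this string.
# Real games used their own layouts; this one is invented for illustration.
KANA_TABLE = dict(enumerate("あいうえおかきくけこ"))

TERMINATOR = 0xFF  # assumed end-of-string marker


def decode_custom(data, table=KANA_TABLE, terminator=TERMINATOR, unknown="?"):
    """Decode bytes through a one-byte-per-glyph table, stopping at the terminator."""
    out = []
    for byte in data:
        if byte == terminator:
            break
        out.append(table.get(byte, unknown))  # unmapped bytes become "?"
    return "".join(out)


if __name__ == "__main__":
    raw = bytes([5, 4, 0xFF])  # two glyph bytes plus the terminator
    text = decode_custom(raw)
    print(text)
    print(len(raw) - 1, "bytes in the custom code page")
    print(len(text.encode("utf-8")), "bytes in UTF-8")
```

With a table like this each kana costs one byte instead of the three it takes in UTF-8 or the two in Shift JIS, which is plausibly part of the appeal on a cartridge with very little ROM.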

I think one resounding conclusion of this thread, though, is that we all owe Rob and Ken (and colleagues) a great deal for nailing this matter down in such a well-engineered way.  Long live UTF-8!

- Matt G.

------- Original Message -------
On Wednesday, March 22nd, 2023 at 3:33 PM, Steffen Nurpmeso <steffen@sdaoden.eu> wrote:


> Rob Pike wrote in
> CAKzdPgwYPxK9oYemG5-vPgRR7mSfj_qkjD5-iJnLffP-23PUaQ@mail.gmail.com:
> 
> |The appendix version named it plain UTF, repurposing the extant name to the
> |new encoding. The -8 came later, as it is in these linked documents,
> |because some people wanted a UTF-7 and a UTF-16. Those people should be
> |punished.
> 
> I agree, but please with a but.
> 
> For one especially so since UTF-7 (that i like) then didn't make
> it all through, but only here and there.
> Ie, if it would have been used for anything mail and DNS related
> to keep 7-bit compat. Instead they introduced monstrosities like
> IDNA for DNS, mUTF-7 (locale charset -> UTF-16BE -> mUTF-7) etc.
> 
> 
> That i hated: IDNA. If they would have said we give up on
> backward compatibility around Y2K, and the old stuff grows out;
> and 255 bytes UTF-8 is surely enough for domain names for some
> time (even percent encoded) even for those encodings which need
> four byte for one codepoint, and it simply does not work before.
> Like so they introduced those backward incompatibilities that they
> wanted to avoid.
> 
> I did oppose strongly in the past, but UTF-16 has merits for some
> languages as well as for coding, even though you have to be able
> to deal with surrogates, .. and with grapheme boundaries, if you
> are doing it right, so 1:many is there anyhow. I mean, wchar_t is
> often 32-bit, and then not even UTF-32, at least possibly. But
> still you have the 1:many, so it buys you nothing.
> All-UTF-8 is of course great imho. (Asian people may disagree.)
> 
> --steffen
> |
> |Der Kragenbaer, The moon bear,
> |der holt sich munter he cheerfully and one by one
> |einen nach dem anderen runter wa.ks himself off
> |(By Robert Gernhardt)
