The Unix Heritage Society mailing list
From: Diomidis Spinellis <dds@aueb.gr>
To: The Eunuchs Hysterical Society <tuhs@tuhs.org>
Cc: segaloco <segaloco@protonmail.com>
Subject: [TUHS] Re: Bell Foreign-Language UNIX Efforts
Date: Sun, 19 Mar 2023 15:32:23 +0200
Message-ID: <d90865c0-7c1c-c726-83c2-a7114e31bf19@aueb.gr>
In-Reply-To: <Y8sUnihzhzTBOuKMJUnuV0DUZEqHb223xyoxXTmq-eMAe4HFZLgce38hxypW1K9UozOjAJxyXIpwzsWCfnZCRXTXictF--9hPEM__lviJ9A=@protonmail.com>

On 19-Mar-23 7:00, segaloco via TUHS wrote:
> Good evening or whichever time of day you find yourself in.  I was reading up on Japanese computer history when I got to thinking specifically on where UNIX plays in with it all, which then led to some further curiosity with non-English UNIX in general.
> 
> In the midst of documentation searches/study, I've spotted French and what I believe to be Japanese documentation bearing Bell/AT&T logos.  I've also seen a few things pop up in German although they looked to be university resources, not something from the Bell System.  In any case, is there any clear historical record on efforts within the USG/USL line, or research for that matter, towards the end of foreign language support or perhaps even single polyglot installations?  Would BSD have been more poised for this sort of thing being more widely utilized in the academic scene?

I think the most significant development that came out of Unix regarding
internationalization was the proposal and adoption of Unicode and UTF-8.
This work was published in 1993 in the USENIX Technical Conference
proceedings:

Pike, Rob, and Ken Thompson. "Hello World or Καλημέρα κόσμε or こんにちは世界." 
Proceedings of the Winter 1993 USENIX Conference. 1993.

At the time of the decision to adopt Unicode and UTF-8 in Unix (Plan 9
actually) there was no consensus on international character
representations and encodings.  Many systems extended ASCII with 8-bit
characters to represent those required in a particular country.  These
"code pages" were standardized in numerous mutually incompatible
ISO-8859-X variants.  My understanding is that for many Asian (Chinese,
Japanese, and Korean) languages the situation was even worse, with ISO
2022 being used to shift mid-string from one character set encoding to
another.
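
As a small illustration of that incompatibility, here is a minimal
Python sketch (not from the paper; the choice of byte value and of the
three ISO 8859 parts is mine).  The same single byte decodes to a
different character under each code page, so a file is unreadable
unless the reader already knows which variant was used:

```python
# One byte value, three different characters, depending on which
# ISO 8859 part the reader assumes.
raw = bytes([0xE1])

latin1 = raw.decode("iso8859_1")    # ISO 8859-1, Western European
greek = raw.decode("iso8859_7")     # ISO 8859-7, Greek
cyrillic = raw.decode("iso8859_5")  # ISO 8859-5, Cyrillic

for label, ch in (("8859-1", latin1), ("8859-7", greek), ("8859-5", cyrillic)):
    print(f"{label}: {ch!r} U+{ord(ch):04X}")
```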

In addition, Unicode was still a draft standard for unified 16-bit
character codes, promoted by a group of US companies.  It was battling
against the ISO 10646 draft, which had taken the approach of allocating
character set blocks to national bodies, thus creating a sparse 32-bit
representation with considerable redundancy between similar languages.
Furthermore, the ISO 10646 standard proposed a (non-required) UTF
multibyte encoding (now known as UTF-1), which was not
self-synchronizing, because bytes used for representing ASCII
characters were also employed as parts of multibyte sequences.

The Bell Labs team took the bold approach of adopting the draft Unicode
standard and an X/Open proposal for encoding multibyte characters using
only bytes with the top bit set.  At the time the encoding was known as
UTF-2; it is what we now call UTF-8.  UTF-8 makes it easier to achieve
backward compatibility in existing code; for example, code scanning a
string for the "/" file path separator will never encounter that byte
inside the UTF-8 representation of a non-ASCII character.
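
The "/" example can be tried out with a minimal Python sketch.  This is
not taken from the Plan 9 sources: the helper find_separator and the
sample file name are made up for illustration, and big-endian UTF-16
stands in as the contrasting encoding simply because Python ships a
codec for it (UTF-1 itself is not shown).

```python
def find_separator(path_bytes: bytes) -> list:
    """Return the offset of every 0x2F ("/") byte, the way legacy
    byte-oriented path-handling code would scan a string."""
    return [i for i, b in enumerate(path_bytes) if b == 0x2F]


# Hypothetical file name with exactly one real "/" separator.
name = "Я/файл"

utf8 = name.encode("utf-8")
utf16 = name.encode("utf-16-be")

# UTF-8: every byte of a non-ASCII character has the top bit set,
# so the only 0x2F byte is the real separator.
print(find_separator(utf8))   # [2]

# UTF-16BE: "Я" is U+042F, encoded as the bytes 04 2F, so a byte-wise
# scan reports a spurious "/" at offset 1 before the real one.
print(find_separator(utf16))  # [1, 3]
```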

The Plan 9 choices proved wise and prescient.  I do not know how much 
the Plan 9 implementation and the USENIX paper influenced further 
developments (its authors may enlighten us), but in the end Unicode
converged with ISO 10646, becoming a single standard, and UTF-8 was
widely adopted.

The Plan 9 team's decision to adopt UTF-8 was by no means a given.
Consider the case of Microsoft, which released Windows NT with Unicode
support in the same year.  The 1993 release of Windows NT supported a
wide character encoding, not UTF-8: initially UCS-2 and later UTF-16.
To achieve backward compatibility the Windows API offers two functions
for each call involving strings: a so-called "ANSI" version (actually
using the currently active code page) and a "Wide" (Unicode) version.
Furthermore, text files use a byte order mark (BOM) to inform programs
regarding their character representation, and in C/C++ code string
literals are often enclosed in a special macro to facilitate porting to
wide characters.  In 2019 Microsoft finally yielded, supporting UTF-8
in its Windows API through code page 65001 (CP_UTF8), and recommending
its use.  The double APIs and BOM files are still with us as a reminder
that deficient technical decisions come at a cost.
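
To show what the byte order mark convention asks of every program that
reads text, here is a minimal Python sketch.  The helper sniff_bom is
hypothetical (it is neither a Windows nor a Python API) and it assumes
only the three BOMs listed; a UTF-32 little-endian file, whose BOM
starts with the same two bytes as UTF-16 little-endian, would be
misreported.

```python
import codecs


def sniff_bom(data: bytes) -> str:
    """Guess a text encoding from a leading byte order mark.

    Returns a Python codec name, or "unknown" when no BOM is present
    (in which case the reader is back to guessing the code page).
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return "unknown"


samples = [
    "hello".encode("utf-8-sig"),                       # UTF-8 with BOM
    codecs.BOM_UTF16_LE + "hello".encode("utf-16-le"),  # UTF-16 LE with BOM
    "hello".encode("utf-8"),                           # plain UTF-8, no BOM
]
for sample in samples:
    print(sniff_bom(sample))
```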


Diomidis - https://www.spinellis.gr
