From: Diomidis Spinellis <dds@aueb.gr>
To: The Eunuchs Hysterical Society <tuhs@tuhs.org>
Cc: segaloco <segaloco@protonmail.com>
Subject: [TUHS] Re: Bell Foreign-Language UNIX Efforts
Date: Sun, 19 Mar 2023 15:32:23 +0200
Message-ID: <d90865c0-7c1c-c726-83c2-a7114e31bf19@aueb.gr>
In-Reply-To: <Y8sUnihzhzTBOuKMJUnuV0DUZEqHb223xyoxXTmq-eMAe4HFZLgce38hxypW1K9UozOjAJxyXIpwzsWCfnZCRXTXictF--9hPEM__lviJ9A=@protonmail.com>
On 19-Mar-23 7:00, segaloco via TUHS wrote:
> Good evening or whichever time of day you find yourself in. I was reading up on Japanese computer history when I got to thinking specifically on where UNIX plays in with it all, which then lead to some further curiosity with non-English UNIX in general.
>
> In the midst of documentation searches/study, I've spotted French and what I believe to be Japanese documentation bearing Bell/AT&T logos. I've also seen a few things pop up in German although they looked to be university resources, not something from the Bell System. In any case, is there any clear historical record on efforts within the USG/USL line, or research for that matter, towards the end of foreign language support or perhaps even single polyglot installations? Would BSD have been more poised for this sort of thing being more widely utilized in the academic scene?
I think the most significant development that came out of Unix regarding
internationalization was the proposal and adoption of Unicode and UTF-8.
This was published in 1993 in the USENIX Technical Conference proceedings:
Pike, Rob, and Ken Thompson. "Hello World or Καλημέρα κόσμε or こんにちは世界."
Proceedings of the Winter 1993 USENIX Conference. 1993.
At the time of the decision to adopt Unicode and UTF-8 in Unix (Plan 9
actually) there was no consensus on international character
representations and encodings. Many systems extended ASCII with 8-bit
characters to represent those required in a particular country. These
"code pages" were standardized in numerous mutually incompatible
ISO-8859-X variants. My understanding is that for many Asian (Chinese,
Japanese, and Korean) languages the situation was even worse, with ISO
2022 being used to shift mid-string from one character set encoding to
another.
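To make the incompatibility concrete, here is a minimal sketch using
Python's standard codecs (my choice of language and sample bytes, not
something from the systems of the time). It shows one byte decoding to
two different letters under two ISO 8859 parts, and the escape sequences
that the ISO-2022-JP profile of ISO 2022 embeds in a string to switch
character sets.

```python
# The same byte value means different things under different ISO 8859 parts.
raw = bytes([0xE1])
as_latin1 = raw.decode("iso8859_1")  # Western European part
as_greek = raw.decode("iso8859_7")   # Greek part

# ISO-2022-JP switches character sets with escape sequences placed
# inside the byte stream itself.
ESC_TO_JIS = b"\x1b$B"    # switch to JIS X 0208
ESC_TO_ASCII = b"\x1b(B"  # switch back to ASCII
mixed = "abcこんにちは".encode("iso2022_jp")

print(repr(as_latin1), repr(as_greek))
print(mixed)
```

Note that every byte of the ISO-2022-JP output is below 0x80, so the
Japanese text is carried by bytes that look like ordinary ASCII unless
the reader tracks the shift state.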
In addition, Unicode was a draft standard for unified 16-bit character
codes promoted by a group of US companies. It was battling against the
ISO 10646 draft, which had taken the approach of allocating character
set blocks to national bodies, thus creating a sparse 32-bit
representation with considerable redundancy between similar languages.
Furthermore, the ISO 10646 standard proposed a (non-required) UTF
multibyte encoding (now known as UTF-1), which was not
self-synchronizing, because bytes used for representing ASCII characters
were also employed as parts of multibyte sequences.
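Python's standard library has no UTF-1 codec, so the sketch below uses
Shift-JIS as a stand-in: a different legacy encoding that shares the
flaw of reusing ASCII byte values inside multibyte sequences. The helper
name naive_find is hypothetical. It illustrates how a byte-level search
for an ASCII character can produce a false match in the middle of
another character.

```python
def naive_find(data: bytes, ch: str) -> int:
    """Byte-level search for an ASCII character, as old C code would do."""
    return data.find(ch.encode("ascii"))

text = "表"  # U+8868, a common Japanese character

# In Shift-JIS (stand-in for UTF-1) the second byte is 0x5C,
# which is also the ASCII code for backslash.
sjis = text.encode("shift_jis")
false_hit = naive_find(sjis, "\\")

# In UTF-8 every byte of the same character has the top bit set,
# so the search finds nothing.
utf8 = text.encode("utf-8")
no_hit = naive_find(utf8, "\\")

print(sjis, false_hit, utf8, no_hit)
```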
The Bell Labs team took the bold approach of adopting the draft Unicode
standard and an X/Open proposal for encoding multibyte characters using
only bytes with the top bit set. At the time the encoding was known as
UTF-2; it is what we now call UTF-8. UTF-8 makes it easier to achieve
backward compatibility in existing code; for example, code scanning a
string for the "/" file path separation character will never encounter
it in the UTF-8 representation of non-ASCII characters.
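The path-separator property can be checked with a short sketch. The
function names and the sample path are hypothetical, and Python stands
in for the C code of the period; the point is only that splitting on the
"/" byte works on UTF-8 data without decoding it first.

```python
def split_path(path: bytes) -> list[bytes]:
    """Split on the '/' byte without knowing anything about the encoding."""
    return path.split(b"/")

def multibyte_bytes_have_top_bit(s: str) -> bool:
    """True if every byte of every non-ASCII character is >= 0x80 in UTF-8."""
    return all(
        b >= 0x80
        for ch in s
        if ord(ch) > 0x7F
        for b in ch.encode("utf-8")
    )

path = "usr/Καλημέρα/こんにちは世界.txt".encode("utf-8")
names = [part.decode("utf-8") for part in split_path(path)]

print(names)
print(multibyte_bytes_have_top_bit("Καλημέρα κόσμε こんにちは世界"))
```

Because no byte below 0x80 ever appears inside a multibyte sequence,
each component decodes cleanly after the byte-level split.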
The Plan 9 choices proved wise and prescient. I do not know how much
the Plan 9 implementation and the USENIX paper influenced further
developments (its authors may enlighten us), but in the end Unicode
and ISO 10646 converged into a single standard, and UTF-8 was
widely adopted.
The Plan 9 team's decision to adopt UTF-8 was by no means a given.
Consider the case of Microsoft, which released Windows NT with Unicode
support in the same year. Microsoft's Windows NT 1993 offering
supported a wide character encoding, not UTF-8: initially UCS-2 and
later UTF-16. To achieve backward compatibility the Windows API offers
two functions for each call involving strings: a so-called "ANSI"
version (actually using the currently active code page) and a "Wide"
(Unicode) version. Furthermore, text files use a byte order mark to
inform programs regarding their character representation, and in C/C++
code strings are often enclosed in a special macro to facilitate porting
to wide characters. In the end, in 2019 Microsoft yielded, supporting
UTF-8 in its Windows API through code page 65001 (CP_UTF8), and
recommending its use. The double APIs and BOM files are still with us
as a reminder that deficient technical decisions come at a cost.
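As a rough illustration of the cost, the sketch below (Python, with a
hypothetical helper sniff_bom) shows the kind of byte order mark check a
program needs before it can read a text file, and why a 16-bit encoding
cannot pass through byte-oriented string code: even plain ASCII text
acquires zero bytes. It deliberately ignores UTF-32, whose little-endian
mark begins with the same two bytes as the UTF-16 one.

```python
import codecs

def sniff_bom(data: bytes) -> str:
    """Guess an encoding from a leading byte order mark (UTF-32 ignored)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return "unknown"

# The same ASCII-only text in a wide encoding and in UTF-8.
wide = codecs.BOM_UTF16_LE + "a/b".encode("utf-16-le")
narrow = "a/b".encode("utf-8")

print(sniff_bom(wide), wide)
print(sniff_bom(narrow), narrow)
```

The UTF-8 form is byte-for-byte identical to ASCII and needs no mark,
whereas the wide form contains zero bytes that would terminate a C
string early.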
Diomidis - https://www.spinellis.gr