From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from scc-mailout.scc.kit.edu (scc-mailout.scc.kit.edu [129.13.185.202])
	by krisdoz.my.domain (8.14.3/8.14.3) with ESMTP id p2HJ1vuB027886
	for <discuss@mdocml.bsd.lv>; Thu, 17 Mar 2011 15:02:02 -0400 (EDT)
Received: from hekate.usta.de (asta-nat.asta.uni-karlsruhe.de [172.22.63.82])
	by scc-mailout-02.scc.kit.edu with esmtp (Exim 4.72 #1)
	id 1Q0ISF-0001Gc-5P; Thu, 17 Mar 2011 20:01:54 +0100
Received: from donnerwolke.usta.de ([172.24.96.3])
	by hekate.usta.de with esmtp (Exim 4.72)
	(envelope-from <schwarze@usta.de>)
	id 1Q0ISF-0003mF-59; Thu, 17 Mar 2011 20:01:51 +0100
Received: from iris.usta.de ([172.24.96.5] helo=usta.de)
	by donnerwolke.usta.de with esmtp (Exim 4.69)
	(envelope-from <schwarze@usta.de>)
	id 1Q0ISF-0000gg-3i; Thu, 17 Mar 2011 20:01:51 +0100
Received: from schwarze by usta.de with local (Exim 4.72)
	(envelope-from <schwarze@usta.de>)
	id 1Q0ISE-0008Ch-SC; Thu, 17 Mar 2011 20:01:50 +0100
Date: Thu, 17 Mar 2011 20:01:50 +0100
From: Ingo Schwarze <schwarze@usta.de>
To: stsp@openbsd.org
Cc: discuss@mdocml.bsd.lv
Subject: distinct character, same glyph in Unicode
Message-ID: <20110317190150.GA29893@iris.usta.de>
References: <20110316211219.GA15999@harkle.bramka>
 <20110316222835.GB15999@harkle.bramka>
 <20110316223702.GB12736@iris.usta.de>
 <20110317000813.GC15999@harkle.bramka>
 <20110317012423.GE12736@iris.usta.de>
X-Mailinglist: mdocml-discuss
Reply-To: discuss@mdocml.bsd.lv
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20110317012423.GE12736@iris.usta.de>
User-Agent: Mutt/1.5.21 (2010-09-15)

Hi Stephan,

i'm looking for information about issues with distinct characters
having the same glyphs in their standard representation, in particular
when there are modes of alternative representation where they need to
be represented by different glyphs or different transliterations.
As you are working on Unicode character sets, do you have any
pointers?

In particular,

  http://std.dkuug.dk/CEN/tc304/guide/gucsch06.htm

clearly states that "distinct characters may have the same glyph".
That guide also cites the ISO/IEC 10646-1:1993 definition of
a "character":

  A member of a set of elements used for the organisation, control,
  or representation of data.

That definition is further explained:

  One cannot just look at the characters, in a visual representation,
  since distinct characters may have the same glyph.
  The question should be whether they are both used in the same way
  in the organisation, control or representation of data.

Let's take a very simple example: the characters represented by
glyphs with two dots placed above a vowel.

In French, the two dots are called "tr'ema".  What they *represent* -
see the "representation of data" in the above explanation - is that
the vowel with the diacritic is to be pronounced *separately from
the preceding vowel*, but the vowel itself is unchanged.
For example, consider les mots "ast'ero:ide" and "Citro:en".

In German, the two dots are called "umlaut".  What they represent
is that the pronunciation of the vowel with the diacritic is
*modified*; there is no special relationship to an adjacent vowel,
if any; consider die Worte "H:auser", "B:oen", "ge:offnet".

Thus, the French ":o" (o diaeresis) - note that the Grand Robert
explicitly lists this one as occurring in natively French words -
is clearly not the same character as the German ":o" (o umlaut).
Due to the limited number of characters in ISO-latin-1, i'm not
surprised that this distinction was blurred - however, do you know
why there is apparently no distinction in Unicode either?
All sources i can find recommend using the "diaeresis" class of
characters for German umlauts.
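
To make the lack of distinction concrete, here is a small sketch using
Python's standard unicodedata module (chosen only for illustration; it
has nothing to do with mandoc).  It shows that the precomposed
character U+00F6 has a single name and a single canonical
decomposition, whatever language the surrounding text is written in.
The variable names are arbitrary.

```python
import unicodedata

# U+00F6, written as an escape so that this message stays pure ASCII.
o_dots = "\u00f6"

# The only name Unicode assigns to this code point.
name = unicodedata.name(o_dots)

# Canonical decomposition (NFD): base letter plus combining mark.
decomposed = unicodedata.normalize("NFD", o_dots)
parts = [unicodedata.name(c) for c in decomposed]

print(name)
print(parts)
```

Both the French and the German use end up at the same base letter
followed by the same combining mark; nothing in the code points
records which of the two was meant.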

The resulting problem is that in ASCII transliteration - or in any
font lacking glyphs representing diacritics - French and German ":o"
require different transliteration, which i regard as another clear
indication that they are different characters.  French ":o" must
be transliterated to "o" ("asteroide" is acceptable, whereas
"asteroiede" would be absurdly wrong), while German ":o" must be
transliterated as "oe" ("geoeffnet" is correct, while "geoffnet"
would just be plain wrong).
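
A minimal sketch of what this implies for a transliterator: the
mapping has to be selected by language, because the code point alone
does not decide it.  The table layout, the language tags "fr" and "de"
and the function name to_ascii are assumptions made up for this
example, and the tables only cover the handful of letters needed here.

```python
# Hypothetical per-language tables, keyed by an assumed language tag.
# Non-ASCII letters are written as escapes to keep the message ASCII.
ASCII_TABLE = {
    "fr": {
        "\u00e9": "e",   # e acute
        "\u00ef": "i",   # i tr'ema: the vowel itself is unchanged
        "\u00f6": "o",   # o tr'ema
    },
    "de": {
        "\u00e4": "ae",  # a umlaut: the vowel is modified
        "\u00f6": "oe",  # o umlaut
    },
}


def to_ascii(text, lang):
    """Transliterate text to ASCII using the table selected by lang."""
    table = ASCII_TABLE[lang]
    return "".join(table.get(ch, ch) for ch in text)


print(to_ascii("ast\u00e9ro\u00efde", "fr"))  # asteroide
print(to_ascii("ge\u00f6ffnet", "de"))        # geoeffnet
print(to_ascii("ge\u00f6ffnet", "fr"))        # geoffnet, plain wrong
```

The last call shows the failure mode: with the wrong language
assumption, the same input code point yields the wrong output.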

Even though i don't know Chinese or Japanese, the situation over there
seems even worse, see

  http://unicode.org/faq/han_cjk.html#3

  "Q: If the character shapes are different in different parts of
      East Asia, why were the characters unified?

   A: The Unicode standard is designed to encode characters, not glyphs.
      Even where there are substantial variations in the standard way of
      writing a character from locale to locale, if the fundamental
      identity of the character is not in question, then a single
      character is encoded in Unicode."

But what is not answered is why the characters are considered the same.
Apparently, we have quite *different* characters used in quite different
ways to represent completely different data, in particular
 - different meaning
 - different pronunciation
 - used in completely different contexts
 - and even represented by slightly different glyphs
and still they are clobbered into the same Unicode character,
making it fundamentally impossible, if i understand correctly,
to generate a transliteration from a CJK Unicode text without
additional assumptions (e.g., that all characters contained
are intended to be Japanese and not Chinese characters).
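
As a rough sketch of the "additional assumptions" involved, consider a
lookup that refuses to answer unless the caller supplies a language.
Everything here is hypothetical: the table READINGS, the tags "zh" and
"ja", and the function romanize are invented for the example, and the
two ASCII readings for U+5C71 are commonly cited ones used only as an
illustration.

```python
# Hypothetical table: ASCII readings keyed by (code point, language).
# U+5C71 is a single unified code point shared by both languages.
READINGS = {
    (0x5C71, "zh"): "shan",
    (0x5C71, "ja"): "yama",
}


def romanize(ch, lang=None):
    """Return an ASCII reading for ch; a language tag is required."""
    if lang is None:
        raise ValueError(
            "U+%04X alone does not identify a reading" % ord(ch))
    return READINGS[(ord(ch), lang)]


print(romanize("\u5c71", "zh"))
print(romanize("\u5c71", "ja"))
```

Calling romanize("\u5c71") without a language raises ValueError, which
is the point: the information has to come from outside the character
stream.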

Isn't that a fundamental contradiction of the definition of
a "character" and a flaw in the design of Unicode?
Or, in case it is not and i'm just confused, what is the
standard way to represent a character such that it can be
transliterated, should the necessity arise?

This is relevant to mandoc because whatever solution to such problems
one might consider for mandoc_char(7) or in general roff(7) encodings
should at least be compatible with Unicode in the output, but it looks
like Unicode is an inadequate tool even for the simple task of
distinguishing different characters from each other...  :-(

Thanks,
  Ingo