From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from scc-mailout.scc.kit.edu (scc-mailout.scc.kit.edu [129.13.185.202])
	by krisdoz.my.domain (8.14.3/8.14.3) with ESMTP id p2HJ1vuB027886
	for ; Thu, 17 Mar 2011 15:02:02 -0400 (EDT)
Received: from hekate.usta.de (asta-nat.asta.uni-karlsruhe.de [172.22.63.82])
	by scc-mailout-02.scc.kit.edu with esmtp (Exim 4.72 #1)
	id 1Q0ISF-0001Gc-5P; Thu, 17 Mar 2011 20:01:54 +0100
Received: from donnerwolke.usta.de ([172.24.96.3])
	by hekate.usta.de with esmtp (Exim 4.72)
	(envelope-from )
	id 1Q0ISF-0003mF-59; Thu, 17 Mar 2011 20:01:51 +0100
Received: from iris.usta.de ([172.24.96.5] helo=usta.de)
	by donnerwolke.usta.de with esmtp (Exim 4.69)
	(envelope-from )
	id 1Q0ISF-0000gg-3i; Thu, 17 Mar 2011 20:01:51 +0100
Received: from schwarze by usta.de with local (Exim 4.72)
	(envelope-from )
	id 1Q0ISE-0008Ch-SC; Thu, 17 Mar 2011 20:01:50 +0100
Date: Thu, 17 Mar 2011 20:01:50 +0100
From: Ingo Schwarze
To: stsp@openbsd.org
Cc: discuss@mdocml.bsd.lv
Subject: distinct character, same glyph in Unicode
Message-ID: <20110317190150.GA29893@iris.usta.de>
References: <20110316211219.GA15999@harkle.bramka>
	<20110316222835.GB15999@harkle.bramka>
	<20110316223702.GB12736@iris.usta.de>
	<20110317000813.GC15999@harkle.bramka>
	<20110317012423.GE12736@iris.usta.de>
X-Mailinglist: mdocml-discuss
Reply-To: discuss@mdocml.bsd.lv
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20110317012423.GE12736@iris.usta.de>
User-Agent: Mutt/1.5.21 (2010-09-15)

Hi Stephan,

i'm looking for information about issues with distinct characters
having the same glyphs in their standard representation, in particular
when there are modes of alternative representation where they need
to be represented by different glyphs or different transliterations.
As you are working on Unicode character sets, do you have any pointers?

In particular, http://std.dkuug.dk/CEN/tc304/guide/gucsch06.htm
clearly states that "distinct characters may have the same glyph".
That guide also cites the ISO/IEC 10646-1:1993 definition of
a "character":

  A member of a set of elements used for the organisation,
  control, or representation of data.

That definition is further explained:

  One cannot just look at the characters, in a visual representation,
  since distinct characters may have the same glyph. The question
  should be whether they are both used in the same way in the
  organisation, control or representation of data.

Let's take a very simple example:
The characters represented by glyphs with two superimposed dots.

In French, the two dots are called "tr'ema". What they *represent* -
see the "representation of data" in the above explanation - is that
the vowel with the diacritic is to be pronounced *separately from
the preceding vowel*, but the vowel itself is unchanged. For example,
consider les mots "ast'ero:ide" and "Citro:en".

In German, the two dots are called "umlaut". What they represent is
that the pronunciation of the vowel with the diacritic is *modified*;
there is no special relationship to an adjacent vowel, if any;
consider die Worte "H:auser", "B:oen", "ge:offnet".

Thus, the French ":o" (o diaeresis) - note that the Grand Robert
explicitly lists this one as occurring in natively French words -
is clearly not the same character as the German ":o" (o umlaut).

Due to the limited number of characters in ISO-latin-1, i'm not
surprised that this distinction was blurred - however, do you know
why there is apparently no distinction in Unicode either? All
sources i can find recommend using the "diaeresis" class of characters
for German umlauts.

The resulting problem is that in ASCII transliteration - or in any
font lacking glyphs representing diacritics - French and German ":o"
require different transliteration, which i regard as another clear
indication that they are different characters.
French ":o" must be transliterated to "o" ("asteroide" is acceptable,
whereas "asteroiede" would be absurdly wrong), while German ":o" must
be transliterated as "oe" ("geoeffnet" is correct, while "geoffnet"
would just be plain wrong).

Even though i don't know Chinese or Japanese, the situation over
there seems even worse, see http://unicode.org/faq/han_cjk.html#3

  "Q: If the character shapes are different in different parts
  of East Asia, why were the characters unified?

  A: The Unicode standard is designed to encode characters, not
  glyphs. Even where there are substantial variations in the
  standard way of writing a character from locale to locale, if the
  fundamental identity of the character is not in question, then a
  single character is encoded in Unicode."

But what is not answered is why the characters are considered the
same. Apparently, we have quite *different* characters used in quite
different ways to represent completely different data, in particular

 - different meaning
 - different pronunciation
 - used in completely different contexts
 - and even represented by slightly different glyphs

and still they are clobbered into the same Unicode character, making
it fundamentally impossible, if i understand correctly, to generate
a transliteration from a CJK Unicode text without additional
assumptions (e.g., that all characters contained are intended to be
Japanese and not Chinese characters).

Isn't that a fundamental contradiction to the definition of a
"character" and a flaw in the design of Unicode?

Or in case it is not and i'm just confused, which is the standard
way to represent a character in a way that allows transliteration,
should the necessity arise?
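
To make the transliteration problem above concrete, here is a minimal
Python sketch. It is only an illustration under stated assumptions:
the function names, the "fr"/"de" language tags and the small German
table are hypothetical, and passing the language alongside the text is
just one conceivable workaround, not something the Unicode standard
prescribes. It shows that both languages share the single code point
U+00F6, so the correct ASCII form can only be chosen from information
outside the text itself.

```python
import unicodedata

# Hypothetical rule table for German: an umlaut becomes vowel + "e".
GERMAN_UMLAUTS = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
}

def strip_marks(text):
    """Decompose to NFD and drop all combining marks (tr'ema style)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def transliterate(text, lang):
    """ASCII-transliterate text; lang is out-of-band information
    (an assumption of this sketch), because the code points alone
    do not say whether the two dots are a diaeresis or an umlaut."""
    text = unicodedata.normalize("NFC", text)
    if lang == "de":
        text = "".join(GERMAN_UMLAUTS.get(c, c) for c in text)
    return strip_marks(text)

# Unicode offers one character for both uses:
print(unicodedata.name("\u00f6"))  # LATIN SMALL LETTER O WITH DIAERESIS

print(transliterate("ge\u00f6ffnet", "de"))        # geoeffnet
print(transliterate("ast\u00e9ro\u00efde", "fr"))  # asteroide
print(transliterate("Citro\u00ebn", "fr"))         # Citroen
```

The same input character yields "oe" with the "de" tag and "o" with
the "fr" tag; without the tag there is nothing in the encoded text to
decide between the two.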

This is relevant to mandoc because whatever solution to such problems
one might consider for mandoc_char(7) or in general roff(7) encodings
should at least be compatible with Unicode in the output, but it
looks like Unicode is an inadequate tool even for the simple task
of distinguishing different characters from each other... :-(

Thanks,
  Ingo