discuss@mandoc.bsd.lv
* distinct character, same glyph in Unicode
From: Ingo Schwarze @ 2011-03-17 19:01 UTC
  To: stsp; +Cc: discuss

Hi Stephan,

i'm looking for information about issues with distinct characters
having the same glyphs in their standard representation, in particular
when there are modes of alternative representation where they need to
be represented by different glyphs or different transliterations.
As you are working on Unicode character sets, do you have any
pointers?

In particular,

  http://std.dkuug.dk/CEN/tc304/guide/gucsch06.htm

clearly states that "distinct characters may have the same glyph".
That guide also cites the ISO/IEC 10646-1:1993 definition of
a "character":

  A member of a set of elements used for the organisation, control,
  or representation of data.

That definition is further explained:

  One cannot just look at the characters, in a visual representation,
  since distinct characters may have the same glyph.
  The question should be whether they are both used in the same way
  in the organisation, control or representation of data.

Let's take a very simple example:
The characters represented by glyphs with two superimposed dots.

In French, the two dots are called "tr'ema".  What they *represent* -
see the "representation of data" in the above explanation - is that
the vowel with the diacritic is to be pronounced *separately from
the preceding vowel*, but the vowel itself is unchanged.
For example, consider les mots "ast'ero:ide" and "Citro:en".

In German, the two dots are called "umlaut".  What they represent
is that the pronunciation of the vowel with the diacritic is
*modified*; there is no special relationship to an adjacent vowel,
if any; consider die Worte "H:auser", "B:oen", "ge:offnet".

Thus, the French ":o" (o diaeresis) - note that the Grand Robert
explicitly lists this one as occurring in natively French words -
is clearly not the same character as the German ":o" (o umlaut).
Due to the limited number of characters in ISO-latin-1, i'm not
surprised that this distinction was blurred - however, do you know
why there is apparently no distinction in Unicode either?
All sources i can find recommend using the "diaeresis" class of
characters for German umlauts.

The resulting problem is that in ASCII transliteration - or in any
font lacking glyphs representing diacritics - French and German ":o"
require different transliteration, which i regard as another clear
indication that they are different characters.  French ":o" must
be transliterated to "o" ("asteroide" is acceptable, whereas
"asteroiede" would be absurdly wrong), while German ":o" must be
transliterated as "oe" ("geoeffnet" is correct, while "geoffnet"
would just be plain wrong).
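
A minimal sketch of that difference, assuming the language of the text
is known from outside the text itself.  The function name to_ascii and
the language tags "fr" and "de" are hypothetical and only serve the
illustration; the point is that the same code point needs a different
ASCII fallback depending on the language.

```python
import unicodedata

# German umlaut vowels get an appended "e" in ASCII (illustrative table).
GERMAN_UMLAUTS = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
}

def to_ascii(word: str, lang: str) -> str:
    """Transliterate to ASCII; lang is a hypothetical tag, "fr" or "de"."""
    out = []
    for ch in unicodedata.normalize("NFC", word):
        if lang == "de" and ch in GERMAN_UMLAUTS:
            # Umlaut: the vowel quality changes, so spell it out as vowel + e.
            out.append(GERMAN_UMLAUTS[ch])
        else:
            # Diaeresis and other accents: decompose, then drop the marks.
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)

print(to_ascii("ge\u00f6ffnet", "de"))       # German o umlaut
print(to_ascii("ast\u00e9ro\u00efde", "fr")) # French i diaeresis
print(to_ascii("Citro\u00ebn", "fr"))        # French e diaeresis
```

Without the lang argument, the function would have no way to decide
between the two fallbacks, which is exactly the problem described above.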

Even though i don't know Chinese or Japanese, the situation over there
seems even worse, see

  http://unicode.org/faq/han_cjk.html#3

  "Q: If the character shapes are different in different parts of
      East Asia, why were the characters unified?

   A: The Unicode standard is designed to encode characters, not glyphs.
      Even where there are substantial variations in the standard way of
      writing a character from locale to locale, if the fundamental
      identity of the character is not in question, then a single
      character is encoded in Unicode."

But what is not answered is why the characters are considered the same.
Apparently, we have quite *different* characters used in quite different
ways to represent completely different data, in particular
 - different meaning
 - different pronunciation
 - used in completely different contexts
 - and even represented by slightly different glyphs
and still they are clobbered into the same Unicode character,
making it fundamentally impossible, if i understand correctly,
to generate a transliteration from a CJK Unicode text without
additional assumptions (e.g., that all characters contained
are intended to be Japanese and not Chinese characters).
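
To illustrate the point with a sketch: the two-character word U+65E5
U+672C is read "Nihon" in Japanese and "Riben" (pinyin, tone marks
omitted) in Mandarin, yet the code points are identical.  The table
READINGS and the language tags below are hypothetical and hold just this
one entry; a real transliterator would need a full dictionary, but it
would need the language tag all the same.

```python
# Hypothetical reading table keyed by (text, language tag).
READINGS = {
    ("\u65e5\u672c", "ja"): "Nihon",
    ("\u65e5\u672c", "zh"): "Riben",
}

def transliterate(text: str, lang: str) -> str:
    """Look up a romanization; the text alone does not determine it."""
    try:
        return READINGS[(text, lang)]
    except KeyError:
        raise LookupError(f"no reading known for {text!r} in {lang!r}")

word = "\u65e5\u672c"
print(transliterate(word, "ja"))
print(transliterate(word, "zh"))
```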

Isn't that in fundamental contradiction with the definition of
a "character", and a flaw in the design of Unicode?
Or in case it is not and i'm just confused, which is the
standard way to represent a character in a way that allows
transliteration, should the necessity arise?

This is relevant to mandoc because whatever solution to such problems
one might consider for mandoc_char(7) or in general roff(7) encodings
should at least be compatible with Unicode in the output, but it looks
like Unicode is an inadequate tool even for the simple task of
distinguishing different characters from each other...  :-(

Thanks,
  Ingo
--
 To unsubscribe send an email to discuss+unsubscribe@mdocml.bsd.lv


* Re: distinct character, same glyph in Unicode
From: Thomas Klausner @ 2011-03-17 22:00 UTC
  To: discuss

On Thu, Mar 17, 2011 at 08:01:50PM +0100, Ingo Schwarze wrote:
> Thus, the French ":o" (o diaeresis) - note that the Grand Robert
> explicitly lists this one as occurring in natively French words -
> is clearly not the same character as the German ":o" (o umlaut).
> Due to the limited number of characters in ISO-latin-1, i'm not
> surprised that this distinction was blurred - however, do you know
> why there is apparently no distinction in Unicode either?
> All sources i can find recommend using the "diaeresis" class of
> characters for German umlauts.

Sometimes, there's a need to distinguish between the umlaut sign and
the diaeresis sign. In these cases, the following recommendation by
ISO/IEC JTC 1/SC 2/WG 2 should be followed:

    * To represent the umlaut use Combining Diaeresis (U+0308)
    * To represent the diaeresis use Combining Grapheme Joiner (CGJ,
    U+034F) + Combining Diaeresis (U+0308)

From http://en.wikipedia.org/wiki/Umlaut_%28diacritic%29 and
originally http://unicode.org/faq/char_combmark.html#18
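
A minimal sketch of how that recommendation could be used for ASCII
transliteration, assuming the text has actually been encoded this way.
The helper names umlaut, diaeresis and to_ascii are hypothetical; only
the two code points come from the recommendation.  Note that a plain
precomposed character such as U+00F6 decomposes to the base letter plus
U+0308 and is therefore treated as an umlaut here.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

def umlaut(base: str) -> str:
    """Encode an umlaut: base letter + U+0308."""
    return base + DIAERESIS

def diaeresis(base: str) -> str:
    """Encode a diaeresis: base letter + CGJ + U+0308."""
    return base + CGJ + DIAERESIS

def to_ascii(text: str) -> str:
    """Umlaut becomes vowel + "e"; diaeresis and other marks are dropped."""
    out = []
    prev = ""
    for ch in unicodedata.normalize("NFD", text):
        if ch == DIAERESIS:
            if prev != CGJ:
                out.append("e")   # no CGJ in front: treat as umlaut
        elif ch != CGJ and not unicodedata.combining(ch):
            out.append(ch)        # keep base letters, skip CGJ and accents
        prev = ch
    return "".join(out)

print(to_ascii("ge" + umlaut("o") + "ffnet"))
print(to_ascii("Citro" + diaeresis("e") + "n"))
```

Because CGJ has combining class 0, normalization does not fold the
sequence back into the precomposed character, so the distinction
survives NFC and NFD.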

 Thomas
