From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 7 Jul 2010 23:12:12 +0200
From: Ingo Schwarze <schwarze@usta.de>
To: discuss@mdocml.bsd.lv
Subject: Re: Raw UTF-8?
Message-ID: <20100707211212.GC19725@iris.usta.de>
References: <4c33f0f0.0c87970a.3458.fffff43f@mx.google.com>
 <20100707185815.GA19725@iris.usta.de>
 <20100707191807.GA18154@britannica.bec.de>
X-Mailinglist: mdocml-discuss
Reply-To: discuss@mdocml.bsd.lv
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20100707191807.GA18154@britannica.bec.de>
User-Agent: Mutt/1.5.20 (2009-06-14)

Hi Joerg,

Joerg Sonnenberger wrote on Wed, Jul 07, 2010 at 09:18:08PM +0200:

> Consider my name -- I would strongly hope that output devices with
> proper Latin1/Latin15/UTF-8 support to use the diacrit, but fall
> back to the transliterated version otherwise.

You hope in vain.  Did you try?

Both old and new groff render that as 'J"\borg Sonnenberger',
which looks like "Jorg Sonnenberger" on a typical terminal.
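
Why the accent disappears can be sketched in a few lines of Python.
This is only a rough model, assuming a terminal that moves the cursor
back on a backspace and then overwrites the cell instead of
overstriking it; the function name render_overwriting is made up for
illustration, and real terminals differ in detail.

```python
def render_overwriting(data: str) -> str:
    """Model a terminal where backspace moves the cursor left and the
    next character replaces whatever was already in that cell."""
    cells = []
    col = 0
    for ch in data:
        if ch == "\b":
            # Backspace: step left, but never past the first column.
            col = max(col - 1, 0)
        else:
            if col < len(cells):
                cells[col] = ch      # overwrite, no overstrike
            else:
                cells.append(ch)
            col += 1
    return "".join(cells)


# The double quote is written first, then replaced by the "o".
print(render_overwriting('J"\borg Sonnenberger'))
```

On a device that really overstrikes, such as a printer, both glyphs
would land in the same cell; this model keeps only the last one.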

Maybe the reason for using the unreliable backspace-encoding
variant instead of the transliteration "oe" is that more languages
than just German might use the "LATIN SMALL LETTER O WITH DIAERESIS",
as Unicode calls it, and who knows what a good transliteration from
those languages into ASCII might look like?

The point is, for correct results, you must transliterate before
encoding, when you still know the context, e.g. the language,
which is often required to figure out a correct transliteration.

Thus, you should really use

.An Joerg Sonnenberger

and never

.An J\(:org Sonnenberger

when documenting your programs.
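
To illustrate why the transliteration has to happen while the language
is still known, here is a minimal Python sketch.  The table, the
fallback and the function name to_ascii are assumptions made up for
this example; only the German "oe" convention comes from the
discussion above.

```python
# LATIN SMALL LETTER O WITH DIAERESIS, written as an escape so that
# this example itself stays pure ASCII.
O_DIAERESIS = "\u00f6"

# Hypothetical per-language tables.  Only the German entry reflects
# the convention discussed above.
TRANSLIT = {
    "de": {O_DIAERESIS: "oe"},
}

# Assumed fallback when the language is unknown: drop the diacritic.
FALLBACK = {O_DIAERESIS: "o"}


def to_ascii(text: str, lang: str) -> str:
    """Transliterate text to ASCII using the table for lang, or the
    fallback table if the language is not known."""
    table = TRANSLIT.get(lang, FALLBACK)
    return "".join(table.get(ch, ch) for ch in text)


name = "J" + O_DIAERESIS + "rg Sonnenberger"
print(to_ascii(name, "de"))       # language known
print(to_ascii(name, "unknown"))  # language lost
```

Once the name has been encoded as a bare character escape, the
language is no longer available, and only the fallback branch remains.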


> You know that C99 just like many other modern language (dialects)
> allow full 8bit input?

I know that some do, and I have fought with Python code garbled
in that way, and all the more do I call it insane.

> The primary problem I have with using 8bit input for mandoc(1) (or groff
> in general) is that it doesn't have a way to specify the input character
> set. If that is addressed, the discussion would move to the more
> interesting point of transliteration.

In my experience, as soon as you start dealing with character sets,
chaos ensues.  WTF has made matters worse, not better, because now
many people think it is OK to scatter crap all over the place.
In typesetting, the mentioned chaos is unfortunately unavoidable,
and you need to deal with it; but most of the time, it is also easier
to handle there because in most typesetting environments, you deal
with one language at a time, and you know beforehand with which one.

Unless we enjoy pain, bloat and code obfuscation *and* want to be
continuously distracted from serious development, we should keep
mandoc as far away from any kind of charset considerations as
possible.

Yours,
  Ingo