Date: Wed, 7 Jul 2010 23:12:12 +0200
From: Ingo Schwarze
To: discuss@mdocml.bsd.lv
Subject: Re: Raw UTF-8?
Message-ID: <20100707211212.GC19725@iris.usta.de>
References: <4c33f0f0.0c87970a.3458.fffff43f@mx.google.com> <20100707185815.GA19725@iris.usta.de> <20100707191807.GA18154@britannica.bec.de>
In-Reply-To: <20100707191807.GA18154@britannica.bec.de>
X-Mailinglist: mdocml-discuss
Reply-To: discuss@mdocml.bsd.lv
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.20 (2009-06-14)

Hi Joerg,

Joerg Sonnenberger wrote on Wed, Jul 07, 2010 at 09:18:08PM +0200:

> Consider my name -- I would strongly hope that output devices with
> proper Latin1/Latin15/UTF-8 support use the diacritic, but fall
> back to the transliterated version otherwise.

You hope in vain.  Did you try?
Both old and new groff render that as 'J"\borg Sonnenberger'
(\b being a backspace), which looks like "Jorg Sonnenberger"
on a typical terminal.
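
To make that rendering concrete, here is a minimal Python sketch of what
a simple terminal does with such an overstrike sequence.  The function
name render_on_terminal is made up for illustration, and the model
(backspace moves the cursor left, the next character overwrites the
cell) is a simplifying assumption; real terminals and pagers are more
involved.

```python
def render_on_terminal(stream: str) -> str:
    """Simulate a very simple terminal line (an illustrative assumption).

    A backspace moves the cursor one cell to the left; any later
    printable character overwrites whatever was stored in that cell.
    """
    cells = []   # what ends up visible, one character per cell
    cursor = 0   # index of the cell the next character goes into
    for ch in stream:
        if ch == "\b":
            cursor = max(0, cursor - 1)
        elif cursor < len(cells):
            cells[cursor] = ch      # overwrite: the earlier glyph is lost
            cursor += 1
        else:
            cells.append(ch)
            cursor += 1
    return "".join(cells)


# The sequence quoted above: J, double quote, backspace, o, r, g, ...
visible = render_on_terminal('J"\borg Sonnenberger')
print(visible)
```

In this model the double quote is simply overwritten by the "o", so
the diaeresis is gone and no "oe" shows up either.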
Maybe the reason for using the unreliable backspace-encoding variant
instead of the transliteration "oe" is that more languages than just
German might use the "LATIN SMALL LETTER O WITH DIAERESIS", as Unicode
calls it, and who knows what a good transliteration from those
languages into ASCII might look like?

The point is, for correct results, you must transliterate before
encoding, while you still know the context, e.g. the language, which
is often required to figure out a correct transliteration.

Thus, you should really use

  .An Joerg Sonnenberger

and never

  .An J\(:org Sonnenberger

when documenting your programs.

> You know that C99, just like many other modern languages (dialects),
> allows full 8bit input?

I know that some do, and I have fought with Python code garbled
in that way, and all the more do I call it insane.

> The primary problem I have with using 8bit input for mandoc(1) (or groff
> in general) is that it doesn't have a way to specify the input character
> set. If that were addressed, the discussion would move to the more
> interesting point of transliteration.

In my experience, as soon as you start dealing with character sets,
chaos ensues.  UTF-8 has made matters worse, not better, because now
many people think it is OK to scatter crap all over the place.

In typesetting, that chaos is unfortunately unavoidable, and you need
to deal with it; but most of the time, it is also easier to handle
there, because in most typesetting environments you deal with one
language at a time, and you know beforehand which one.

Unless we enjoy pain, bloat and code obfuscation *and* want to be
continuously distracted from serious development, we should keep
mandoc as far away from any kind of charset considerations as
possible.

Yours,
  Ingo
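
P.S.  On the point about transliterating while the language is still
known, a rough Python sketch of the idea follows.  Everything in it is
hypothetical: the table TRANSLITERATION, the function to_ascii and the
language tags are invented for illustration, and only the German
mapping reflects a convention I am sure of; the "tr" entry is merely an
assumed example of a language where dropping the dots may be preferred.

```python
# Hypothetical per-language tables (illustration only, not exhaustive).
# Non-ASCII letters are written as escapes so this file stays pure ASCII.
TRANSLITERATION = {
    # German: the usual "append e" convention.
    "de": {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss"},
    # Assumed example of a language that just drops the dots.
    "tr": {"\u00f6": "o", "\u00fc": "u"},
}


def to_ascii(text: str, lang: str) -> str:
    """Transliterate text using the table for the given language.

    Without a known language there is no safe choice, so the function
    refuses to guess instead of silently producing something wrong.
    """
    table = TRANSLITERATION.get(lang)
    if table is None:
        raise ValueError("no transliteration rules for language: " + lang)
    return "".join(table.get(ch, ch) for ch in text)


# The same code point comes out differently depending on the language.
german = to_ascii("J\u00f6rg Sonnenberger", "de")
other = to_ascii("g\u00f6l", "tr")
print(german, other)
```

Once the text has been reduced to a bare code point, the language
argument is gone, and so is any hope of picking the right table.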