From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from theseus.empoweringmedia.net (theseus.empoweringmedia.net [67.72.106.8])
	by krisdoz.my.domain (8.14.3/8.14.3) with ESMTP id o6AI8xYt015905
	for ; Sat, 10 Jul 2010 14:09:00 -0400 (EDT)
Received: from 32.sub-97-154-124.myvzw.com ([97.154.124.32] helo=lynx.foo.bar)
	by theseus.empoweringmedia.net with esmtpsa (TLSv1:AES256-SHA:256)
	(Exim 4.69) (envelope-from ) id 1OXeTt-0002Vx-9k;
	Sat, 10 Jul 2010 14:08:53 -0400
Date: Sat, 10 Jul 2010 11:11:18 -0700
From: "J.C. Roberts"
To: discuss@mdocml.bsd.lv
Cc: Ulrich =?ISO-8859-1?Q?Sp=F6rlein?=
Subject: Re: Raw UTF-8?
Message-Id: <20100710111118.1930f01c.list-jcr@designtools.org>
In-Reply-To: <20100709210539.GA2465@roadrunner.spoerlein.net>
References: <4c33f0f0.0c87970a.3458.fffff43f@mx.google.com>
	<20100707185815.GA19725@iris.usta.de>
	<20100707191807.GA18154@britannica.bec.de>
	<20100707211212.GC19725@iris.usta.de>
	<20100707211725.GA29241@britannica.bec.de>
	<20100709210539.GA2465@roadrunner.spoerlein.net>
Organization: -
X-W00F: /\/\E0\/\/
X-Mailer: -
X-Mailinglist: mdocml-discuss
Reply-To: discuss@mdocml.bsd.lv
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On Fri, 9 Jul 2010 22:05:39 +0100 Ulrich Spörlein wrote:
>
> On Wed, 07.07.2010 at 23:17:25 +0200, Joerg Sonnenberger wrote:
> > On Wed, Jul 07, 2010 at 11:12:12PM +0200, Ingo Schwarze wrote:
> > > Hi Joerg,
> > >
> > > Joerg Sonnenberger wrote on Wed, Jul 07, 2010 at 09:18:08PM +0200:
> > >
> > > > Consider my name -- I would strongly hope that output devices
> > > > with proper Latin1/Latin15/UTF-8 support to use the diacrit,
> > > > but fall back to the transliterated version otherwise.
> > >
> > > You hope in vain. Did you try?
> >
> > Yes. lp(1) on NetBSD is such an example. It does the right thing
> > with groff. Depending on the output device (-Tlatin1 vs -Tascii),
> > it will either use the umlaut for \(:o or the oe transliteration.
>
> This also works fine with FreeBSD's groff when rendering to UTF-8
> aware terminals using -Tutf8 (and of course in -Tps and -Thtml mode).
>
> I really hope the sentiment expressed in this thread is in jest, as I
> would stop considering mandoc(1) a viable alternative for FreeBSD's
> man subsystem if it will never support UTF-8 output (and then render
> \(:o as ö like it should).
>
> Uli

I doubt Ingo was joking and I do understand his concerns, but I agree
UTF-8 support is very important and many consider it a "requirement"
these days.

Personally, I think the \(:o syntax is nonsense. It's an ancient and
sad work-around addressing only one of the countless transliterations
and/or translations needed for a complete solution. If we tried to
create 7-bit strings like this for every possible transliteration
and/or translation of every non-ascii character, the list would be
absolutely humongous and computationally intractable, and it would
still be incomplete and often totally inaccurate.

UTF-8 sucks less. Since we now have UTF-8, it seems better to error out
on the archaic \(:o syntax to prompt change, rather than support it and
prolong a bad idea and yet another syntax everyone needs to learn.

More importantly, the real problem is the *idea* of automating
transliteration. If you think it through, you'll realize automated
transliteration cannot be completely solved. A complete solution would
require an accurate transliteration, or even translation, to ascii of
every non-ascii character, as well as doing so correctly for every
possible language/usage/context. In essence, you are asking for perfect
automated translation even when perfect manual translation can be
impossible in some situations.

Given the need to support ascii-only terminals/outputs, and given the
need to support non-ascii characters, and given fully automated
transliteration/translation is currently impossible, at first glance it
seems there is an irreconcilable conflict. Luckily, we can look at it
again.
And there is a way to resolve it. Since we cannot solve the problem of
automated transliteration (and hence, automated translation) for all
cases, the idea itself is flawed. The best thing to do is change the
problem we're trying to solve.

Instead of trying to automate the transliteration/translation of
non-ascii characters, we can impose a simple requirement. The simplest
answer would be to allow non-ascii if and only if an ascii equivalent
is provided, otherwise error. This puts both the option to use
non-ascii characters and the responsibility for correct
transliteration/translation in the hands of the author.

I don't mean to pick on Joerg, but names are excellent examples as well
as one of the most compelling reasons to have proper support for
non-ascii characters.

A format something like:

	{ascii, utf-8}

such as:

	J{oe, \u00F6;}rg
or
	{Joerg, J\u00F6rg}
or
	{Joerg Sonnenberger, J\u00F6rg Sonnenberger}

Ummm... no, the above is just an example, not a suggestion of syntax.
There's probably an existing IF-THEN-ELSE which could be leveraged
without undue overhead, but you get the main idea... --make the author
provide both the THEN and the ELSE.

In many situations, even when a terminal is capable of displaying the
UTF-8, it could still be beneficial to also display the ascii, possibly
in parentheses. There are plenty of idiots like me who do not know how
to pronounce or even type an "o" with a diacritic, so showing the ascii
transliterated/translated version really does help. If you saw a formal
name in Japanese, Arabic, Thai, Russian, or any language you don't
know, written in its native character set, could you pronounce or type
it?

Worse yet, when it comes to the ascii-fication of non-ascii names of
people, there are tons of variations and different people have
different preferences, so there is no "right" way to do it, and the
best practice is to avoid offense by requiring a transcription or
translation from the person.
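To make the "author provides both the THEN and the ELSE" idea above a
bit more concrete, here is a rough Python sketch. Everything in it is
an assumption for illustration only: the {ascii, utf-8} pair syntax is
just the throwaway example from above (with the character written
directly instead of an escape), and the names render,
MissingAsciiError, utf8_capable and show_both are made up; none of it
is mandoc or groff code.

```python
import re

# Hypothetical pair syntax "{ascii, utf-8}", borrowed from the example
# above. It is an illustration, not a proposal for real mdoc syntax.
PAIR = re.compile(r"\{([^{},]*),\s*([^{}]*)\}")


class MissingAsciiError(ValueError):
    """Raised when non-ascii text has no author-supplied ascii equivalent."""


def render(source, utf8_capable, show_both=False):
    """Expand every {ascii, utf-8} pair for the given kind of output."""

    def expand(match):
        ascii_form = match.group(1).strip()
        utf8_form = match.group(2).strip()
        # The ascii side is mandatory and must really be 7-bit clean.
        if not ascii_form or not ascii_form.isascii():
            raise MissingAsciiError(
                "bad ascii equivalent in %r" % match.group(0))
        if not utf8_capable:
            return ascii_form
        if show_both:
            # Optionally show the ascii form in parentheses as well.
            return "%s (%s)" % (utf8_form, ascii_form)
        return utf8_form

    # Any non-ascii character left outside a pair has no fallback: error.
    leftover = PAIR.sub("", source)
    if not leftover.isascii():
        raise MissingAsciiError("non-ascii text without an ascii equivalent")
    return PAIR.sub(expand, source)
```

With this sketch, render("J{oe, ö}rg", utf8_capable=False) falls back to
the author's own spelling, while a bare "Jörg" with no pair is rejected
instead of being silently mangled or dropped.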
And then there are the people who really want to be unique (like
everyone else) and intentionally (mis?)spell their name in their own
words...

http://pichaus.com/wattoom-zink-pubbawup-gazork-@911d3fd1ffb157c0e23066faca4cf751/

The state of California could not write out my full name correctly on
my driver's license, and the US Federal Government could not write out
my full name correctly on my passport, but at least the latter sent me
a nice "Sorry for the inconvenience" letter. Though I personally
learned not to care about it at an early age, most people are offended
if you get their name wrong.

Throwing an error if an ascii equivalent is not provided is fairly
harsh, but it is necessary to prevent hidden information on ascii-only
terminals/outputs and also to prevent offending anyone (by either
omission in ascii, or by misspelling the ascii equivalent).

jcr
--
 To unsubscribe send an email to discuss+unsubscribe@mdocml.bsd.lv