Re: Raw UTF-8? - J.C. Roberts

From: "J.C. Roberts" <list-jcr@designtools.org>
To: discuss@mdocml.bsd.lv
Cc: "Ulrich Spörlein" <uqs@spoerlein.net>
Subject: Re: Raw UTF-8?
Date: Sat, 10 Jul 2010 11:11:18 -0700
Message-ID: <20100710111118.1930f01c.list-jcr@designtools.org>
In-Reply-To: <20100709210539.GA2465@roadrunner.spoerlein.net>

On Fri, 9 Jul 2010 22:05:39 +0100 Ulrich Spörlein <uqs@spoerlein.net>
wrote:
>
> On Wed, 07.07.2010 at 23:17:25 +0200, Joerg Sonnenberger wrote:
> > On Wed, Jul 07, 2010 at 11:12:12PM +0200, Ingo Schwarze wrote:
> > > Hi Joerg,
> > > 
> > > Joerg Sonnenberger wrote on Wed, Jul 07, 2010 at 09:18:08PM +0200:
> > > 
> > > > Consider my name -- I would strongly hope that output devices
> > > > with proper Latin1/Latin15/UTF-8 support to use the diacrit,
> > > > but fall back to the transliterated version otherwise.
> > > 
> > > You hope in vain.  Did you try?
> > 
> > Yes. lp(1) on NetBSD is such an example. It does the right thing
> > with groff. Depending on the output device (-Tlatin1 vs -Tascii),
> > it will either use the umlaut for \(:o or the oe transliteration.
> 
> This also works fine with FreeBSD's groff when rendering to UTF-8
> aware terminals using -Tutf8 (and of course in -Tps and -Thtml mode).
> 
> I really hope the sentiment expressed in this thread is in jest, as I
> would stop considering mandoc(1) a viable alternative for FreeBSD's
> man subsystem if it will never support UTF-8 output (and then render
> \(:o as ö like it should).
> 
> Uli

I doubt Ingo was joking and I do understand his concerns, but I agree
UTF-8 support is very important and many consider it a "requirement"
these days.

Personally, I think the \(:o syntax is nonsense. It's an ancient and sad
work-around addressing only one of the countless transliterations and/or
translations needed for a complete solution. If we tried to create 7-bit
strings like this for every possible transliteration and/or translation
of every non-ascii character, the list would be absolutely humongous and
computationally intractable as well as still being incomplete and often
totally inaccurate.

UTF-8 sucks less.

Since we now have UTF-8, it seems better to error out on the archaic \(:o
syntax to prompt change, rather than support it and prolong both a bad
idea and yet another syntax everyone needs to learn.

More importantly, the real problem is the *idea* of automating
transliteration. If you think it through, you'll realize automated
transliteration cannot be completely solved. A complete solution would
require an accurate transliteration, or even translation, to ascii of
every non-ascii character, as well as doing so correctly for every
possible language/usage/context. In essence, you are asking for perfect
automated translation even when perfect manual translation can be
impossible in some situations.

Given the need to support ascii-only terminals/outputs, and given the
need to support non-ascii characters, and given fully automated
transliteration/translation is currently impossible, at first glance it
seems there is an irreconcilable conflict.

Luckily, if we look at it again, there is a way to resolve it.

Since we cannot solve the problem of automated transliteration (and
hence, automated translation) for all cases, the idea itself is flawed.
The best thing to do is change the problem we're trying to solve.
Instead of trying to automate the transliteration/translation of
non-ascii characters, we can impose a simple requirement.

The simplest answer would be to allow non-ascii if and only if an ascii
equivalent is provided, and otherwise error out. This puts both the
option to use non-ascii characters and the responsibility for correct
transliteration/translation in the hands of the author.

I don't mean to pick on Joerg, but names are excellent examples as well
as one of the most compelling reasons to have proper support for
non-ascii characters. A format something like:

	{ascii, utf-8}

such as:
	J{oe, \u00F6;}rg
or
	{Joerg, J\u00F6rg}
or
	{Joerg Sonnenberger, J\u00F6rg Sonnenberger}

Ummm... no, the above is just an example, not a suggestion of syntax.
There's probably an existing IF-THEN-ELSE which could be leveraged
without undue overhead, but you get the main idea... --make the author
provide both the THEN and the ELSE.
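
To make the "author provides both" idea concrete, here is a minimal
sketch in Python. It is not a proposal for mandoc and not anything
mandoc implements: the {ascii, utf-8} pair syntax is only the
illustrative one from above, the names PAIR and render are made up, and
the sketch assumes the second half of each pair is already real UTF-8
text rather than a \u escape.

```python
import re

# Hypothetical pair syntax from the example above: {ascii, utf-8}.
PAIR = re.compile(r"\{([^{},]*),\s*([^{}]*)\}")


def render(text, utf8_capable):
    """Replace each {ascii, utf-8} pair with the half the output can show."""
    def pick(match):
        ascii_form, utf8_form = match.group(1), match.group(2)
        # The author-supplied fallback must itself be plain ASCII.
        if not ascii_form.isascii():
            raise ValueError(f"fallback is not ASCII: {ascii_form!r}")
        return utf8_form if utf8_capable else ascii_form

    return PAIR.sub(pick, text)
```

With this, render("J{oe, \u00f6}rg", utf8_capable=False) gives "Joerg"
and the same call with utf8_capable=True gives "Jörg". No
transliteration table is involved; the author's own spelling is used
either way.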

In many situations, even when a terminal is capable of displaying the
UTF-8, it could still be beneficial to also display the ascii, possibly
in parentheses. There are plenty of idiots like me who do not know how
to pronounce or even type an "o" with a diacritic, so showing the ascii
transliterated/translated version really does help. If you saw a formal
name in Japanese, Arabic, Thai, Russian, or any language you don't know,
written in its native character set, could you pronounce or type it?
Worse yet, when it comes to the ascii-fication of non-ascii names of
people, there are tons of variations and different people have different
preferences, so the result is there is no "right" way to do it and the
best practice is to avoid offense by requiring a transcription or
translation from the person. And then there are the people who really
want to be unique (like everyone else) and intentionally (mis?)spell
their name in their own words...

http://pichaus.com/wattoom-zink-pubbawup-gazork-@911d3fd1ffb157c0e23066faca4cf751/

The state of California could not write out my full name correctly on my
driver's license, and the US Federal Government could not write out my
full name correctly on my passport, but at least the latter sent me a
nice "Sorry for the inconvenience" letter. Though I personally learned
not to care about it at an early age, most people are offended if you
get their name wrong.

Throwing an error if an ascii equivalent is not provided is fairly
harsh, but it is necessary to prevent hidden information on ascii-only
terminals/outputs and also to prevent offending anyone (by either
omission in ascii, or by misspelling the ascii equivalent).

	jcr
