From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mailout.scc.kit.edu (mailout.scc.kit.edu [129.13.185.202])
	by krisdoz.my.domain (8.14.5/8.14.5) with ESMTP id s93GJBBI016097
	for <discuss@mdocml.bsd.lv>; Fri, 3 Oct 2014 12:19:12 -0400 (EDT)
Received: from hekate.usta.de (asta-nat.asta.uni-karlsruhe.de [172.22.63.82])
	by scc-mailout-02.scc.kit.edu with esmtp (Exim 4.72 #1)
	id 1Xa5ZS-0005mE-GT; Fri, 03 Oct 2014 18:19:06 +0200
Received: from donnerwolke.usta.de ([172.24.96.3])
	by hekate.usta.de with esmtp (Exim 4.77)
	(envelope-from <schwarze@usta.de>)
	id 1Xa5ZS-00067T-Ek; Fri, 03 Oct 2014 18:19:06 +0200
Received: from iris.usta.de ([172.24.96.5] helo=usta.de)
	by donnerwolke.usta.de with esmtp (Exim 4.72)
	(envelope-from <schwarze@usta.de>)
	id 1Xa5ZS-0006Ds-Cs; Fri, 03 Oct 2014 18:19:06 +0200
Received: from schwarze by usta.de with local (Exim 4.77)
	(envelope-from <schwarze@usta.de>)
	id 1Xa5Yi-0001Cf-LZ; Fri, 03 Oct 2014 18:18:20 +0200
Date: Fri, 3 Oct 2014 18:18:20 +0200
From: Ingo Schwarze <schwarze@usta.de>
To: discuss@mdocml.bsd.lv
Cc: Kristaps Dzonsons <kristaps@bsd.lv>
Subject: Re: Ambiguous grammar: unicode vs. \[uX] escapes
Message-ID: <20141003161820.GA28214@iris.usta.de>
References: <542C14E1.3040700@bsd.lv>
X-Mailinglist: mdocml-discuss
Reply-To: discuss@mdocml.bsd.lv
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <542C14E1.3040700@bsd.lv>
User-Agent: Mutt/1.5.21 (2010-09-15)

Hi Kristaps,

Kristaps Dzonsons wrote on Wed, Oct 01, 2014 at 04:51:13PM +0200:

> In adding diacriticals to the shiny new MathML output, I stumbled
> across a curious ambiguity.
> 
> Basically, I wanted the following sequence:
> 
>  { a sub b } under
> 
> Which in eqn(7), means a_b with a line under it all.
> 
> In the new eqn.c, I have a special "bottom" string I set to a
> corresponding under-diacritical.  (The others have a "top" string.)
> I was setting this to \[ul], underscore.  However, the character
> refused to appear.
> 
> Mystified, I explored further.  Then I saw that in print_encode()
> (html.c), the \[ul] was being detected as a Unicode codepoint.  Why?
> Because the sequence is \[uxxx] (mandoc.c:88).

That is definitely a bug.  Obviously, \[ul] is not a Unicode character
escape; it is documented as a normal character escape sequence in
mandoc_char(7), so we have to make it work.

> Is there any consensus on how we should handle this?  groff_char(7)
> doesn't say anything, but I'm guessing the Unicode codepoints should
> be 4--6 hexdigits long.  That's an easy fix, but I'm not sure if
> it's the right approach.
> 
> Thoughts?

Well, the first thing to check certainly is what groff does.

It outputs nothing for \[uX], \[uXX], \[uXXX], \C'uX', \C'uXX', \C'uXXX'
with one exception: ua and uA are up arrows.

Notably, the u0000 .. u001F control characters get passed through
to the output.  In particular, u0008 (backspace) corrupts the output,
u0009 is an uninterpreted tab, and u000a is an uninterpreted newline.
The non-printable character u007F gets through, too.
The characters in the range u0080 .. u009F seem to be non-printable
Unicode characters and get through verbatim.
Finally, u00A0 .. u0FFF are special symbols.

Usually, unless we have strong reasons to do otherwise, we follow
groff.  So I'd say treating character names as Unicode names if and
only if they contain 4 to 6 hex digits is the right thing to do.

I would not, however, accept anything below u0080.  Certainly not
control characters; they can be dangerous.  Printable ASCII characters
encoded as Unicode are useless, and even groff treats them
inconsistently: all work with \C'u00XX', but only a few work with
\[u00XX].  So I think we can leave that part as it is, and just do
what you propose.
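
To make the rule concrete, here is a minimal sketch in C of the check
described above.  It is not the actual mandoc code: the function name
unicode_escape_value() is hypothetical, accepting hex digits in either
case is an assumption, and so is the upper bound of 0x10FFFF (the
Unicode maximum), which the discussion above does not mention.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from mandoc.
 * Treat name[0..len) as a Unicode escape only if it is 'u' followed
 * by 4 to 6 hex digits and the value is at least 0x80.
 * Return the codepoint, or -1 if the name is not a Unicode escape.
 */
long
unicode_escape_value(const char *name, size_t len)
{
	char	 buf[7];	/* up to 6 hex digits plus NUL */
	size_t	 i, ndigits;
	long	 cp;

	/* 'u' plus 4 to 6 digits: total length 5 to 7. */
	if (len < 5 || len > 7 || name[0] != 'u')
		return -1;

	ndigits = len - 1;
	for (i = 0; i < ndigits; i++)
		if (!isxdigit((unsigned char)name[i + 1]))
			return -1;

	memcpy(buf, name + 1, ndigits);
	buf[ndigits] = '\0';
	cp = strtol(buf, NULL, 16);

	/* Reject control characters and ASCII. */
	if (cp < 0x80)
		return -1;
	/* Assumption: reject values beyond the Unicode range. */
	if (cp > 0x10FFFF)
		return -1;
	return cp;
}
```

With this check, "ul" and "ua" are too short to qualify and fall
through to the normal character name lookup, while "u2264" yields
0x2264 and "u0008" is rejected.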

Yours,
  Ingo