From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mailout.scc.kit.edu (mailout.scc.kit.edu [129.13.185.202])
    by krisdoz.my.domain (8.14.5/8.14.5) with ESMTP id s93GJBBI016097
    for ; Fri, 3 Oct 2014 12:19:12 -0400 (EDT)
Received: from hekate.usta.de (asta-nat.asta.uni-karlsruhe.de [172.22.63.82])
    by scc-mailout-02.scc.kit.edu with esmtp (Exim 4.72 #1)
    id 1Xa5ZS-0005mE-GT; Fri, 03 Oct 2014 18:19:06 +0200
Received: from donnerwolke.usta.de ([172.24.96.3])
    by hekate.usta.de with esmtp (Exim 4.77) (envelope-from )
    id 1Xa5ZS-00067T-Ek; Fri, 03 Oct 2014 18:19:06 +0200
Received: from iris.usta.de ([172.24.96.5] helo=usta.de)
    by donnerwolke.usta.de with esmtp (Exim 4.72) (envelope-from )
    id 1Xa5ZS-0006Ds-Cs; Fri, 03 Oct 2014 18:19:06 +0200
Received: from schwarze by usta.de with local (Exim 4.77) (envelope-from )
    id 1Xa5Yi-0001Cf-LZ; Fri, 03 Oct 2014 18:18:20 +0200
Date: Fri, 3 Oct 2014 18:18:20 +0200
From: Ingo Schwarze
To: discuss@mdocml.bsd.lv
Cc: Kristaps Dzonsons
Subject: Re: Ambiguous grammar: unicode vs. \[uX] escapes
Message-ID: <20141003161820.GA28214@iris.usta.de>
References: <542C14E1.3040700@bsd.lv>
X-Mailinglist: mdocml-discuss
Reply-To: discuss@mdocml.bsd.lv
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <542C14E1.3040700@bsd.lv>
User-Agent: Mutt/1.5.21 (2010-09-15)

Hi Kristaps,

Kristaps Dzonsons wrote on Wed, Oct 01, 2014 at 04:51:13PM +0200:

> In adding diacriticals to the shiny new MathML output, I stumbled
> across a curious ambiguity.
>
> Basically, I wanted the following sequence:
>
>   { a sub b } under
>
> Which in eqn(7), means a_b with a line under it all.
>
> In the new eqn.c, I have a special "bottom" string I set to a
> corresponding under-diacritical. (The others have a "top" string.)
> I was setting this to \[ul], underscore. However, the character
> refused to appear.
>
> Mystified, I explored further.
> Then I saw that in print_encode()
> (html.c), the \[ul] was being detected as a Unicode codepoint. Why?
> Because the sequence is \[uxxx] (mandoc.c:88).

That is definitely a bug.

Obviously, \[ul] is not a Unicode character escape, but it is
documented as a normal character escape sequence in mandoc_char(7),
so we have to make it work.

> Is there any consensus on how we should handle this? groff_char(7)
> doesn't say anything, but I'm guessing the Unicode codepoints should
> be 4--6 hexdigits long. That's an easy fix, but I'm not sure if
> it's the right approach.
>
> Thoughts?

Well, the first thing to check certainly is what groff does.

It outputs nothing for \[uX], \[uXX], \[uXXX], \C'uX', \C'uXX',
\C'uXXX' with one exception: ua and uA are up arrows.

Notably, the u0000 .. u001F control characters get passed through
to the output. In particular, u0008 (backspace) corrupts the output,
u0009 is an uninterpreted tab, and u000a is an uninterpreted newline.
The non-printable character u007F gets through, too.
The characters in the range u0080 .. u009F seem to be non-printable
Unicode characters and get through verbatim.
Finally, u00A0 .. u0FFF are special symbols.

Usually, unless we have strong reasons to do otherwise, we follow
groff. So I'd say treating character names as Unicode names if and
only if they contain 4 to 6 hex digits is the right thing to do.

I would not, however, accept anything below u0080. Certainly not
control characters; they can be dangerous. Printable ASCII
characters encoded as Unicode are useless, and even groff treats
them inconsistently: all work with \C'u00XX', but only a few work
with \[u00XX]. So I think we can leave that part as it is, and
just do what you propose.

Yours,
  Ingo
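
The rule proposed in this message (treat a character name as a Unicode
escape only if it is "u" followed by 4 to 6 hex digits, and reject
anything below u0080) can be sketched in C roughly as follows. This is
only an illustration under those stated assumptions, not the code in
mandoc.c or html.c; the function name unicode_escape_value() and its
return convention are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of mandoc.
 * "name" is the text between "\[" and "]", "len" its length.
 * Returns the codepoint if the name should be treated as a Unicode
 * escape, or -1 if it should be looked up as an ordinary character
 * name instead (for example "ul" or "ua").
 */
long
unicode_escape_value(const char *name, size_t len)
{
	size_t	 i;
	long	 cp;
	int	 c;

	/* Require a leading 'u' followed by 4 to 6 characters. */
	if (len < 5 || len > 7 || name[0] != 'u')
		return -1;

	cp = 0;
	for (i = 1; i < len; i++) {
		c = (unsigned char)name[i];
		if (!isxdigit(c))
			return -1;
		/* Accumulate the value of one hex digit. */
		if (isdigit(c))
			cp = cp * 16 + (c - '0');
		else
			cp = cp * 16 + (tolower(c) - 'a' + 10);
	}

	/* Reject control characters and ASCII encoded as Unicode. */
	if (cp < 0x80)
		return -1;

	return cp;
}
```

With this sketch, "ul" and "ua" fall through to the normal character
table because they are too short, "u0008" and "u0041" are rejected
because they are below u0080, and "u2014" yields 0x2014.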