discuss@mandoc.bsd.lv
From: Ingo Schwarze <schwarze@usta.de>
To: discuss@mdocml.bsd.lv
Cc: Kristaps Dzonsons <kristaps@bsd.lv>
Subject: Re: Ambiguous grammar: unicode vs. \[uX] escapes
Date: Fri, 3 Oct 2014 18:18:20 +0200
Message-ID: <20141003161820.GA28214@iris.usta.de>
In-Reply-To: <542C14E1.3040700@bsd.lv>

Hi Kristaps,

Kristaps Dzonsons wrote on Wed, Oct 01, 2014 at 04:51:13PM +0200:

> In adding diacriticals to the shiny new MathML output, I stumbled
> across a curious ambiguity.
> 
> Basically, I wanted the following sequence:
> 
>  { a sub b } under
> 
> Which in eqn(7), means a_b with a line under it all.
> 
> In the new eqn.c, I have a special "bottom" string I set to a
> corresponding under-diacritical.  (The others have a "top" string.)
> I was setting this to \[ul], underscore.  However, the character
> refused to appear.
> 
> Mystified, I explored further.  Then I saw that in print_encode()
> (html.c), the \[ul] was being detected as a Unicode codepoint.  Why?
> Because the sequence is \[uxxx] (mandoc.c:88).

That is definitely a bug.  Obviously, \[ul] is not a Unicode character
escape, but it is documented as a normal character escape sequence
in mandoc_char(7), so we have to make it work.

> Is there any consensus on how we should handle this?  groff_char(7)
> doesn't say anything, but I'm guessing the Unicode codepoints should
> be 4--6 hexdigits long.  That's an easy fix, but I'm not sure if
> it's the right approach.
> 
> Thoughts?

Well, the first thing to check certainly is what groff does.

It outputs nothing for \[uX], \[uXX], \[uXXX], \C'uX', \C'uXX', \C'uXXX'
with one exception: ua and uA are up arrows.

Notably, the u0000 .. u001F control characters get passed through
to the output.  In particular, u0008 (backspace) corrupts the output,
u0009 is an uninterpreted tab, and u000a is an uninterpreted newline.
The non-printable character u007F gets through, too.
The characters in the range u0080 .. u009F seem to be non-printable
Unicode characters and get through verbatim.
Finally, u00A0 .. u0FFF are special symbols.

Usually, unless we have strong reasons to do otherwise, we follow
groff.  So I'd say treating a character name as a Unicode escape if
and only if it consists of "u" followed by 4 to 6 hex digits is the
right thing to do.
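
To make the rule concrete, here is a rough sketch of such a check.
The function name is_unicode_name() is hypothetical and this is not
the code in mandoc.c; accepting hex digits in either case is also an
assumption on my part.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not mandoc's actual code: return 1 if the
 * escape name of length sz looks like a Unicode escape, that is,
 * 'u' followed by 4 to 6 hex digits; return 0 otherwise.
 * Assumption: hex digits are accepted in either case.
 */
int
is_unicode_name(const char *name, size_t sz)
{
	size_t	 i;

	/* One byte for 'u' plus 4 to 6 digits. */
	if (sz < 5 || sz > 7 || name[0] != 'u')
		return 0;
	for (i = 1; i < sz; i++)
		if (!isxdigit((unsigned char)name[i]))
			return 0;
	return 1;
}
```

With a check like this, short names such as "ul" and "ua" fail the
length test and fall through to the normal character escape lookup.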

I would not, however, accept anything below u0080.  Certainly not
control characters; they can be dangerous.  Printable ASCII characters
encoded as Unicode are useless, and even groff treats them
inconsistently: all work with \C'u00XX', but only a few work with
\[u00XX].  So I think we can leave that part as it is, and just do
what you propose.
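
For illustration only, rejecting the low range could look roughly
like the following.  The name unicode_value() is hypothetical, this
is not the existing mandoc code, and the sketch assumes it is handed
a NUL-terminated string that has already been checked to contain
only hex digits.

```c
#include <stdlib.h>

/*
 * Hypothetical helper, not mandoc's actual code: convert the hex
 * digits following 'u' to a codepoint, rejecting anything below
 * U+0080.  Returns -1 for rejected or malformed input.
 * Assumption: digits is a NUL-terminated string of hex digits.
 */
long
unicode_value(const char *digits)
{
	char	*ep;
	long	 cp;

	cp = strtol(digits, &ep, 16);
	if (ep == digits || *ep != '\0')
		return -1;	/* empty input or trailing garbage */
	if (cp < 0x80)
		return -1;	/* control characters and ASCII */
	return cp;
}
```

So u0008 and u007F would be refused, while u00A0 and above would
still be converted.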

Yours,
  Ingo
--
 To unsubscribe send an email to discuss+unsubscribe@mdocml.bsd.lv
