Ambiguous grammar: unicode vs. \[uX] escapes

discuss@mandoc.bsd.lv

* Ambiguous grammar: unicode vs. \[uX] escapes
@ 2014-10-01 14:51 Kristaps Dzonsons
  2014-10-03 16:18 ` Ingo Schwarze
  0 siblings, 1 reply; 2+ messages in thread
From: Kristaps Dzonsons @ 2014-10-01 14:51 UTC (permalink / raw)
  To: discuss

In adding diacriticals to the shiny new MathML output, I stumbled across 
a curious ambiguity.

Basically, I wanted the following sequence:

  { a sub b } under

In eqn(7), this means a_b with a line under all of it.

In the new eqn.c, I have a special "bottom" string I set to a 
corresponding under-diacritical.  (The others have a "top" string.)  I 
was setting this to \[ul], underscore.  However, the character refused 
to appear.

Mystified, I explored further.  Then I saw that in print_encode() 
(html.c), the \[ul] was being detected as a Unicode codepoint.  Why? 
Because the sequence is \[uxxx] (mandoc.c:88).
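
To illustrate the ambiguity, here is a minimal sketch of a prefix-only test.  It is hypothetical and is not the code at mandoc.c:88; the function name naive_is_unicode() is made up for this example.  It only shows how a rule of the form "u followed by anything" also swallows named escapes such as \[ul] and \[ua].

```c
#include <stddef.h>

/*
 * Hypothetical, deliberately naive classifier (an assumption for
 * illustration, NOT mandoc's actual code): any escape name that
 * starts with 'u' and has at least one more character is taken
 * to be a Unicode codepoint escape.
 * NAME is the text between the brackets of \[...].
 * Returns 1 if the name would be treated as Unicode, else 0.
 */
static int
naive_is_unicode(const char *name)
{
	if (name == NULL)
		return 0;
	return name[0] == 'u' && name[1] != '\0';
}
```

With this rule, naive_is_unicode("ul") and naive_is_unicode("u2014") both return 1, so the named escape \[ul] is never looked up in the character table.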

Is there any consensus on how we should handle this?  groff_char(7)
doesn't say anything, but I'm guessing the Unicode codepoints should be
4 to 6 hex digits long.  That's an easy fix, but I'm not sure if it's the
right approach.

Thoughts?
--
 To unsubscribe send an email to discuss+unsubscribe@mdocml.bsd.lv


* Re: Ambiguous grammar: unicode vs. \[uX] escapes
  2014-10-01 14:51 Ambiguous grammar: unicode vs. \[uX] escapes Kristaps Dzonsons
@ 2014-10-03 16:18 ` Ingo Schwarze
  0 siblings, 0 replies; 2+ messages in thread
From: Ingo Schwarze @ 2014-10-03 16:18 UTC (permalink / raw)
  To: discuss; +Cc: Kristaps Dzonsons

Hi Kristaps,

Kristaps Dzonsons wrote on Wed, Oct 01, 2014 at 04:51:13PM +0200:

> In adding diacriticals to the shiny new MathML output, I stumbled
> across a curious ambiguity.
> 
> Basically, I wanted the following sequence:
> 
>  { a sub b } under
> 
> In eqn(7), this means a_b with a line under all of it.
> 
> In the new eqn.c, I have a special "bottom" string I set to a
> corresponding under-diacritical.  (The others have a "top" string.)
> I was setting this to \[ul], underscore.  However, the character
> refused to appear.
> 
> Mystified, I explored further.  Then I saw that in print_encode()
> (html.c), the \[ul] was being detected as a Unicode codepoint.  Why?
> Because the sequence is \[uxxx] (mandoc.c:88).

That is definitely a bug.  Obviously, \[ul] is not a Unicode character
escape, but it is documented as a normal character escape sequence
in mandoc_char(7), so we have to make it work.

> Is there any consensus on how we should handle this?  groff_char(7)
> doesn't say anything, but I'm guessing the Unicode codepoints should
> be 4 to 6 hex digits long.  That's an easy fix, but I'm not sure if
> it's the right approach.
> 
> Thoughts?

Well, the first thing to check certainly is what groff does.

It outputs nothing for \[uX], \[uXX], \[uXXX], \C'uX', \C'uXX', \C'uXXX'
with one exception: ua and uA are up arrows.

Notably, the u0000 .. u001F control characters get passed through
to the output.  In particular, u0008 (backspace) corrupts the output,
u0009 is an uninterpreted tab, and u000a is an uninterpreted newline.
The non-printable character u007F gets through, too.
The characters in the range u0080 .. u009F seem to be non-printable
Unicode characters and get through verbatim.
Finally, u00A0 .. u0FFF are special symbols.

Usually, unless we have strong reasons to do otherwise, we follow
groff.  So I'd say treating character names as Unicode names if and
only if they contain 4 to 6 hex digits is the right thing to do.

I would not, however, accept anything below u0080.  Certainly not
control characters, they can be dangerous.  Printable ASCII characters
encoded as Unicode are useless, and even groff treats them
inconsistently: all work with \C'u00XX', but only a few work with
\[u00XX].  So I think we can leave that part as it is, and just do
what you propose.
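
As a minimal sketch of that rule, and not the actual mandoc change: the function below accepts a name as a Unicode escape only if it is 'u' followed by 4 to 6 hex digits and the value is at least 0x80.  The function name unicode_escape() and its return convention are assumptions made for this example.  It accepts hex digits in either case and applies no upper bound on the codepoint; both choices are assumptions as well.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of mandoc).
 * NAME is the text between the brackets of \[...].
 * Returns the codepoint if NAME is 'u' followed by 4 to 6 hex
 * digits whose value is >= 0x80; returns -1 otherwise, so that
 * the caller can fall back to the named-character table
 * (for example for "ul" or "ua").
 */
static long
unicode_escape(const char *name)
{
	size_t i, len;
	long cp;

	if (name == NULL || name[0] != 'u')
		return -1;

	/* Count the characters after the leading 'u'. */
	len = strlen(name + 1);
	if (len < 4 || len > 6)
		return -1;

	/* Every one of them must be a hex digit. */
	for (i = 1; i <= len; i++)
		if (!isxdigit((unsigned char)name[i]))
			return -1;

	cp = strtol(name + 1, NULL, 16);

	/* Reject control characters and printable ASCII. */
	if (cp < 0x80)
		return -1;

	return cp;
}
```

Under these assumptions, unicode_escape("ul") and unicode_escape("u0008") return -1, while unicode_escape("u2014") returns 0x2014.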

Yours,
  Ingo

