discuss@mandoc.bsd.lv
* Accents vs. combining accents
@ 2014-02-26  7:51 Anthony J. Bentley
  2014-02-28 17:06 ` Ingo Schwarze
From: Anthony J. Bentley @ 2014-02-26  7:51 UTC
  To: discuss; +Cc: Ted Unangst

Mandoc misbehaves a bit when printing accents in UTF-8. In summary:

- Under normal circumstances, \` and \' should print spacing accents
and not combining accents.
- Maybe we should consider printing real quotes (U+2018/9) on raw `
and ' in UTF-8 mode. Maybe worth bringing up on the groff list too.


texinfo2man (part of the devel/gindent build process) converts the
following info source:

`slithy_toves.c'

into the following man(7) source:

\`slithy_toves.c\'

This is, of course, wrong. ` and ' in TeX represent left and right
single quotes, but in troff \` and \' are accents, not quotation
marks, so this is a bug in texinfo2man.

mandoc also has a bug in this situation. It represents \` and \' as
combining grave and acute accents (U+0300 and U+0301, respectively).
But according to groff.info, section "Using Symbols":


 -- Escape: \'
     This is a backslash followed by the apostrophe character, ASCII
     character `0x27' (EBCDIC character `0x7D').  The same as `\[aa]',
     the acute accent.

 -- Escape: \`
     This is a backslash followed by ASCII character `0x60' (EBCDIC
     character `0x79' usually).  The same as `\[ga]', the grave accent.


And in turn, groff_char(7):


       The composite request is used to map most of the accents to non-spacing
       glyph names; the values given in parentheses are the original (spacing)
       ones.

       Output   Input   PostScript     Unicode         Notes
       ------------------------------------------------------------
       '        \[aa]   acute          u0301 (u00B4)   +
       `        \[ga]   grave          u0300 (u0060)   +


In situations with no composite request (I guess?), mandoc should
print U+0060 and U+00B4 for \` and \' respectively, as groff does.
Wrongly printing combining accents as it does now leads to dramatic
visual artifacts throughout the manpage.
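
The difference between the two kinds of accent can be sketched in C.
This is only an illustration, not mandoc code: utf8_encode() and
is_combining_mark() are hypothetical helper names, and the range test
assumes that only the Unicode "Combining Diacritical Marks" block
(U+0300 to U+036F) matters here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Encode one Unicode code point as UTF-8 into buf, which must hold
 * at least 5 bytes.  Returns the number of bytes written (not
 * counting the terminating NUL), or 0 for surrogates and for values
 * beyond U+10FFFF.
 */
size_t
utf8_encode(unsigned int cp, char *buf)
{
	size_t n = 0;

	assert(buf != NULL);
	if (cp < 0x80) {
		buf[n++] = (char)cp;
	} else if (cp < 0x800) {
		buf[n++] = (char)(0xC0 | (cp >> 6));
		buf[n++] = (char)(0x80 | (cp & 0x3F));
	} else if (cp < 0x10000) {
		if (cp < 0xD800 || cp > 0xDFFF) {
			buf[n++] = (char)(0xE0 | (cp >> 12));
			buf[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
			buf[n++] = (char)(0x80 | (cp & 0x3F));
		}
	} else if (cp <= 0x10FFFF) {
		buf[n++] = (char)(0xF0 | (cp >> 18));
		buf[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
		buf[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
		buf[n++] = (char)(0x80 | (cp & 0x3F));
	}
	buf[n] = '\0';
	return n;
}

/*
 * Non-zero if cp lies in the Combining Diacritical Marks block.
 * Such a character has no width of its own; a terminal draws it
 * on top of whatever character precedes it.
 */
int
is_combining_mark(unsigned int cp)
{
	return cp >= 0x0300 && cp <= 0x036F;
}
```

Both U+0301 and U+00B4 encode to two bytes, so the output looks
similar at the byte level. The combining form, however, attaches
itself to the preceding character (the final "c" of the file name in
the example above), while the spacing form occupies a cell of its own.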

(Side note: in Texinfo, ` and ' represent quote marks while \` and \'
represent accents. In troff, likewise, ` and ' represent quote marks
while \` and \' represent accents. So this is a bit of overzealous
escaping on texinfo2man's part. But then again, it looks like neither
groff nor mandoc actually renders ` and ' as quote marks except in
print formats like PDF. Which means that (unless groff and/or mandoc
start converting ` and ' in UTF-8 output, which neither currently
does) the really correct characters there are \(oq and \(cq.
texinfo2man had a patch submitted to use those in 2005, but it never
got committed... sigh. I'll nudge upstream, and try to push this to
ports after unlock.)
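
As a rough sketch of the conversion suggested in the side note, the
following C function rewrites ` and ' as \(oq and \(cq. It is not
texinfo2man's code and not the 2005 patch: quotes_to_roff() is a
hypothetical name, and the function naively converts every ` and '
it sees, without looking at context.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy in to out, replacing ` with \(oq and ' with \(cq.
 * At most outsz bytes are written, including the terminating NUL;
 * the output is truncated at a whole-token boundary rather than
 * overflowing.  Returns the length of the resulting string.
 */
size_t
quotes_to_roff(const char *in, char *out, size_t outsz)
{
	size_t n = 0;

	if (outsz == 0)
		return 0;
	for (; *in != '\0'; in++) {
		char one[2] = { *in, '\0' };
		const char *rep = one;
		size_t len;

		if (*in == '`')
			rep = "\\(oq";		/* opening single quote */
		else if (*in == '\'')
			rep = "\\(cq";		/* closing single quote */
		len = strlen(rep);
		if (n + len >= outsz)
			break;			/* no room left: truncate */
		memcpy(out + n, rep, len);
		n += len;
	}
	out[n] = '\0';
	return n;
}
```

Applied to the Texinfo input `slithy_toves.c', this yields
\(oqslithy_toves.c\(cq instead of the accent escapes \` and \'.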

(Boy, lots of manpages wrongly escape the ' character. afm2pl(1),
bzr(1), curl(1), kpsetool(1), lacheck(1), makeindex(1), mendex(1),
pdfinfo(1), pdftops(1), xsltproc(1)... Maybe people should be forced
to run their manpages through gropdf and make sure they look
typographically pretty!)

-- 
Anthony J. Bentley
--
 To unsubscribe send an email to discuss+unsubscribe@mdocml.bsd.lv

* Re: Accents vs. combining accents
  2014-02-26  7:51 Accents vs. combining accents Anthony J. Bentley
@ 2014-02-28 17:06 ` Ingo Schwarze
From: Ingo Schwarze @ 2014-02-28 17:06 UTC
  To: discuss; +Cc: Ted Unangst, Anthony J. Bentley

Hi Anthony,

Anthony J. Bentley wrote on Wed, Feb 26, 2014 at 12:51:38AM -0700:

> Mandoc misbehaves a bit when printing accents in UTF-8.

Right, see below for a patch which i intend to commit.

> In summary:
> 
> - Under normal circumstances, \` and \' should print spacing accents
> and not combining accents.

Correct.  The same is true for many other accent escape sequences.

> - Maybe we should consider printing real quotes (U+2018/9) on raw `
> and ' in UTF-8 mode. Maybe worth bringing up on the groff list too.

That would cause copy and paste issues similar to the ones just
discussed regarding hyphens and dashes.  So at least for manuals,
it would probably have to be disabled right away, just like
manuals output the ASCII character for both - and \-.

I tend to agree with Dmitrij Czarkoff that plain ASCII input
is better left as-is, and people should use escape sequences
if they want specific fancy UTF-8 characters (except that in
manuals, they probably shouldn't, it merely harms portability
to ask for fancy characters).

[...]
> In situations with no composite request (I guess?), mandoc should
> print U+0060 and U+00B4 for \` and \' respectively, as groff does.

Correct.  Mandoc doesn't support escape sequences involving
composite characters at all, so mandoc has to use the codes
you cite in all cases.

Yours,
  Ingo

P.S.
I snipped your discussion of texinfo2man, which makes sense to me.


Index: chars.in
===================================================================
RCS file: /cvs/src/usr.bin/mandoc/chars.in,v
retrieving revision 1.20
diff -u -p -r1.20 chars.in
--- chars.in	22 Jan 2014 20:58:35 -0000	1.20
+++ chars.in	28 Feb 2014 16:46:26 -0000
@@ -49,21 +49,21 @@ CHAR("c",			"",		0)
 CHAR("}",			"",		0)
 
 /* Accents. */
-CHAR("a\"",			"\"",		779)
+CHAR("a\"",			"\"",		733)
 CHAR("a-",			"-",		175)
 CHAR("a.",			".",		729)
-CHAR("a^",			"^",		770)
-CHAR("\'",			"\'",		769)
-CHAR("aa",			"\'",		769)
-CHAR("ga",			"`",		768)
-CHAR("`",			"`",		768)
-CHAR("ab",			"`",		774)
-CHAR("ac",			",",		807)
-CHAR("ad",			"\"",		776)
+CHAR("a^",			"^",		94)
+CHAR("\'",			"\'",		180)
+CHAR("aa",			"\'",		180)
+CHAR("ga",			"`",		96)
+CHAR("`",			"`",		96)
+CHAR("ab",			"`",		728)
+CHAR("ac",			",",		184)
+CHAR("ad",			"\"",		168)
 CHAR("ah",			"v",		711)
 CHAR("ao",			"o",		730)
-CHAR("a~",			"~",		771)
-CHAR("ho",			",",		808)
+CHAR("a~",			"~",		126)
+CHAR("ho",			",",		731)
 CHAR("ha",			"^",		94)
 CHAR("ti",			"~",		126)
 
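
The effect of the patch above can be checked mechanically. The sketch
below copies the old and new decimal code points from the changed
CHAR() lines into a table and counts how many fall into the Unicode
"Combining Diacritical Marks" block (U+0300 to U+036F, decimal 768 to
879). The struct, the table, and the function names are hypothetical
and not part of mandoc.

```c
#include <stddef.h>

/* One changed entry: roff character name, old and new code point. */
struct accent_change {
	const char	*name;
	int		 old_code;
	int		 new_code;
};

/* Values copied from the removed and added lines of the diff. */
static const struct accent_change accent_changes[] = {
	{ "a\"",	779,	733 },
	{ "a^",		770,	 94 },
	{ "'",		769,	180 },
	{ "aa",		769,	180 },
	{ "ga",		768,	 96 },
	{ "`",		768,	 96 },
	{ "ab",		774,	728 },
	{ "ac",		807,	184 },
	{ "ad",		776,	168 },
	{ "a~",		771,	126 },
	{ "ho",		808,	731 },
};

#define N_ACCENT_CHANGES \
	(sizeof(accent_changes) / sizeof(accent_changes[0]))

/* Non-zero if cp is in the Combining Diacritical Marks block. */
int
in_combining_block(int cp)
{
	return cp >= 0x0300 && cp <= 0x036F;
}

/*
 * Count the table entries whose code point is a combining mark,
 * looking at the new values if use_new is non-zero and at the old
 * values otherwise.
 */
size_t
count_combining(int use_new)
{
	size_t i, n = 0;

	for (i = 0; i < N_ACCENT_CHANGES; i++)
		if (in_combining_block(use_new ?
		    accent_changes[i].new_code :
		    accent_changes[i].old_code))
			n++;
	return n;
}
```

All eleven old values are combining marks and none of the new ones
are, which matches the intent of replacing combining accents with
their spacing counterparts.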
