discuss@mandoc.bsd.lv
* FWD: man.conf mandoc -Tlocale
From: Ingo Schwarze @ 2014-02-14 13:06 UTC (permalink / raw)
  To: discuss

Hi,

in OpenBSD, we are discussing moving the mandoc(1) default
from -Tascii to -Tlocale; see the mail on <tech@openbsd.org>
below.

How do you feel about that idea, in particular regarding other
operating systems like DragonFly, NetBSD, FreeBSD and from the
perspective of the pkgsrc packaging system?

Yours,
  Ingo

----- Forwarded message from Ingo Schwarze <schwarze@usta.de> -----

From: Ingo Schwarze <schwarze@usta.de>
Sender: owner-tech@openbsd.org
Date: Fri, 14 Feb 2014 14:02:26 +0100
To: Ted Unangst <tedu@tedunangst.com>
Cc: tech@openbsd.org
Subject: Re: man.conf mandoc -Tlocale

Hi Ted,

Ted Unangst wrote on Thu, Feb 13, 2014 at 09:22:04PM -0500:

> About 20 years after the invention of utf-8, I've decided to see what
> all the fuss is about and experiment with uxterm and whatnot.
> Naturally, this means I want to see sweet fancy quotes in all my man
> pages instead of the lame ``fake'' quotes. Convincing mandoc
> to give me what I want, however, requires a command line option. But
> what about all those old school ascii only terminals I still sometimes
> use?
> 
> mandoc fortunately has an option -Tlocale, which will pick between
> ascii and utf8 based on environment. Perfect! Let's use it.
> 
> Tested to work as expected in uxterm. Tested to change nothing in a
> regular xterm by default (no LC_CTYPE set).

Even though i don't use it, i'm not opposed to your patch.
I think it makes sense.

I even considered switching the mandoc(1) default from -Tascii to
-Tlocale in general, but forgot about it again.  If you like the
idea, that would be something to do after unlock; it might require
explicitly giving the -Tascii option in some build system and similar
contexts.

I think -Tlocale might be a saner default than -Tascii nowadays.
People who don't want UTF-8 shouldn't have it in their LC_CTYPE,
and it's hard to see why people who do want it and have it in their
LC_CTYPE should be forced to give -Tlocale or something similar
to each and every utility they call.
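
As a rough illustration of what "based on the environment" means here,
the following Python sketch applies the POSIX precedence LC_ALL, then
LC_CTYPE, then LANG, and treats a UTF-8 codeset suffix as a request for
UTF-8 output. The function names are hypothetical, and this is not how
mandoc itself decides; mandoc relies on setlocale(3) in C.

```python
def effective_ctype(env):
    """Return the locale name that governs LC_CTYPE for this environment."""
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    # Empty values count as unset.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:
            return value
    return "C"


def wants_utf8(env):
    """True if the effective LC_CTYPE names a UTF-8 codeset."""
    locale_name = effective_ctype(env)
    # A locale name has the shape language_TERRITORY.codeset@modifier.
    _, _, codeset = locale_name.partition(".")
    codeset = codeset.split("@", 1)[0]
    return codeset.replace("-", "").lower() == "utf8"


def pick_output_mode(env):
    """Hypothetical stand-in for the ascii/utf8 choice behind -Tlocale."""
    return "utf8" if wants_utf8(env) else "ascii"
```

With LC_CTYPE=de_DE.UTF-8 and nothing else set, this sketch picks
"utf8"; with no locale variables set at all, it picks "ascii".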

What do you think?
  Ingo


> Index: man.conf
> ===================================================================
> RCS file: /cvs/src/etc/man.conf,v
> retrieving revision 1.18
> diff -u -p -r1.18 man.conf
> --- man.conf	13 Jul 2013 20:21:52 -0000	1.18
> +++ man.conf	14 Feb 2014 02:14:29 -0000
> @@ -16,15 +16,15 @@ _subdir		{cat,man}1 {cat,man}8 {cat,man}
>  _suffix		.0
>  _build		.0.Z		/usr/bin/zcat %s
>  _build		.0.gz		/usr/bin/gzcat %s
> -_build		.[1-9n]		/usr/bin/mandoc %s
> -_build		.[1-9n].Z	/usr/bin/zcat %s | /usr/bin/mandoc
> -_build		.[1-9n].gz	/usr/bin/gzcat %s | /usr/bin/mandoc
> -_build		.[1-9][a-z]	/usr/bin/mandoc %s
> -_build		.[1-9][a-z].Z	/usr/bin/zcat %s | /usr/bin/mandoc
> -_build		.[1-9][a-z].gz	/usr/bin/gzcat %s | /usr/bin/mandoc
> -_build		.tbl		/usr/bin/mandoc %s
> -_build		.tbl.Z		/usr/bin/zcat %s | /usr/bin/mandoc
> -_build		.tbl.gz		/usr/bin/gzcat %s | /usr/bin/mandoc
> +_build		.[1-9n]		/usr/bin/mandoc -Tlocale %s
> +_build		.[1-9n].Z	/usr/bin/zcat %s | /usr/bin/mandoc -Tlocale
> +_build		.[1-9n].gz	/usr/bin/gzcat %s | /usr/bin/mandoc -Tlocale
> +_build		.[1-9][a-z]	/usr/bin/mandoc -Tlocale %s
> +_build		.[1-9][a-z].Z	/usr/bin/zcat %s | /usr/bin/mandoc -Tlocale
> +_build		.[1-9][a-z].gz	/usr/bin/gzcat %s | /usr/bin/mandoc -Tlocale
> +_build		.tbl		/usr/bin/mandoc -Tlocale %s
> +_build		.tbl.Z		/usr/bin/zcat %s | /usr/bin/mandoc -Tlocale
> +_build		.tbl.gz		/usr/bin/gzcat %s | /usr/bin/mandoc -Tlocale
>  
>  # Sections and their directories.
>  # All paths ending in '/' are the equivalent of entries specifying that
> 


----- End forwarded message -----
--
 To unsubscribe send an email to discuss+unsubscribe@mdocml.bsd.lv


* Re: FWD: man.conf mandoc -Tlocale
From: Thomas Klausner @ 2014-02-15  8:43 UTC (permalink / raw)
  To: discuss

On Fri, Feb 14, 2014 at 02:06:47PM +0100, Ingo Schwarze wrote:
> in OpenBSD, we are discussing moving the mandoc(1) default
> from -Tascii to -Tlocale; see the mail on <tech@openbsd.org>
> below.
> 
> How do you feel about that idea, in particular regarding other
> operating systems like DragonFly, NetBSD, FreeBSD and from the
> perspective of the pkgsrc packaging system?

I've tried this on the NetBSD ls(1) man page with
LC_CTYPE=de_DE.UTF-8 and didn't see a difference.

# man ls > ls.default
man: Formatting manual page...
# mandoc -Tlocale /usr/share/man/man1/ls.1 > ls.locale
# diff ls.*
#

Ideas why, or is this expected?

One thing I remember being broken at some point: Does this still allow
examples to be copied, or do we have to be extra careful about marking
them up then?

At some point (sorry, I don't remember details, not even if it was
mandoc or groff) I had the annoying state where 'man foo' replaced
dashes with some UTF-8 dash that the shell didn't accept when
pasting it in.
 Thomas

* Re: FWD: man.conf mandoc -Tlocale
From: Ingo Schwarze @ 2014-02-15  9:42 UTC (permalink / raw)
  To: Thomas Klausner; +Cc: discuss

Hi Thomas,

Thomas Klausner wrote on Sat, Feb 15, 2014 at 09:43:09AM +0100:
> On Fri, Feb 14, 2014 at 02:06:47PM +0100, Ingo Schwarze wrote:

>> in OpenBSD, we are discussing moving the mandoc(1) default
>> from -Tascii to -Tlocale; see the mail on <tech@openbsd.org>
>> below.
>> 
>> How do you feel about that idea, in particular regarding other
>> operating systems like DragonFly, NetBSD, FreeBSD and from the
>> perspective of the pkgsrc packaging system?

> I've tried this on the NetBSD ls(1) man page with
> LC_CTYPE=de_DE.UTF-8 and didn't see a difference.
> 
> # man ls > ls.default
> man: Formatting manual page...
> # mandoc -Tlocale /usr/share/man/man1/ls.1 > ls.locale
> # diff ls.*
> #
> 
> Ideas why, or is this expected?

No, it is not expected.

I just retried this sequence of commands on OpenBSD.
It works as expected using any of the following mandoc binaries:

 - the one built from OpenBSD base (built using the OpenBSD
   build system, not including compat glue)
 - the one built from mdocml.bsd.lv HEAD (built using the
   portable build system, including compat glue)
 - the one built from the mdocml.bsd.lv VERSION_1_12 branch
   (built using the portable build system, including compat glue)

When running with -Tlocale, all three mandoc binaries produce
output where the "``" and "''" double quotes in the NetBSD ls(1)
manual page (checked out from NetBSD CVS, src/bin/ls/ls.1)
are rendered as a single UTF-8 character, specifically:

$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE=de_DE.UTF-8
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_MESSAGES="C"
LC_ALL=
$ mandoc -Tlocale ls.1 | hexdump -C | grep -A1 digit            
00000470  63 20 64 69 67 69 74 20  e2 80 9c 6f 6e 65 e2 80  |c digit ...one..|
00000480  9d 2e 29 20 46 6f 72 63  65 20 6f 75 74 70 75 74  |..) Force output|
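
To read the dump: the byte sequences e2 80 9c and e2 80 9d are the
UTF-8 encodings of U+201C and U+201D, the left and right double
quotation marks. A small Python check, using only bytes copied from
the dump above (the helper name is made up for illustration):

```python
import unicodedata

# Bytes from offsets 0x470 through 0x482 of the hexdump above.
dump = bytes.fromhex(
    "63 20 64 69 67 69 74 20 e2 80 9c 6f 6e 65 e2 80 "
    "9d 2e 29"
)
text = dump.decode("utf-8")


def describe_non_ascii(s):
    """List (code point, Unicode name) for every non-ASCII character in s."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in s if ord(ch) > 127]


print(text)
print(describe_non_ascii(text))
```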

To debug your problem, i'd suggest first finding out what exactly is
broken for you.

Does -Tutf8 output differ from -Tascii output?

If output is not different even with -Tutf8, UTF-8 output itself
is likely to be broken, as opposed to locale detection.
In that case, i'd recommend checking what the value of the USE_WCHAR
preprocessor #define is while compiling the file term_ascii.c.

If it is different, locale detection is likely to be broken.
In that case, i'd recommend using gdb(1) to run "mandoc -Tlocale ls.1"
while LC_CTYPE=de_DE.UTF-8 is set and find out, in the file term_ascii.c,
function ascii_init(), what the value of the local variable "v" is
right after this function call:  setlocale(LC_ALL, "")
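
The two cases above amount to a small decision procedure. A minimal
sketch in Python, assuming the output of the three modes for the same
page has been captured as byte strings (for example by redirecting
"mandoc -Tascii ls.1", "mandoc -Tutf8 ls.1" and "mandoc -Tlocale ls.1"
to files). The function name and the returned messages are invented
for illustration, and the check only makes sense for a page that
contains characters rendered differently in UTF-8, such as the double
quotes in ls(1).

```python
def diagnose(ascii_out, utf8_out, locale_out):
    """Classify a -Tlocale problem from three captured outputs of one page."""
    if locale_out != ascii_out:
        # -Tlocale already produces something other than ASCII output.
        return "-Tlocale differs from -Tascii; nothing to debug"
    if utf8_out == ascii_out:
        # Even an explicit -Tutf8 falls back to ASCII.
        return "UTF-8 output looks broken; check USE_WCHAR for term_ascii.c"
    # -Tutf8 works, so only the locale-based choice fails.
    return "locale detection looks broken; inspect setlocale() in ascii_init()"
```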

> One thing I remember being broken at some point: Does this still allow
> examples to be copied, or do we have to be extra careful about marking
> them up then?

Yes.  Plain '-' as an input character is rendered as a UTF-8 hyphen:

$ mandoc -Tlocale ls.1 | hexdump -C | head -n 7 | tail -n 2
00000050  4e 08 4e 41 08 41 4d 08  4d 45 08 45 0a 20 20 20  |N.NA.AM.ME.E.   |
00000060  20 20 6c 08 6c 73 08 73  20 e2 80 93 20 6c 69 73  |  l.ls.s ... lis|

However, the input string "\-" is rendered as a plain ASCII minus sign,
even with -Tutf8:

$ mandoc -Tlocale ls.1 | hexdump -C | head -n 70 | tail -n 3 
00000430  0a 0a 20 20 20 20 20 54  68 65 20 6f 70 74 69 6f  |..     The optio|
00000440  6e 73 20 61 72 65 20 61  73 20 66 6f 6c 6c 6f 77  |ns are as follow|
00000450  73 3a 0a 0a 20 20 20 20  20 2d 08 2d 31 08 31 20  |s:..     -.-1.1 |

If i understand correctly, that is the usual typographical convention
in roff typesetting.

> At some point (sorry, I don't remember details, not even if it was
> mandoc or groff) I had the annoying state where 'man foo' replaced
> dashes with some UTF-8 dash that the shell didn't accept when
> pasting it in.

Yes, that can happen.

Actually, groff does exactly the same, and it does so by default:

$ echo $LC_CTYPE
de_DE.UTF-8
$ nroff -mandoc -c ls.1 | hexdump -C | head -n 7 | tail -n 2      
00000050  4e 08 4e 41 08 41 4d 08  4d 45 08 45 0a 20 20 20  |N.NA.AM.ME.E.   |
00000060  20 20 6c 08 6c 73 08 73  20 e2 80 93 20 6c 69 73  |  l.ls.s ... lis|
$ nroff -mandoc -c ls.1 | hexdump -C | head -n 70 | tail -n 3      
00000430  0a 0a 20 20 20 20 20 54  68 65 20 6f 70 74 69 6f  |..     The optio|
00000440  6e 73 20 61 72 65 20 61  73 20 66 6f 6c 6c 6f 77  |ns are as follow|
00000450  73 3a 0a 0a 20 20 20 20  20 2d 08 2d 31 08 31 20  |s:..     -.-1.1 |
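
The 08 bytes in these dumps are backspaces: a character, a backspace,
and the same character again is the terminal overstrike sequence for
bold, so "2d 08 2d 31 08 31" is a bold "-1". A minimal Python sketch
that removes such overstrikes, similar in spirit to col -b, makes the
dumps easier to compare; the function name is an assumption.

```python
def strip_overstrike(data):
    """Decode UTF-8 terminal output and drop backspace overstrikes."""
    out = []
    for ch in data.decode("utf-8"):
        if ch == "\b":
            # A backspace cancels the previous character on the same line.
            if out and out[-1] != "\n":
                out.pop()
        else:
            out.append(ch)
    return "".join(out)


# Bytes copied from the dumps above: the NAME line and the first option.
name_line = bytes.fromhex("6c 08 6c 73 08 73 20 e2 80 93 20 6c 69 73")
option = bytes.fromhex("2d 08 2d 31 08 31 20")
print(strip_overstrike(name_line))
print(strip_overstrike(option))
```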

Yours,
  Ingo

* Re: FWD: man.conf mandoc -Tlocale
From: Ingo Schwarze @ 2014-02-16 20:56 UTC (permalink / raw)
  To: Thomas Klausner; +Cc: discuss

Hi Thomas,

Dmitrij D. Czarkoff just pointed out to me in private mail that
my analysis wasn't quite right, so i reinvestigated, and i have
to correct this part:

Ingo Schwarze wrote on Sat, Feb 15, 2014 at 10:42:51AM +0100:
> Thomas Klausner wrote on Sat, Feb 15, 2014 at 09:43:09AM +0100:

>> One thing I remember being broken at some point: Does this still allow
>> examples to be copied, or do we have to be extra careful about marking
>> them up then?

> Yes.  Plain '-' as an input character is rendered as a UTF-8 hyphen:

That is *not* true.  Plain '-' always renders as plain '-'.

> $ mandoc -Tlocale ls.1 | hexdump -C | head -n 7 | tail -n 2
> 00000050  4e 08 4e 41 08 41 4d 08  4d 45 08 45 0a 20 20 20  |N.NA.AM.ME.E.   |
> 00000060  20 20 6c 08 6c 73 08 73  20 e2 80 93 20 6c 69 73  |  l.ls.s ... lis|

The reason for this is that we use \(en between .Nm and .Nd
in the NAME section, not a plain '-'.

> However, the input string "\-" is rendered as a plain ASCII minus sign,
> even with -Tutf8:
> 
> $ mandoc -Tlocale ls.1 | hexdump -C | head -n 70 | tail -n 3 
> 00000430  0a 0a 20 20 20 20 20 54  68 65 20 6f 70 74 69 6f  |..     The optio|
> 00000440  6e 73 20 61 72 65 20 61  73 20 66 6f 6c 6c 6f 77  |ns are as follow|
> 00000450  73 3a 0a 0a 20 20 20 20  20 2d 08 2d 31 08 31 20  |s:..     -.-1.1 |

That part is correct.

So, we have these mappings:

   input   output
   -----   ASCII    UTF-8
           -----    -----

       -   -        -
      \-   -        -
    \(hy   -        U+2010
    \(en   -        U+2013
    \(em   --       U+2014

See also these lines in chars.in:

CHAR("-",  "-",  45)
CHAR("hy", "-",  8208)
CHAR("en", "-",  8211)
CHAR("em", "--", 8212)

So, unless people put \(hy, \(en, or \(em into their example code,
i would expect copy and paste to work just fine even in UTF-8 mode.
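
The table and the chars.in lines say the same thing, since 8208, 8211
and 8212 are the decimal forms of U+2010, U+2013 and U+2014. A minimal
Python sketch of that mapping; the dictionary and function names are
hypothetical, only the entries quoted above are included, and it
assumes that CHAR("-", ...) is the entry for the "\-" escape.

```python
# Escape -> (ASCII rendering, Unicode code point), from the quoted lines.
DASHES = {
    "\\-": ("-", 45),
    "\\(hy": ("-", 8208),
    "\\(en": ("-", 8211),
    "\\(em": ("--", 8212),
}


def render(escape, utf8):
    """Return the rendering of one dash escape in ASCII or UTF-8 mode."""
    ascii_form, codepoint = DASHES[escape]
    return chr(codepoint) if utf8 else ascii_form


def paste_safe(escape):
    """True if the UTF-8 rendering is still plain ASCII."""
    return render(escape, utf8=True).isascii()


for esc in DASHES:
    print(esc, repr(render(esc, False)), f"U+{DASHES[esc][1]:04X}", paste_safe(esc))
```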

Yours,
  Ingo

* Re: FWD: man.conf mandoc -Tlocale
From: Ulrich Spörlein @ 2014-02-17 11:41 UTC (permalink / raw)
  To: discuss; +Cc: Thomas Klausner

On Sun, 2014-02-16 at 21:56:55 +0100, Ingo Schwarze wrote:
> So, we have these mappings:
> 
>    input   output
>    -----   ASCII    UTF-8
>            -----    -----
> 
>        -   -        -
>       \-   -        -
>     \(hy   -        U+2010
>     \(en   -        U+2013
>     \(em   --       U+2014
> 
> See also these lines in chars.in:
> 
> CHAR("-",  "-",  45)
> CHAR("hy", "-",  8208)
> CHAR("en", "-",  8211)
> CHAR("em", "--", 8212)
> 
> So, unless people put \(hy, \(en, or \(em into their example code,
> i would expect copy and paste to work just fine even in UTF-8 mode.

I don't think hyphens will be the problem, but quotes, where people
might have used .Dq when they actually want the literal ASCII quotes ""
as they are to be used in some shell or other code.
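
One way to look for such cases is to scan the rendered text of an
example for characters outside ASCII. A minimal Python sketch; the
function name is hypothetical, and the sample strings are invented,
not taken from any real manual.

```python
def non_ascii_chars(line):
    """Return (index, code point) for each non-ASCII character in line."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(line) if ord(ch) > 127]


# An example that pastes cleanly, and one rendered with typographic quotes.
print(non_ascii_chars('echo "hi"'))
print(non_ascii_chars("echo \u201chi\u201d"))
```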

In any case, -Tlocale should be the default and is the right thing to
do, IMHO.

Cheers,
Uli

* Re: FWD: man.conf mandoc -Tlocale
From: Anthony J. Bentley @ 2014-02-17 11:55 UTC (permalink / raw)
  To: discuss

On Mon, Feb 17, 2014 at 4:41 AM, Ulrich Spörlein <uqs@spoerlein.net> wrote:
> On Sun, 2014-02-16 at 21:56:55 +0100, Ingo Schwarze wrote:
>> So, we have these mappings:
>>
>>    input   output
>>    -----   ASCII    UTF-8
>>            -----    -----
>>
>>        -   -        -
>>       \-   -        -
>>     \(hy   -        U+2010
>>     \(en   -        U+2013
>>     \(em   --       U+2014
>>
>> See also these lines in chars.in:
>>
>> CHAR("-",  "-",  45)
>> CHAR("hy", "-",  8208)
>> CHAR("en", "-",  8211)
>> CHAR("em", "--", 8212)
>>
>> So, unless people put \(hy, \(en, or \(em into their example code,
>> i would expect copy and paste to work just fine even in UTF-8 mode.
>
> I don't think hyphens will be the problem, but quotes, where people
> might have used .Dq when they actually want the literal ASCII quotes ""
> as they are to be used in some shell or other code.

Thankfully, in my experience (using -Tlocale for well over a year)
this does not happen often--indeed, I haven't seen it at all. Any such
cases would be a bug in the manpage, but this is not a bug that would
be common enough to make -Tlocale an impractical default.

On the other hand, groff's hyphen substitution is extremely irritating
when it's turned on, since you'd be hard pressed to find a manual
*not* affected by it.

-- 
Anthony J. Bentley


* Re: FWD: man.conf mandoc -Tlocale
From: Thomas Klausner @ 2014-03-13  9:16 UTC (permalink / raw)
  To: Ingo Schwarze; +Cc: discuss

Hi Ingo!

On Sat, Feb 15, 2014 at 10:42:51AM +0100, Ingo Schwarze wrote:
> > I've tried this on the NetBSD ls(1) man page with
> > LC_CTYPE=de_DE.UTF-8 and didn't see a difference.
> > 
> > # man ls > ls.default
> > man: Formatting manual page...
> > # mandoc -Tlocale /usr/share/man/man1/ls.1 > ls.locale
> > # diff ls.*
> > #
> > 
> > Ideas why, or is this expected?
> 
> No, it is not expected.

It was the simple problem that USE_WCHAR was not defined when building
mandoc on NetBSD. I've changed that now.

The version from pkgsrc worked fine before that.

So both -Tlocale and -Tutf8 now produce UTF-8 double quote characters
when run on ls(1).

Thanks,
 Thomas
