discuss@mandoc.bsd.lv
From: Ingo Schwarze <schwarze@usta.de>
To: Thomas Klausner <wiz@NetBSD.org>
Cc: discuss@mdocml.bsd.lv
Subject: Re: FWD: man.conf mandoc -Tlocale
Date: Sat, 15 Feb 2014 10:42:51 +0100
Message-ID: <20140215094251.GA24366@iris.usta.de>
In-Reply-To: <20140215084309.GA14964@danbala.tuwien.ac.at>

Hi Thomas,

Thomas Klausner wrote on Sat, Feb 15, 2014 at 09:43:09AM +0100:
> On Fri, Feb 14, 2014 at 02:06:47PM +0100, Ingo Schwarze wrote:

>> in OpenBSD, we are discussing to move to mandoc(1) default
>> from -Tascii to -Tlocale, see the mail on <tech@openbsd.org>
>> below.
>> 
>> How do you feel about that idea, in particular regarding other
>> operating systems like DragonFly, NetBSD, FreeBSD and from the
>> perspective of the pkgsrc packaging system?

> I've tried this on the NetBSD man ls(1) man page with
> LC_CTYPE=de_DE.UTF-8 and didn't see a difference.
> 
> # man ls > ls.default
> man: Formatting manual page...
> # mandoc -Tlocale /usr/share/man/man1/ls.1 > ls.locale
> # diff ls.*
> #
> 
> Ideas why, or is this expected?

No, it is not expected.

I just retried this sequence of commands on OpenBSD.
It works as expected using any of the following mandoc binaries:

 - the one built from OpenBSD base (built using the OpenBSD
   build system, not including compat glue)
 - the one built from mdocml.bsd.lv HEAD (built using the
   portable build system, including compat glue)
 - the one built from the mdocml.bsd.lv VERSION_1_12 branch
   (built using the portable build system, including compat glue)

When running with -Tlocale, all three mandoc binaries produce
output where the "``" and "''" double quotes in the NetBSD ls(1)
manual page (checked out from NetBSD CVS, src/bin/ls/ls.1)
are rendered as a single UTF-8 character, specifically:

$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE=de_DE.UTF-8
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_MESSAGES="C"
LC_ALL=
$ mandoc -Tlocale ls.1 | hexdump -C | grep -A1 digit            
00000470  63 20 64 69 67 69 74 20  e2 80 9c 6f 6e 65 e2 80  |c digit ...one..|
00000480  9d 2e 29 20 46 6f 72 63  65 20 6f 75 74 70 75 74  |..) Force output|
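
For readers who want to map the bytes in such a hexdump back to
characters, here is a minimal Python sketch.  The function name
describe_utf8 is made up for this illustration; it only relies on
the standard library.

```python
import unicodedata


def describe_utf8(hex_bytes):
    """Decode a hex string such as 'e2 80 9c' as UTF-8 and return
    a list of (code point, Unicode name) pairs, one per character."""
    text = bytes.fromhex(hex_bytes).decode("utf-8")
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# The two three-byte sequences surrounding "one" in the dump above.
print(describe_utf8("e2 80 9c"))
print(describe_utf8("e2 80 9d"))
```

The two sequences decode to U+201C and U+201D, the left and right
double quotation marks.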

To debug your problem, i'd suggest first finding out what exactly is
broken for you.

Does -Tutf8 output differ from -Tascii output?

If the output is identical to -Tascii even with -Tutf8, UTF-8 output
itself is likely to be broken, as opposed to locale detection.
In that case, i'd recommend checking what value the USE_WCHAR
preprocessor #define has while compiling the file term_ascii.c.

If -Tutf8 output does differ from -Tascii output, locale detection is
likely to be broken.  In that case, i'd recommend using gdb(1) to run
"mandoc -Tlocale ls.1" while LC_CTYPE=de_DE.UTF-8 is set and finding
out, in the file term_ascii.c, function ascii_init(), what the value
of the local variable "v" is right after this function call:
setlocale(LC_ALL, "")
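
The decision procedure above can be sketched as a small Python
script.  This is only an illustration, not part of mandoc: the
names diagnose and render are made up, and the script assumes that
mandoc is in the PATH and that LC_CTYPE is already set to a UTF-8
locale in the calling environment.

```python
import subprocess
import sys


def diagnose(ascii_out, utf8_out, locale_out):
    """Classify the three renderings of one manual page.

    Returns 'utf8-output' if -Tutf8 is identical to -Tascii,
    'locale-detection' if only -Tlocale falls back to ASCII,
    'ok' if -Tlocale matches -Tutf8, and 'unclear' otherwise.
    """
    if utf8_out == ascii_out:
        # Also happens for pages without any non-ASCII output.
        return "utf8-output"
    if locale_out == ascii_out:
        return "locale-detection"
    if locale_out == utf8_out:
        return "ok"
    return "unclear"


def render(mode, page):
    """Run 'mandoc -T<mode> <page>' and return its raw output."""
    result = subprocess.run(
        ["mandoc", "-T" + mode, page],
        check=True,
        capture_output=True,
    )
    return result.stdout


if __name__ == "__main__" and len(sys.argv) == 2:
    page = sys.argv[1]
    print(diagnose(render("ascii", page),
                   render("utf8", page),
                   render("locale", page)))
```

A result of "utf8-output" points at the USE_WCHAR check, and
"locale-detection" points at the setlocale(3) return value.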

> One thing I remember being broken at some point: Does this still allow
> examples to be copied, or do we have to be extra careful about marking
> them up then?

Yes.  Plain '-' as an input character is rendered as a UTF-8 dash
(bytes e2 80 93, U+2013 EN DASH):

$ mandoc -Tlocale ls.1 | hexdump -C | head -n 7 | tail -n 2
00000050  4e 08 4e 41 08 41 4d 08  4d 45 08 45 0a 20 20 20  |N.NA.AM.ME.E.   |
00000060  20 20 6c 08 6c 73 08 73  20 e2 80 93 20 6c 69 73  |  l.ls.s ... lis|

However, the input string "\-" is rendered as a plain ASCII minus sign,
even with -Tutf8:

$ mandoc -Tlocale ls.1 | hexdump -C | head -n 70 | tail -n 3 
00000430  0a 0a 20 20 20 20 20 54  68 65 20 6f 70 74 69 6f  |..     The optio|
00000440  6e 73 20 61 72 65 20 61  73 20 66 6f 6c 6c 6f 77  |ns are as follow|
00000450  73 3a 0a 0a 20 20 20 20  20 2d 08 2d 31 08 31 20  |s:..     -.-1.1 |

If i understand correctly, that is the usual typographical convention
in roff typesetting.

> At some point (sorry, I don't remember details, not even if it was
> mandoc or groff) I had the annoying state where 'man foo' replaced
> dashes with some UTF-8 dash that the shell didn't accept as when
> pasting it in a shell.

Yes, that can happen.

Actually, groff does exactly the same, and it does so by default:

$ echo $LC_CTYPE
de_DE.UTF-8
$ nroff -mandoc -c ls.1 | hexdump -C | head -n 7 | tail -n 2      
00000050  4e 08 4e 41 08 41 4d 08  4d 45 08 45 0a 20 20 20  |N.NA.AM.ME.E.   |
00000060  20 20 6c 08 6c 73 08 73  20 e2 80 93 20 6c 69 73  |  l.ls.s ... lis|
$ nroff -mandoc -c ls.1 | hexdump -C | head -n 70 | tail -n 3      
00000430  0a 0a 20 20 20 20 20 54  68 65 20 6f 70 74 69 6f  |..     The optio|
00000440  6e 73 20 61 72 65 20 61  73 20 66 6f 6c 6c 6f 77  |ns are as follow|
00000450  73 3a 0a 0a 20 20 20 20  20 2d 08 2d 31 08 31 20  |s:..     -.-1.1 |
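
To check whether formatted output contains characters that a shell
will not accept when pasted, one can strip the backspace overstrike
sequences and list what remains outside ASCII.  A minimal Python
sketch follows; the helper names strip_overstrike and non_ascii are
made up for this illustration.

```python
import re


def strip_overstrike(text):
    """Remove 'X<backspace>' pairs used for bold and underline,
    similar in effect to col -b for simple cases."""
    return re.sub(r".\x08", "", text)


def non_ascii(text):
    """Return the sorted list of distinct non-ASCII characters."""
    return sorted({ch for ch in text if ord(ch) > 127})


# Text corresponding to the two dumps above: the NAME line,
# and the start of the options list.
name_line = "  l\bls\bs \u2013 lis"
option = "-\b-1\b1 "

print(repr(strip_overstrike(name_line)), non_ascii(name_line))
print(repr(strip_overstrike(option)), non_ascii(option))
```

The first sample reports the en dash, while the option text comes
out as a plain ASCII "-1" that is safe to paste.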

Yours,
  Ingo
--
 To unsubscribe send an email to discuss+unsubscribe@mdocml.bsd.lv
