From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mailout.scc.kit.edu (mailout.scc.kit.edu [129.13.185.202])
    by krisdoz.my.domain (8.14.5/8.14.5) with ESMTP id s1F9guTw000170
    for ; Sat, 15 Feb 2014 04:42:57 -0500 (EST)
Received: from hekate.usta.de (asta-nat.asta.uni-karlsruhe.de [172.22.63.82])
    by scc-mailout-02.scc.kit.edu with esmtp (Exim 4.72 #1)
    id 1WEblr-000232-PY; Sat, 15 Feb 2014 10:42:51 +0100
Received: from donnerwolke.usta.de ([172.24.96.3])
    by hekate.usta.de with esmtp (Exim 4.77) (envelope-from )
    id 1WEblr-0002pl-OJ; Sat, 15 Feb 2014 10:42:51 +0100
Received: from iris.usta.de ([172.24.96.5] helo=usta.de)
    by donnerwolke.usta.de with esmtp (Exim 4.72) (envelope-from )
    id 1WEblr-0001pZ-Lw; Sat, 15 Feb 2014 10:42:51 +0100
Received: from schwarze by usta.de with local (Exim 4.77) (envelope-from )
    id 1WEblr-0004zj-Em; Sat, 15 Feb 2014 10:42:51 +0100
Date: Sat, 15 Feb 2014 10:42:51 +0100
From: Ingo Schwarze
To: Thomas Klausner
Cc: discuss@mdocml.bsd.lv
Subject: Re: FWD: man.conf mandoc -Tlocale
Message-ID: <20140215094251.GA24366@iris.usta.de>
References: <20140214130647.GF20867@iris.usta.de>
    <20140215084309.GA14964@danbala.tuwien.ac.at>
X-Mailinglist: mdocml-discuss
Reply-To: discuss@mdocml.bsd.lv
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140215084309.GA14964@danbala.tuwien.ac.at>
User-Agent: Mutt/1.5.21 (2010-09-15)

Hi Thomas,

Thomas Klausner wrote on Sat, Feb 15, 2014 at 09:43:09AM +0100:
> On Fri, Feb 14, 2014 at 02:06:47PM +0100, Ingo Schwarze wrote:

>> in OpenBSD, we are discussing to move to mandoc(1) default
>> from -Tascii to -Tlocale, see the mail on
>> below.
>>
>> How do you feel about that idea, in particular regarding other
>> operating systems like DragonFly, NetBSD, FreeBSD and from the
>> perspective of the pkgsrc packaging system?

> I've tried this on the NetBSD man ls(1) man page with
> LC_CTYPE=de_DE.UTF-8 and didn't see a difference.
>
> # man ls > ls.default
> man: Formatting manual page...
> # mandoc -Tlocale /usr/share/man/man1/ls.1 > ls.locale
> # diff ls.*
> #
>
> Ideas why, or is this expected?

No, it is not expected.

I just retried this sequence of commands on OpenBSD.
It works as expected using any of the following mandoc binaries:

 - the one built from OpenBSD base
   (built using the OpenBSD build system, not including compat glue)
 - the one built from mdocml.bsd.lv HEAD
   (built using the portable build system, including compat glue)
 - the one built from the mdocml.bsd.lv VERSION_1_12 branch
   (built using the portable build system, including compat glue)

When running with -Tlocale, all three mandoc binaries produce
output where the "``" and "''" double quotes in the NetBSD ls(1)
manual page (checked out from NetBSD CVS, src/bin/ls/ls.1) are
each rendered as a single UTF-8 character, specifically:

  $ locale
  LANG=
  LC_COLLATE="C"
  LC_CTYPE=de_DE.UTF-8
  LC_MONETARY="C"
  LC_NUMERIC="C"
  LC_TIME="C"
  LC_MESSAGES="C"
  LC_ALL=
  $ mandoc -Tlocale ls.1 | hexdump -C | grep -A1 digit
  00000470  63 20 64 69 67 69 74 20  e2 80 9c 6f 6e 65 e2 80  |c digit ...one..|
  00000480  9d 2e 29 20 46 6f 72 63  65 20 6f 75 74 70 75 74  |..) Force output|

To debug your problem, I'd suggest first finding out what exactly
is broken for you.  Does -Tutf8 output differ from -Tascii output?

If output is not different even with -Tutf8, UTF-8 output itself
is likely to be broken, as opposed to locale detection.
In that case, I'd recommend checking what the value of the USE_WCHAR
preprocessor #define is while compiling the file term_ascii.c.

If -Tutf8 output does differ, locale detection is likely to be broken.
In that case, I'd recommend using gdb(1) to run "mandoc -Tlocale ls.1"
while LC_CTYPE=de_DE.UTF-8 is set and finding out, in the file
term_ascii.c, function ascii_init(), what the value of the local
variable "v" is right after this function call:

  setlocale(LC_ALL, "")

> One thing I remember being broken at some point: Does this still allow
> examples to be copied, or do we have to be extra careful about marking
> them up then?

Yes.  Plain '-' as an input character is rendered as a UTF-8 dash
(e2 80 93 below):

  $ mandoc -Tlocale ls.1 | hexdump -C | head -n 7 | tail -n 2
  00000050  4e 08 4e 41 08 41 4d 08  4d 45 08 45 0a 20 20 20  |N.NA.AM.ME.E.   |
  00000060  20 20 6c 08 6c 73 08 73  20 e2 80 93 20 6c 69 73  |  l.ls.s ... lis|

However, the input string "\-" is rendered as a plain ASCII
minus sign, even with -Tutf8:

  $ mandoc -Tlocale ls.1 | hexdump -C | head -n 70 | tail -n 3
  00000430  0a 0a 20 20 20 20 20 54  68 65 20 6f 70 74 69 6f  |..     The optio|
  00000440  6e 73 20 61 72 65 20 61  73 20 66 6f 6c 6c 6f 77  |ns are as follow|
  00000450  73 3a 0a 0a 20 20 20 20  20 2d 08 2d 31 08 31 20  |s:..     -.-1.1 |

If I understand correctly, that is the usual typographical convention
in roff typesetting.

> At some point (sorry, I don't remember details, not even if it was
> mandoc or groff) I had the annoying state where 'man foo' replaced
> dashes with some UTF-8 dash that the shell didn't accept as when
> pasting it in a shell.

Yes, that can happen.
Actually, groff does exactly the same, and it does so by default:

  $ echo $LC_CTYPE
  de_DE.UTF-8
  $ nroff -mandoc -c ls.1 | hexdump -C | head -n 7 | tail -n 2
  00000050  4e 08 4e 41 08 41 4d 08  4d 45 08 45 0a 20 20 20  |N.NA.AM.ME.E.   |
  00000060  20 20 6c 08 6c 73 08 73  20 e2 80 93 20 6c 69 73  |  l.ls.s ... lis|
  $ nroff -mandoc -c ls.1 | hexdump -C | head -n 70 | tail -n 3
  00000430  0a 0a 20 20 20 20 20 54  68 65 20 6f 70 74 69 6f  |..     The optio|
  00000440  6e 73 20 61 72 65 20 61  73 20 66 6f 6c 6c 6f 77  |ns are as follow|
  00000450  73 3a 0a 0a 20 20 20 20  20 2d 08 2d 31 08 31 20  |s:..     -.-1.1 |

Yours,
  Ingo
--
 To unsubscribe send an email to discuss+unsubscribe@mdocml.bsd.lv
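
The byte sequences quoted in the hexdumps of this mail can be checked independently of mandoc and groff. What follows is only a minimal sketch in Python, using nothing but the bytes shown above; the helper name describe() is hypothetical and is not part of mandoc, groff, or any other tool mentioned here.

```python
import unicodedata


def describe(hex_bytes):
    """Decode space-separated hex bytes as UTF-8 and
    name every non-ASCII character that results."""
    text = bytes.fromhex(hex_bytes).decode("utf-8")
    return [
        "U+%04X %s" % (ord(ch), unicodedata.name(ch))
        for ch in text
        if ord(ch) > 0x7F
    ]


# Quotes around "one", starting at offset 0x478 of the -Tlocale output.
print(describe("e2 80 9c 6f 6e 65 e2 80 9d"))

# The dash in the NAME line, at offset 0x69.
print(describe("e2 80 93"))

# The "\-" option dash with its backspace overstrike: plain ASCII only.
print(describe("2d 08 2d 31 08 31"))
```

An empty list for the part of the -Tlocale output that holds the ``one'' quotes would match the symptom reported above, namely output that is still plain ASCII.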