From: Mario Blättermann
Date: Thu, 24 Mar 2022 18:13:23 +0100
Subject: HTML output: section headers with diacritics not in table of contents
To: discuss@mandoc.bsd.lv

Hello,

I recently switched from GNU man-db to mandoc. It's really a big step ahead, especially regarding the creation of HTML pages, but it has its own peculiarities …

To create an HTML man page, I use the following command:

    mandoc -T html -O toc ./manpage.1 > manpage.1.html

This works so far for English man pages. For man pages in other languages, I ran into problems with the creation of toc entries. For example, "SYNOPSIS" is "ÜBERSICHT" in German. The "Ü" is displayed correctly, but the header is not clickable because it doesn't have a toc entry. You can see this in the Arch Linux online man pages [1]; as you might know, "Archmanweb" uses Mandoc.

The German keyboard produces the letter "Ü" as a single character named U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS, but there is also a decomposed form of "Ü": U+0055 LATIN CAPITAL LETTER U followed by U+0308 COMBINING DIAERESIS. If I change the header to the decomposed form in the Groff source, toc creation works fine.
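To make the difference between the two forms concrete, here is a minimal Python sketch that prints the code points and UTF-8 bytes of both spellings of "Ü". The names `precomposed`, `decomposed`, and `describe` are only illustrative; the sketch shows what differs in the input and says nothing about what Mandoc does with it internally.

```python
import unicodedata

# The two spellings of "Ü" discussed above.
precomposed = "\u00dc"   # single code point, as produced by the German keyboard
decomposed = "U\u0308"   # plain "U" followed by a combining diaeresis


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


print(describe(precomposed))
print(describe(decomposed))

# The UTF-8 byte sequences differ: only the decomposed form starts
# with an ASCII byte.
print(precomposed.encode("utf-8"))
print(decomposed.encode("utf-8"))

# Unicode normalization converts between the two forms:
# NFD decomposes, NFC composes.
print(unicodedata.normalize("NFD", precomposed) == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

Both strings render identically in most fonts, so the difference is only visible at the code point or byte level.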
Moreover, in the Vietnamese version of the same man page [2], even more toc entries are missing. Several section headers there start with "T" followed by a letter with diacritics, and apparently no toc entry is created for some of them. Interestingly, "TÓM TẮT" doesn't have an entry, while "TÊN" does have one. I can't imagine how Mandoc distinguishes between acceptable and unacceptable diacritics.

The described behavior is the same with plain Mandoc on my local system and with Archmanweb. However, the developers of Debiman have obviously found a solution [3], perhaps unintentionally. In any case, their Mandoc is wrapped in a Go-based environment. Besides some extra features Archmanweb doesn't have (for example, better detection of cross-references to other man pages that are not formatted as such), the toc creation works, even for the Vietnamese version [4].

Any idea what is wrong? At first I thought the problem was on my machine, but Archmanweb shows the same behavior. As a workaround, I could produce a few more toc entries by replacing the precomposed "Ü" (U+00DC) with "U" followed by U+0308 COMBINING DIAERESIS, and similarly for other letters, but as long as I don't know what rules Mandoc applies internally, a complete fix is almost impossible. I should mention that, as one of the maintainers of the manpages-l10n project [5], I have to maintain many languages, not only my own …

I consider the online collections just as important as the local versions, especially for linking to a specific man page section or subsection in email or on the web, and for searching in man pages which are not installed locally. Any help with solving this problem would be appreciated.

[1] https://man.archlinux.org/man/diff.1.de
[2] https://man.archlinux.org/man/diff.1.vi
[3] https://manpages.debian.org/bullseye/manpages-de/diff.1.de.html
[4] https://manpages.debian.org/unstable/manpages-vi/diff.1.vi.html.gz
[5] https://salsa.debian.org/manpages-l10n-team/manpages-l10n

Best Regards,
Mario