X-Mailinglist: mandoc-discuss
Reply-To: discuss@mandoc.bsd.lv
From: Michael Stapelberg
Date: Thu, 24 Mar 2022 18:33:50 +0100
Subject: Re: HTML output: section headers with diacritics not in table of contents
To: discuss@mandoc.bsd.lv

On Thu, 24 Mar 2022 at 18:13, Mario Blättermann <mario.blaettermann@gmail.com> wrote:

> Hello,
>
> I recently switched from GNU man-db to mandoc. It's really a big
> step ahead, especially regarding the creation of HTML pages, but it
> has its own peculiarities …
>
> To create an HTML man page I use the following command:
>
> mandoc -T html -O toc ./manpage.1 > manpage.1.html
>
> This works so far for English man pages. For man pages in other
> languages, I stumbled upon problems with creating toc entries. For
> example, the "SYNOPSIS" is "ÜBERSICHT" in German, and the "Ü" is
> displayed correctly, but the header is not clickable because it
> doesn't have a toc entry. You can see this in the Archlinux online man
> pages [1]; as you might know, "Archmanweb" uses Mandoc.
>
> The German keyboard produces the letter "Ü" as a single character
> named "LATIN CAPITAL LETTER U WITH DIAERESIS", but there's a kind of
> split "Ü" available: U+0055 LATIN CAPITAL LETTER U followed by U+0308
> COMBINING DIAERESIS. If I change it in the Groff source, toc creation
> works fine using this split one.
>
> Moreover, in the Vietnamese version of the same man page [2], even
> more toc entries are missing. Apparently, because multiple section
> headers start with "T" followed by letters with diacritics, no toc
> entry is created for those. Interestingly, TÓM TẮT doesn't have an
> entry, while TÊN does have one. I can't imagine how Mandoc
> distinguishes between acceptable and unacceptable diacritics.
>
> The described behavior is the same with a pure Mandoc on my local
> system and with Archmanweb. However, the developers of Debiman
> obviously found a solution [3], maybe unconsciously …? In any case,
> their Mandoc is wrapped in a Go-based environment. Besides some extra
> features Archmanweb doesn't have (for example, better detection of
> cross-references to other man pages if they are not formatted as
> such), the toc creation works, even for the Vietnamese version [4].

Hello, I’m the author of debiman :)

The reason why it uses a different TOC implementation is historical:
debiman introduced a TOC in 2017, whereas mandoc itself only gained
-O toc in 2018.

I’m glad to hear that our code is Unicode-clean in that regard.
Good Unicode/internationalization support was one of the project’s
goals, and it is easy to accomplish in Go.

> Any idea what is wrong? Well, first I thought the problem was on my
> machine, but Archmanweb shows the same behavior. As a workaround, I
> could produce a few more toc entries by replacing "Ü" with "Ü" and
> similar, but as long as I don't know what rules Mandoc applies
> internally, it's almost impossible to fix. I should mention that, as
> one of the maintainers of the manpages-l10n project [5], I have to
> maintain many languages, not only my own …
>
> I consider the online collections just as important as the local
> versions, especially for linking to a specific man page section or
> subsection in email or web, and for searching in man pages which are
> not installed locally. Any help with solving this problem would be
> appreciated.
>
> [1] https://man.archlinux.org/man/diff.1.de
> [2] https://man.archlinux.org/man/diff.1.vi
> [3] https://manpages.debian.org/bullseye/manpages-de/diff.1.de.html
> [4] https://manpages.debian.org/unstable/manpages-vi/diff.1.vi.html.gz
> [5] https://salsa.debian.org/manpages-l10n-team/manpages-l10n
>
> Best Regards,
> Mario
> --
> To unsubscribe send an email to discuss+unsubscribe@mandoc.bsd.lv

--
Best regards,
Michael
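As an illustration of the two spellings of "Ü" discussed in the quoted message, here is a minimal Python sketch using the standard-library unicodedata module. It only shows that the precomposed and the split form are different code point sequences and that Unicode normalization converts between them. It makes no claim about how mandoc or debiman handle section headings internally. The helper names (describe, to_decomposed, to_precomposed) are made up for this example.

```python
import unicodedata


def describe(text):
    """Return the code points of text and its length in UTF-8 bytes."""
    points = [f"U+{ord(ch):04X}" for ch in text]
    return points, len(text.encode("utf-8"))


def to_decomposed(heading):
    """Split precomposed letters into base letter + combining marks (NFD)."""
    return unicodedata.normalize("NFD", heading)


def to_precomposed(heading):
    """Combine base letter + combining marks where possible (NFC)."""
    return unicodedata.normalize("NFC", heading)


if __name__ == "__main__":
    # Section headings mentioned in the thread, written with escapes so
    # the source file itself stays plain ASCII.
    headings = ("\u00dcBERSICHT", "T\u00d3M T\u1eaeT", "T\u00caN")
    for heading in headings:
        split_form = to_decomposed(heading)
        print(heading)
        print("  as typed:", describe(heading))
        print("  split   :", describe(split_form))
```

The precomposed "Ü" is one code point (two bytes in UTF-8), while the split form is two code points (three bytes), so a tool that inspects headings byte by byte sees different input for what renders as the same letter.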


--
To unsubscribe send an email to discuss+unsubscribe@mandoc.bsd.lv