On Thu, 24 Mar 2022 at 18:13, Mario Blättermann <mario.blaettermann@gmail.com> wrote:
Hello,

I recently switched from GNU man-db to mandoc. It's really a big
step forward, especially for generating HTML pages, but it
has its own peculiarities …

To create an HTML man page, I use the following command:

mandoc -T html -O toc ./manpage.1 > manpage.1.html

This works for English man pages. For man pages in other
languages, I ran into problems with the creation of toc entries. For
example, "SYNOPSIS" is "ÜBERSICHT" in German; the "Ü" is
displayed correctly, but the heading is not clickable because it
has no toc entry. You can see this in the Arch Linux online man
pages [1]; as you might know, "Archmanweb" uses mandoc.

The German keyboard produces the letter "Ü" as a single precomposed
character, U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS, but there is
also a decomposed form made of two code points: U+0055 LATIN CAPITAL
LETTER U followed by U+0308 COMBINING DIAERESIS. If I use the
decomposed form in the groff source, toc creation works fine.
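
To make the difference between the two forms of "Ü" concrete, here is
a minimal Python sketch using only the standard unicodedata module. It
says nothing about how mandoc handles either form; it only shows that
the two spellings are different code point sequences that Unicode
normalization (NFD/NFC) converts into each other.

```python
import unicodedata

precomposed = "\u00dc"   # single code point: U WITH DIAERESIS
decomposed = "U\u0308"   # two code points: U + COMBINING DIAERESIS


def describe(text):
    """Return the Unicode name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


# Both strings render the same, but they are not equal as strings.
print(precomposed == decomposed)

# NFD splits the precomposed letter; NFC recombines the pair.
print(unicodedata.normalize("NFD", precomposed) == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)

print(describe(precomposed))
print(describe(decomposed))
```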

Moreover, in the Vietnamese version of the same man page [2], even
more toc entries are missing. Apparently, because several section
headings start with "T" followed by letters with diacritics, no toc
entry is created for them. Interestingly, though, TÓM TẮT doesn't have
an entry, while TÊN does. I can't figure out how mandoc distinguishes
between acceptable and unacceptable diacritics.
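
One way to narrow this down is to look at the exact code points in
each heading. The following Python sketch extracts the .SH headings
from a man page source and lists their code points. The sample text is
a hypothetical stand-in, written with escapes so the form is
unambiguous; the real diff.1 translation may encode these headings
differently, so the script would have to be run on the actual file.

```python
import unicodedata


def section_headings(roff_source):
    """Return the argument text of every .SH line in a man(7) source."""
    headings = []
    for line in roff_source.splitlines():
        if line.startswith(".SH "):
            # Drop the macro name and optional surrounding quotes.
            headings.append(line[3:].strip().strip('"'))
    return headings


def codepoints(heading):
    """Return 'U+XXXX NAME' for every code point in heading."""
    return [
        "U+{:04X} {}".format(ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in heading
    ]


# Hypothetical sample, not copied from the real page.
sample = '.TH DIFF 1\n.SH T\u00caN\ndiff\n.SH "T\u00d3M T\u1eaeT"\n'

for heading in section_headings(sample):
    print(heading)
    for entry in codepoints(heading):
        print("   ", entry)
```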

The behavior described above is the same with plain mandoc on my local
system and with Archmanweb. However, the developers of Debiman
apparently found a solution [3], perhaps unintentionally …? In any
case, their mandoc is wrapped in a Go-based environment. Besides some
extra features Archmanweb doesn't have (for example, better detection
of cross-references to other man pages if they are not formatted as
such), the toc creation works, even for the Vietnamese version [4].

Hello, I’m the author of debiman :)
The reason why it uses a different TOC implementation is historical:
debiman introduced a TOC in 2017, whereas mandoc itself only gained -O toc in 2018.

I’m glad to hear that our code is Unicode-clean in that regard.
Good Unicode and internationalization support was one of the project’s
goals, and is easy to accomplish in Go.
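
As an illustration of what a Unicode-clean TOC can look like, here is
a minimal Python sketch. It is not debiman's code (which is written in
Go), and the names anchor_for and toc_html are made up for this
example. It relies on the fact that HTML5 ids may contain any
character except whitespace, so non-ASCII headings can be used as
anchors directly. Duplicate headings are not handled.

```python
import html
import re


def anchor_for(title):
    """Derive an HTML id from a heading by replacing whitespace runs."""
    return re.sub(r"\s+", "_", title.strip())


def toc_html(titles):
    """Build a simple <ul> table of contents linking to each heading id."""
    items = []
    for title in titles:
        items.append(
            '<li><a href="#{}">{}</a></li>'.format(
                html.escape(anchor_for(title), quote=True),
                html.escape(title),
            )
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


print(toc_html(["\u00dcBERSICHT", "T\u00d3M T\u1eaeT"]))
```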
 

Any idea what is wrong? At first I thought the problem was on my
machine, but Archmanweb shows the same behavior. As a workaround, I
could produce a few more toc entries by replacing the precomposed "Ü"
with its decomposed form (U followed by U+0308), and likewise for
similar characters, but as long as I don't know what rules mandoc
applies internally, it's almost impossible to fix properly. I should
mention that, as one of the maintainers of the manpages-l10n project
[5], I have to maintain many languages, not only my own …
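
The replacement workaround could be automated along these lines. This
is only a sketch: the function name decompose_headings is made up, and
it simply applies NFD normalization to .SH and .SS lines while leaving
the rest of the page untouched. Whether mandoc then emits a toc entry
for every heading is not established; as noted, the manual replacement
only recovered some of them.

```python
import unicodedata


def decompose_headings(roff_source):
    """Apply NFD to .SH/.SS lines only; leave all other lines as they are."""
    out = []
    for line in roff_source.splitlines(keepends=True):
        if line.startswith((".SH ", ".SS ")):
            line = unicodedata.normalize("NFD", line)
        out.append(line)
    return "".join(out)


# Hypothetical input: one heading and one body line, both precomposed.
page = ".SH \u00dcBERSICHT\n\u00fcber diff\n"
print(repr(decompose_headings(page)))
```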

I consider the online collections just as important as the local
versions, especially for linking to a specific man page section or
subsection in email or on the web, and for searching man pages that
are not installed locally. Any help with solving this problem would be
appreciated.

[1] https://man.archlinux.org/man/diff.1.de
[2] https://man.archlinux.org/man/diff.1.vi
[3] https://manpages.debian.org/bullseye/manpages-de/diff.1.de.html
[4] https://manpages.debian.org/unstable/manpages-vi/diff.1.vi.html.gz
[5] https://salsa.debian.org/manpages-l10n-team/manpages-l10n

Best Regards,
Mario

--
Best regards,
Michael