discuss@mandoc.bsd.lv
From: Ingo Schwarze <schwarze@usta.de>
To: mario.blaettermann@gmail.com
Cc: discuss@mandoc.bsd.lv
Subject: Re: HTML output: section headers with diacritics not in table of contents
Date: Fri, 25 Mar 2022 13:27:00 +0100
Message-ID: <Yj21FDsRyhI4eOvN@asta-kit.de>
In-Reply-To: <CAHi0vA_pfSMOHCd+JuG0efwz=WYUDs2N4=Mgv+BzHcP62T-jAw@mail.gmail.com>

Hi Mario,

Mario Blättermann wrote on Thu, Mar 24, 2022 at 06:13:23PM +0100:

> I recently switched from GNU man-db to mandoc. It's really a big
> step ahead, especially regarding the creation of HTML pages, but it
> has its own peculiarities …
> 
> For creating a HTML man page I use the following command:
> 
> mandoc -T html -O toc ./manpage.1 > manpage.1.html

You should really use the -O style=... and -O man=... options in
addition to the options you are already using.  Without "style",
CSS support is next to absent; no real style sheet is linked to,
and only a minimal style sheet is embedded with <style>, so minimal
that many features cannot work.  Without "man", you get no hyperlinks
from .Xr macros.

> This works so far for English man pages. For man pages in other
> languages, I stumbled upon problems with creating toc entries.

Regarding -O toc, you should be aware that a number of senior OpenBSD
developers hated the feature so much that i disabled it completely
on man.openbsd.org.  They argue the feature is superfluous, noisy,
and the whole idea of prefixing a TOC to a manual page is misguided.

Being disabled by default, the feature matures only very slowly, and
i have to depend on your feedback, and on other people using mandoc
in a similar way as you are using it, to slowly improve the -O toc
feature.

> For example, the "SYNOPSIS" is "ÜBERSICHT" in German,

What a terrible mistranslation...

You know, German happens to be my native language, and in
general English usage, "synopsis" can indeed mean "Zusammenfassung,
Kurzfassung, Uebersicht", but in a manual page, "SYNOPSIS" really means
"short syntax display", so an exact name for the section would be
"Kurzdarstellung der Syntax" or "Syntax-Kurzdarstellung" or some
similar wording; "Uebersicht" with no further qualification is badly
misleading.

Such mistranslations obviously not only happen for reserved words
like "SYNOPSIS", but also in the main text of manual pages.
That's why i hate translated manual pages so much.  Reading German
manual pages, i usually find them pretty unintelligible.

Multiple times, i talked to Japanese software developers and even
though fluency in English is much less common in Japan than, say, in
Sweden or the Netherlands or even in Germany, they consistently told
me that very few Japanese people use Japanese manual pages because
even with a limited understanding of English, Japanese programmers
tend to find out quickly that they are much better served by English
than by Japanese manual pages, so they tend to improve their English
reading skills until they understand what they need.

So while i'm not aggressively trying to *not* support translated manual
pages, i don't think translated manual pages are particularly relevant
either.

> and the "Ü" is displayed correctly,

Yes, mandoc aims to correctly handle LC_CTYPE=*.UTF-8.  If Unicode
characters are displayed incorrectly in UTF-8 or HTML output mode,
i consider that a bug because non-ASCII characters are also crucial
for the understanding of a limited number of English manual pages.
Here is the canonical example:

  https://man.openbsd.org/lgamma.3

> but the header is not clickable because it
> doesn't have a toc entry. You can see this in the Archlinux online man
> pages [1]; as you might know, "Archmanweb" uses Mandoc.
> https://man.archlinux.org/man/diff.1.de

Yes, i'm aware that Arch uses Mandoc :-).
https://mandoc.bsd.lv/ links to https://man.archlinux.org/
below "Documentation and help",
and so does https://mandoc.bsd.lv/ports.html,
saying that Arch uses mandoc for the web since 2021 Jan 14.


Handling non-ASCII tag content is non-trivial.
Similar to preconv(1), mandoc(1) needs to translate non-ASCII
input to roff(7) character escape sequences at input time.
So when trying to figure out whether the string "Uebersicht"
should be tagged, we have this situation:

  Breakpoint 2, post_SH (man=0x6e29d380000, n=0x6e29d351000)
      at /usr/src/usr.bin/mandoc/man_validate.c:312
  312     {
  (gdb) n
  316             nc = n->child;
  (gdb) 
  317             switch (n->type) {
  (gdb) 
  319                     tag = NULL;
  (gdb) 
  320                     deroff(&tag, n);
  (gdb) 
  321                     if (tag != NULL) {
  (gdb) p tag
  $2 = 0x6e29d35c520 "\\[u00DC]BERSICHT"

Now, how is the mandoc tagging module supposed to figure out that
\\[u00DC] is a letter?  Using iswalpha(3) would be awkward at best
given that mandoc does not use wchar_t at all, and even more so because
resolving the \\[u00DC] sequence only happens at the output stage, so
a wchar_t or char32_t form of the character isn't really available to
the tagging module.  Consequently, the tagging module treats \\[u00DC]
like any other roff(7) escape sequence, aborting tag processing.
Since tag generation gets aborted for the first letter of this word,
no tag is generated at all, which results in no HTML id= attribute
being generated for the h1 class="Sh" element, and hence no underlining
and no clickability.
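
The effect can be reproduced with a minimal sketch in C.  This is
not mandoc's actual tagging code; the function name sketch_tag() and
the rule "stop at the first backslash" are assumptions chosen only
to model the behaviour described above, where any roff(7) escape
sequence ends tag processing.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from mandoc: copy the leading run
 * of bytes that precedes the first backslash (i.e. the first
 * roff(7) escape sequence) into buf.
 * Returns the length of the resulting tag; 0 means "no tag".
 */
size_t
sketch_tag(const char *heading, char *buf, size_t bufsz)
{
	size_t	 len;

	if (bufsz == 0)
		return 0;
	/* Length of the prefix that contains no backslash. */
	len = strcspn(heading, "\\");
	if (len >= bufsz)
		len = bufsz - 1;
	memcpy(buf, heading, len);
	buf[len] = '\0';
	return len;
}
```

With the string from the gdb session, written "\\[u00DC]BERSICHT"
as a C literal, the result has length 0, so there is nothing to
use for an id= attribute.  A heading without escapes, such as
"SYNOPSIS", is copied in full.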

This points to a deeper problem.  

Section names in manual pages serve a double function.  On the one
hand, they communicate to the reader which content to expect in the
section.  That purpose is partly defeated by translating a manual
page because translations tend to be inconsistent.  Some manual pages
translate "SYNOPSIS" to "Syntax" in German; this particular one to
"Uebersicht".  So the effect that the reader immediately recognizes
the section name is lost, even if (and in particular when) the reader
is used to reading German manual pages.  (The same applies to the
main text, of course; translations of technical terms are usually
inconsistent and often awkward and misleading, making the text hard
to understand even for native speakers of the target language.)

On the other hand, section names serve a syntactic function,
modifying the interpretation of macros and other content in
the section in question.  That syntactic function is *completely*
lost by translating the manual page.  For example, mandoc
implements special handling for the following sections:

  SEC_SYNOPSIS      (11 special features)
  SEC_NAME          (7 special features)
  SEC_LIBRARY       (5 special features)
  SEC_DESCRIPTION   (2 special features)
  SEC_RETURN_VALUES (1 special feature)
  SEC_ERRORS        (2 special features)
  SEC_SEE_ALSO      (7 special features)
  SEC_AUTHORS       (8 special features)

All that special processing is completely lost in a translated manual page.
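
Why the special handling disappears can be shown with a small
sketch in C.  It assumes that a section is recognized by comparing
its title with a fixed list of English names; the enum, the table,
and the function sketch_a2sec() are hypothetical and cover only a
few of the sections listed above.

```c
#include <string.h>

/* Illustrative subset only; all names here are assumptions. */
enum sketch_sec {
	SKETCH_SEC_NAME,
	SKETCH_SEC_SYNOPSIS,
	SKETCH_SEC_DESCRIPTION,
	SKETCH_SEC_SEE_ALSO,
	SKETCH_SEC_CUSTOM	/* any title that is not recognized */
};

/* Same order as the enum above. */
static const char *const sketch_secnames[] = {
	"NAME", "SYNOPSIS", "DESCRIPTION", "SEE ALSO"
};

/* Map a section title to its identifier by exact comparison. */
enum sketch_sec
sketch_a2sec(const char *title)
{
	size_t	 i, n;

	n = sizeof(sketch_secnames) / sizeof(sketch_secnames[0]);
	for (i = 0; i < n; i++)
		if (strcmp(title, sketch_secnames[i]) == 0)
			return (enum sketch_sec)i;
	return SKETCH_SEC_CUSTOM;
}
```

Under that assumption, "SYNOPSIS" maps to SKETCH_SEC_SYNOPSIS,
whereas a translated title falls through to SKETCH_SEC_CUSTOM, and
code that checks for the synopsis section never triggers.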

> The German keyboard produces the letter "Ü" as a single character
> named "LATIN CAPITAL LETTER U WITH DIAERESIS", but there's a kind of
> splitted "Ü" available: U U+0055 LATIN CAPITAL LETTER U ‎̈ U+0308
> COMBINING DIAERESIS. If I change it in the Groff source, toc creation
> works fine using this splitted one.

Well, not really.  The two-character form of the umlaut is "U\\[u0308]"
instead of "\\[u00DC]", so the tag will be "U", which isn't really
useful either.

> Moreover, in the Vietnamese version of the same man page [2], even
> [2] https://man.archlinux.org/man/diff.1.vi
> more toc entries are missing. Obviously because multiple section
> headers start with "T", followed by diacritics, no toc entry is
> created for those. But interesting: TÓM TẮT doesn't have an entry, TÊN
> does have one. I can't imagine how Mandoc distinguishes between
> acceptable and unacceptable diacritics.

It doesn't.  If you have conflicting tag locations (in this case
two definitions of the term "T") and no other information which one
is more authoritative, the first one wins.  But in less(1), you
can still navigate both by typing ":t T" to navigate to the first
definition of the term "T", then hit the key "T" (next tag) to
navigate to the next definition of the term "T" (it is a bit awkward
that we ended up with the term "T" in this example, which clashes
with the less(1) command "T" :).
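
The "first one wins" rule can be sketched in C as follows.  This is
an illustration only, not mandoc's tag storage: the structures and
the function sketch_tag_put() are hypothetical, and the sketch
models just the question of which definition a single link target
resolves to, not the list of all locations that less(1) can step
through.

```c
#include <string.h>

#define SKETCH_MAXTAGS 16

struct sketch_tag_entry {
	const char	*name;	/* tag string, for example "T" */
	int		 line;	/* line of the first definition */
};

struct sketch_tag_table {
	struct sketch_tag_entry	 e[SKETCH_MAXTAGS];
	int			 n;	/* number of entries in use */
};

/*
 * Record a tag definition.  Returns the line that the tag resolves
 * to after the call: the new line if the name was unknown, or the
 * earlier line if the name was already taken (first one wins).
 * Returns -1 if the table is full.
 */
int
sketch_tag_put(struct sketch_tag_table *t, const char *name, int line)
{
	int	 i;

	for (i = 0; i < t->n; i++)
		if (strcmp(t->e[i].name, name) == 0)
			return t->e[i].line;
	if (t->n == SKETCH_MAXTAGS)
		return -1;
	t->e[t->n].name = name;
	t->e[t->n].line = line;
	t->n++;
	return line;
}
```

If two headings both reduce to the tag "T", the second call
returns the line of the first heading, which matches the
observation that only one of the two headings gets an entry.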

> The described behavior is the same with a pure Mandoc on my local
> system and with Archmanweb. However, the developers of Debiman
> obviously found a solution [3], maybe unconsciously …? In any case,
> [3] https://manpages.debian.org/bullseye/manpages-de/diff.1.de.html
> their Mandoc is wrapped in a Go-based environment. Besides some extra
> features Archmanweb doesn't have (for example, better detection of
> cross-references to other man pages if they are not formatted as
> such), the toc creation works, even for the Vietnamese version [4].
> [4] https://manpages.debian.org/unstable/manpages-vi/diff.1.vi.html.gz

The right-hand sidebar in manpages.debian.org is not coming from
mandoc.  I suspect that is generated by some Go code.

> Any idea what is wrong? Well, first I thought the problem is on my
> machine, but Archmanweb shows the same behavior. As a workaround, I
> could produce a few more toc entries by replacing "Ü" with "Ü" and
> similar, but as long as I don't know what rules Mandoc applies
> internally, it's almost impossible to fix. To mention, as one of the
> maintainers of the manpages-l10n project [5], I have to maintain many
> [5] https://salsa.debian.org/manpages-l10n-team/manpages-l10n
> languages, not only my own one …

Wow, you are fluent in many languages?  That's impressive.  :-)

> I consider the online collections just as important as the local
> versions, especially for linking to a specific man page section or
> subsection in email or web,

Yes, that is indeed helpful when answering user questions on mailing lists.
Fortunately, the links that really matter typically work even with
translated manual pages because the syntax elements that users need
to learn are usually *not* translated and their English names usually
contain ASCII characters only, for example

  https://man.archlinux.org/man/diff.1.vi#u
  https://man.archlinux.org/man/diff.1.vi#help

> and for searching in man pages which are not installed locally.
> Any help with solving this problem would be appreciated.

Right now, i'm not yet sure what to do about tagging of words that
involve non-ASCII characters.  Maybe we can figure something out,
but how?  Hmmm...
Maybe mandoc should treat any \\[uXXXX] sequence as a letter for
the purposes of tagging?  The code needed for that will look rather
awkward though, and even when implemented perfectly, the tags will
be UTF-8 rather than ASCII-encoded.  Would links like

  https://man.archlinux.org/man/diff.1.de#%C3%9CBERSICHT

really be all that useful?  What do people think?
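
For reference, here is a minimal sketch in C of how such a link
fragment could be derived from a tag containing \[uXXXX] sequences.
It is not a proposed patch; the names sketch_utf8() and
sketch_fragment() are hypothetical, only the four-digit form
\[uXXXX] is handled, and plain bytes are copied without any further
URL escaping.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Encode a code point below U+10000 as UTF-8; returns 1 to 3. */
static size_t
sketch_utf8(unsigned long cp, unsigned char *out)
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	out[0] = (unsigned char)(0xE0 | (cp >> 12));
	out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	out[2] = (unsigned char)(0x80 | (cp & 0x3F));
	return 3;
}

/*
 * Turn a tag with \[uXXXX] escapes into a percent-encoded fragment.
 * Returns 0 on success, -1 on any other escape sequence, on a
 * surrogate code point, or if buf is too small.
 */
int
sketch_fragment(const char *tag, char *buf, size_t bufsz)
{
	unsigned char	 u8[3];
	unsigned long	 cp;
	size_t		 i, n, pos;

	if (bufsz == 0)
		return -1;
	pos = 0;
	while (*tag != '\0') {
		if (*tag != '\\') {
			/* Plain byte: copy it unchanged. */
			if (pos + 1 >= bufsz)
				return -1;
			buf[pos++] = *tag++;
			continue;
		}
		/* Accept exactly: backslash, "[u", 4 hex digits, "]". */
		if (strncmp(tag, "\\[u", 3) != 0)
			return -1;
		for (i = 3; i < 7; i++)
			if (!isxdigit((unsigned char)tag[i]))
				return -1;
		if (tag[7] != ']')
			return -1;
		cp = strtoul(tag + 3, NULL, 16);
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return -1;
		n = sketch_utf8(cp, u8);
		if (pos + 3 * n >= bufsz)
			return -1;
		/* Emit each UTF-8 byte as %XX. */
		for (i = 0; i < n; i++)
			pos += (size_t)snprintf(buf + pos, bufsz - pos,
			    "%%%02X", (unsigned int)u8[i]);
		tag += 8;
	}
	buf[pos] = '\0';
	return 0;
}
```

For the tag from the gdb session, this yields the fragment
%C3%9CBERSICHT used in the URL above.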

Yours,
  Ingo

